refactor(web): extract assistant lifecycle state machine into a non-React service by ashleeradka · Pull Request #32654 · vellum-ai/vellum-assistant

ashleeradka · 2026-05-30T01:42:11Z

Closes LUM-2062. Follow-up to #32641.

Why

#32641 published assistant lifecycle through Zustand and killed the outlet-context plumbing, but left the state machine itself React-shaped. The hook held all the domain state in useRefs and useState, defined the actions inside useCallback closures over React-bound primitives (useQueryClient, useMutation), and registered them into the store via a passive useEffect. That shape produced every Codex P1 we caught during #32641 cleanup:

The action-registry pattern (actions can't be defined inline in create() because they close over React state, so they have to be registered via useEffect after mount).
The NOOP_CHECK / NOOP_VOID defaults that cover the brief registration race.
The cold-mount capture bug — render-time .getState() captures the no-op because the parent's register-actions effect hasn't committed yet.
The selector-flip bug — .use.X() re-renders on registration, which cleans up and re-runs any effect listing the action as a dep, which can re-execute side effects (duplicate hatch POST in the cloud path).

The PR description on #32641 explicitly called out the action-registry pattern as the trade we made and pointed at LUM-2062 as the architectural finish line. This is that.

What changed

The state machine moves out of React entirely.

Concern	Before (#32641 final)	After (this PR)
State machine	`useAssistantLifecycle` (676 lines): `useState` for `assistantState`, 6 `useRef`s for counters/generation/in-flight flags, action `useCallback`s, register-actions `useEffect`, watchdog `useEffect`	`lifecycle-service.ts` (620 lines): plain TypeScript class, module-level singleton, instantiated at import
Hook	676 lines	99 lines — `useQueryClient`, `useAssistantQuery`, two `useEffect`s that push inputs and poll results into the service
Store	110 lines, holds state + 3 action fields + NOOP defaults + `registerImperativeActions` action	39 lines, holds `assistantState` only
Consumer call shape	`useAssistantLifecycleStore.getState().checkAssistant()` (sometimes wrapped in `useCallback` to defeat identity flips)	`lifecycleService.checkAssistant()` — service methods are stable from instant zero

The state machine semantics don't change. Same phases, same retry budgets, same recovery paths, same gateway-auth / local-mode branches, same post-hatch cache seed, same onboarding-redirect coordination. The extraction is purely structural.

Benefits

Direct unit tests for the state machine. lifecycle-service.test.ts exercises the server-state projection (active and self-hosted result handling), the auto-hatch cascade (the nonprod / retired / vanilla-managed branches), the logout reset, and the gateway-auth short-circuit — no renderHook, no React tree. Recovery paths and retry budgets are reachable from tests for the first time.
Cold-mount race goes away. Service methods are stable from import, so consumers can call them inline without selector subscriptions or wrapping. The duplicate-hatch bug Codex caught in refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641 cleanup is structurally impossible.
No more .use.X() vs .getState() decision for lifecycle actions. Either pattern would have worked equally well before — neither does anything special now, because the action isn't on the store. Just import the service and call methods.
Single state system. Counters, generation, in-flight guards no longer live in useRef parallel to the store. Service fields are the only place state lives; the store mirrors assistantState for React reactivity.
The hook is small enough to read in one sitting. 99 lines including the docstring and the inputs interface.

Why safe

Behavioral parity. Every transition in the service runs in the same conditions as the previous hook. Every setSelfHostedConnection(...), every setQueryData cache seed, every invalidateQueries backoff fires under the same predicate.
Integration tests pass unmodified. queries.test.ts, onboarding-lifecycle-sync.test.tsx, use-event-bus-init.test.tsx — the existing behavioral coverage of the hook. The only test changes are mock targets (@/assistant/lifecycle-store → @/assistant/lifecycle-service in two test files; both tests still assert the same behavior).
tsc --noEmit, lint, audit:cross-domain, test:ci all clean. 181/181 test files pass (added one new unit-test file).
No new cross-domain imports. Service lives in src/assistant/ next to the existing files; consumers are unchanged in scope.

References

Zustand — Reading state outside components — .getState() and .setState() work on module-level Zustand stores without React. Used by the service to write assistantState.
TkDodo — Working with Zustand — server-state / client-state boundary; the service straddles it (derives AssistantState from server polling results but layers in client retry/recovery logic that doesn't reduce to a query).
apps/web/docs/STATE_MANAGEMENT.md — the canonical-migration pointer extended in refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641. This PR is the natural next step but doesn't require new doc rules; the lifecycle-store docstring nuance ("subscribe vs .getState() for the registered actions") is now deleted because the trap no longer exists.

What I considered and did NOT do, and why

Did not switch useAssistantLifecycleStore to a zustand/vanilla store or useSyncExternalStore. Either would work — the service would own the vanilla store as its private state container, and useAssistantLifecycleStore would be a thin React wrapper around it. Kept the create<...> + createSelectors shape because (a) it's consistent with every other store in the codebase, (b) useAssistantLifecycleStore.setState({...}) from the service is the same operation the previous code did via the store's own action, just inline, and (c) the migration was big enough already.

Did not own the polling timer inside the service. TanStack Query's refetchInterval (in useAssistantQuery) still drives the 3-second poll cadence; the service just receives results via the hook's useEffect. Doing in-service polling would lose TanStack Query's dedup with other /assistant/ subscribers (sidebar header, identity panel) and the integration with the setQueryData post-hatch seed. Out of scope for this ticket.

Did not move the logout-clear into a synchronous call at the logout site (the vex-bot non-blocking observation on #32641). Same code as before — service resets when respondToInputs() sees !isLoggedIn. Moving it earlier in the call chain (e.g. into the auth-store logout action calling lifecycleService.respondToInputs() directly) is a one-line follow-up if you ever see flake; not in scope here.

Did not inject the API functions (getAssistant, hatchAssistant, retireAssistantById) through setInputs. The service imports them at module level. Tests mock via mock.module(...) — same pattern as everything else in the codebase. Constructor injection would have been the "more textbook" non-React shape but adds plumbing without solving a real problem for one consumer.

Did not extract a LifecycleStateMachine interface separate from the service. Could have separated "pure transitions" (a state-table-like function) from "side effects" (the cache seeding, the self-hosted connection priming). Considered briefly; the recovery and post-hatch paths are so interleaved with side effects that separating them just spreads the logic across two files. Held off.

Did not delete useAssistantLifecycleStore in favor of subscribing directly to a service emitter. The store is still the React tree's reactive view; the service writes to it. Two-layer separation (service for logic, store for reactivity) is the same trade useEventBusStore makes for the event bus. Consistent across the codebase.

Root cause analysis

How did the code get into this state?

#32641 extracted lifecycle into Zustand stores, which fixed the outlet-context plumbing problem. But the hook kept owning the state machine through React primitives (useRef, useCallback, useEffect) because that was the smallest diff that achieved the outlet-context fix. The trade was explicit in the #32641 PR description, with this ticket as the planned follow-up.

The deeper cause is structural: the lifecycle state machine is domain logic with no React-specific dependencies. React was the wrong shape for it from the start, but it was also the lowest-friction shape when the hook was small. As the machine grew (retry budgets, recovery paths, generation counter, watchdog), the React-bound shape became increasingly costly.

What mistakes or decisions led to it?

Built the state machine inside the hook to begin with. The first version was useState + useEffect — small, simple, idiomatic React. Each new responsibility (retry budgets, recovery, watchdog) was added inside the hook because the hook already existed. No one stopped to ask "should this live outside React?"
Stopped at the Zustand-store refactor in refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641. That was a real architectural improvement, but it left the action-registry pattern in place. The PR called out the trade, but a "stop at the trade" PR will accumulate workarounds (NOOP defaults, selector-vs-getState debates, cold-mount bugs) if not finished.

Were there warning signs we missed?

The lifecycle hook needed 6 useRefs just to hold private state. Refs are a code smell for state that's outgrowing the hook shape.
hatchMutateRef, retireMutateRef — capturing useMutation's .mutateAsync in refs because the mutation object itself was unstable. That's React fighting React.
The isRetiredRef / isNonProductionRef / onRedirectRef / resolveOnboardingRedirectRef pattern — capturing every external value in a ref to keep useCallback identities stable. Each ref was a workaround for the fact that the actions wanted to be stable but the hook couldn't make them so.
Three Codex P1s in refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641 cleanup, all variants of "action identity does something unexpected at the React/non-React boundary." The frequency was the signal.

What can we do to prevent this pattern from recurring?

Ask "does this need React?" before building state machines in hooks. If the answer is "the state machine doesn't, but reactivity does," that's the service-with-store-output shape this PR demonstrates. Build that from the start rather than landing in it.
Treat useRef-as-private-state count as a smell. A handful is normal. Six refs is a hook trying to be a class.
Don't stop at "trade" PRs unless the next step is filed. refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641 did this right (LUM-2062 was filed before merge). The systemic issue is when "trade" PRs get filed without the finish line tracked — the workaround becomes permanent.

AGENTS.md changes

No new rule. The outlet-context rule from #32641 stays in place. The lifecycle-store docstring nuance about "selector subscription vs .getState() for the registered actions" is deleted, not preserved — the trap no longer exists, so leaving the warning would mislead.

The pattern this PR demonstrates (hook-as-React-wiring + module-level service-as-state-machine + Zustand-as-reactive-output) is worth a brief mention in STATE_MANAGEMENT.md if a second example appears. For now the in-file docstrings (lifecycle-service.ts header) carry the explanation.

https://claude.ai/code/session_01NjQuUAiXsS4DVEmKPsb1uE

Generated by Claude Code

Closes LUM-2062 — the architectural finish line that #32641 set up. The lifecycle state machine (hatch + initializing-recovery retry budgets, generation counter for stale-response handling, 5-minute stuck-init watchdog, gateway-auth short-circuit, post-hatch cache seed, onboarding-redirect coordination) now lives in `apps/web/src/assistant/lifecycle-service.ts` as a plain TypeScript class — module-level singleton, instantiated at import, React-independent. What the extraction unlocks: - **No action-registry pattern.** Actions are methods on the service, stable from instant zero. No `registerImperativeActions` effect, no `NOOP_CHECK` / `NOOP_VOID` defaults, no "use selectors not getState" docstring nuance, no "use getState not selectors" reversal of same. Consumers import `lifecycleService` and call methods directly. - **No private refs alongside store state.** All state (counters, generation, hatching flag, in-flight ids) lives as instance fields on the service. Single state system, no React/store duality. - **The watchdog moves out of React.** `transition()` arms / clears the 5-minute stuck-init timer based on the next state's kind — no more `initializingCycle` re-arming via useEffect. - **Direct unit tests.** `lifecycle-service.test.ts` exercises the server-state projection, the auto-hatch cascade with the nonprod / retired branches, the logout reset, and the gateway-auth short-circuit — all without `renderHook`, all asserting on `lifecycleService` + the two Zustand stores directly. The hook (`use-lifecycle.ts`) shrinks from 676 lines to 99 — just the React-bound wiring: `useQueryClient`, `useAssistantQuery`, two `useEffect`s that push inputs and poll results into the service. The store (`lifecycle-store.ts`) shrinks from 110 lines to 39 — just the `assistantState` field. No actions, no NOOP defaults. Consumers (`chat-page`, `hatching-screen`, `pre-chat-flow`, `use-event-bus-init`) switch from `useAssistantLifecycleStore.getState().checkAssistant()` to `lifecycleService.checkAssistant()`. The chat-page wrappers that existed only to defeat the action-registry identity flip (`useCallback(() => store.getState().X(), [])`) get to stay tidy because the service methods are stable from import. Tests: 181/181 pass, including 7 new direct unit tests for the service. `tsc --noEmit`, `lint`, and `audit:cross-domain` all clean. No new cross-domain imports. https://claude.ai/code/session_01NjQuUAiXsS4DVEmKPsb1uE

linear · 2026-05-30T01:42:14Z

LUM-2062

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 527f268a45

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-30T01:45:53Z

+    if (generation !== this.generation) return;
+    if (nextState.kind !== "active") {
+      this.transition(nextState);


Preserve the stuck-initializing watchdog across polls

When the assistant remains initializing, every 3-second poll reaches this path and calls transition({ kind: "initializing" }) again. transition() now clears and re-arms the initializing watchdog on every call, so continuous initializing poll results keep resetting the 5-minute timer and recoverStuckInitializingAssistant() never runs for a genuinely stuck hatch.

Useful? React with 👍 / 👎.

devin-ai-integration

Devin Review found 0 potential issues.

View 3 additional findings in Devin Review.

Two changes from the post-open audit: **Codex P1 — watchdog reset by every poll.** My initial `syncInitializingWatchdog` cleared + re-armed the 5-minute timer on every `transition()` call. While the assistant stayed `initializing`, every 3-second poll re-entered the projection path and called `transition({ kind: "initializing" })` again, which rearmed the timer — so the watchdog never expired and `recoverStuckInitializingAssistant()` would never fire for a genuinely stuck hatch. The original React implementation avoided this by keying the watchdog `useEffect` on `[assistantStateKind, initializingCycle]`; back-to-back same-kind transitions didn't re-run the effect, keeping the timer counting down. `initializingCycle` was bumped only inside `hatchAndCheck`, which is how a fresh hatch attempt restarted the clock. Rewriting `transition()` to edge-trigger (arm only on entry into `initializing`, clear only on exit) and adding an explicit `armInitializingWatchdog()` call inside `hatchAndCheck` to match the cycle-bump semantics. Net effect: same timer behavior as the pre-extraction code. **Audit DRY** — `applyServerStateUpdate` and `recoverStuckInitializingAssistant` duplicated the active and self_hosted projection logic. Extracted `projectActive(result)` and `projectSelfHosted(result)` private helpers. Preserved the intentional pre-existing asymmetry where the recovery path doesn't clear `hatchingVersion` on active landing (each call site still manages its own counter / flag bookkeeping). **New tests** for the bug Codex caught and the ticket-mandated retry-budget exhaustion: - `redundant initializing→initializing transitions do not reset the 5-minute clock` - `3 recoverable hatch failures in the auto-hatch poll loop surface as error` 181/181 test files pass. https://claude.ai/code/session_01NjQuUAiXsS4DVEmKPsb1uE

`applyServerResult` takes `GetAssistantResult`, whose `data` satisfies the full `Assistant` type. My watchdog test passed an object literal with just `{id, status}` — CI flagged the missing 13 required fields. Switched the test to call `lifecycleService.checkAssistant()` twice more instead of constructing the result by hand. The fake queryClient routes back through `getAssistantMock` (which has loosely-typed return values per its mock definition), so the test exercises the same projection path without constructing the full type. Also: added a note in `beforeEach` flagging that `mockClear` doesn't reset implementations — future mocked deps added here must re-set their baseline in this block or tests will silently inherit the previous test's stub.

ashleeradka · 2026-05-30T01:57:50Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aeaf269525

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-30T02:01:24Z


  const navigateToChatAfterLifecycleRefresh = useCallback(async () => {
-    await useAssistantLifecycleStore.getState().checkAssistant();
+    await lifecycleService.checkAssistant();


Guard lifecycle calls before root inputs are installed

When this page mounts directly with readOnboardingCompleted() already true, its child useEffect can run before RootLayout's passive effect calls lifecycleService.setInputs(). This direct singleton call then uses the service's default queryClient: null, so checkAssistant() catches a TypeError and publishes a spurious network-error lifecycle state before navigating; the previous store action had a no-op pre-registration default specifically for this window. Keep a no-op/ready guard or initialize the service before child effects can invoke it.

Useful? React with 👍 / 👎.

Codex P2: a child route's `useEffect` that calls a lifecycle action (e.g. `lifecycleService.checkAssistant()`) on a direct mount can fire BEFORE `RootLayout`'s passive effect has called `setInputs()` — children's effects commit before parent's in the same render cycle. The unguarded action tried to use the default `null` queryClient, threw a TypeError on `null.fetchQuery(...)`, caught it, and published a spurious network-error state. The previous store-based action API papered this over with NOOP_CHECK / NOOP_VOID defaults. The service had no equivalent. Adding a `ready` flag that flips true on the first `setInputs()` call. Public actions (`checkAssistant`, `retryAssistant`, `hatchVersion`, `respondToInputs`, `applyServerResult`) early-return when `!ready`. Matches the no-op-default behavior the old store API used to ship with, but now expressed explicitly on the service. New test asserts that calling actions before `setInputs` is a no-op and leaves the lifecycle store in its initial `loading` state. 181/181 test files pass.

The docstring said "Mirrors the no-op defaults the pre-service store action API used to ship with" — references a shape that no longer exists in the codebase. Rephrased to state the invariant the guard protects (children's effects committing before the parent's) without naming a previous implementation.

ashleeradka · 2026-05-30T02:16:03Z

@codex review

chatgpt-codex-connector · 2026-05-30T02:19:08Z

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

vex-assistant-bot

✦ APPROVE — reviewed at 25a2dcbd4e

Value: Closes the architectural finish line #32641's PR description pointed at. The action-registry pattern, NOOP defaults, render-time capture race, and selector-flip race — all four Codex P1s from #32641 dissolve because the state machine no longer lives inside React's render-and-effect timing. Plus: the machine is now directly testable without renderHook.

What this does:

New lifecycle-service.ts (642 lines) — module-level singleton class owning state machine, watchdog, retry budgets, generation counter, gateway-auth short-circuit, cache seeding, onboarding-redirect coordination.
use-lifecycle.ts shrinks 676 → 99 lines — pure wiring: pull React signals + push via setInputs, hand poll results via applyServerResult. No state, no callbacks.
lifecycle-store.ts shrinks 110 → 39 lines — assistantState only. The store is now strictly an observability surface; only the service writes.
Consumer call shape simplifies: useAssistantLifecycleStore.getState().checkAssistant() → lifecycleService.checkAssistant() across chat-page, hatching-screen, pre-chat-flow, use-event-bus-init.
New direct unit tests in lifecycle-service.test.ts exercise server-state projection (active + self-hosted), the auto-hatch cascade (nonprod / retired / vanilla-managed branches), the logout reset, and the gateway-auth short-circuit. No renderHook, no React tree.

Full analysis

Codex P1 (watchdog reset on every poll) — closed at `4a9aade767`

The original transition() cleared+re-armed the watchdog on every call, so back-to-back initializing → initializing poll results reset the 5-minute clock indefinitely and recoverStuckInitializingAssistant() could never fire on a genuinely stuck hatch. Fix is edge-triggered:

if (next.kind !== "initializing") {
  if (this.initializingTimeout) clearTimeout(...);
} else if (prevKind !== "initializing") {
  this.armInitializingWatchdog();
}

A repeat initializing → initializing is a no-op kind-wise — no clear, no re-arm. hatchAndCheck() calls armInitializingWatchdog() explicitly so a fresh hatch attempt still resets the clock. The inline comment spells out the invariant — high-leverage for the next maintainer.

Codex P2 (pre-mount lifecycle call → null queryClient) — closed at `26521f2162`

ready: boolean ships false, setInputs() is the only flip-to-true. Every public action (checkAssistant, respondToInputs, applyServerResult) early-returns if (!this.ready). So a child route whose useEffect commits before RootLayout's setInputs lands — the parent/child effect commit-order race — becomes a silent no-op instead of TypeError-on-null.fetchQuery followed by a spurious network-error state being published.

This is the structural equivalent of the NOOP_CHECK defaults the previous version shipped — same invariant, but the guard lives once on the class instead of being scattered across three action fields with NOOP fallbacks the store had to expose. The docstring on ready explicitly cites the race; future maintainers see why it's there.

Test architecture

Bun's mock.module at module scope + lifecycleService.__resetForTesting() in beforeEach/afterEach is the right pattern for a module-level singleton. The comment in beforeEach about why mockClear isn't enough (and why every new mock needs a fresh baseline) is doing real work — the kind of subtle leak that costs 30 minutes of debugging a flaky test.

The fact that auto-hatch cascade, retry-budget exhaustion, and the gateway-auth short-circuit are now reachable from a unit test — no renderHook, no fake timers, no time-travel reasoning about effect commit order — is the headline benefit of the extraction. The state machine being a class with __resetForTesting is the kind of testability boring frontend code rarely gets.

Observations (non-blocking)

useEffect dep array in use-lifecycle.ts:75–86 lists onRedirect and resolveOnboardingRedirect. If the parent constructs either inline per render, the effect re-runs every render → setInputs re-publishes identical values → respondToInputs re-fires. Most calls are early-return no-ops inside the service, so it's not a correctness bug — but worth confirming RootLayout wraps both in useCallback. The kind of thing the next render-perf pass will catch if it's drifted.
getLocalGatewayUrl mock at lifecycle-service.test.ts:41 is set up but I don't see it asserted in any test. Either dead mock or pending coverage for the local-gateway projection path. Drop it or add the assertion.
useCallback wrappers in chat-page.tsx:174–187 — still necessary because lifecycleService.checkAssistant is a normal class method (passing the bare reference loses this). Two cleaner shapes if you ever want to revisit: (a) define the actions as arrow-class-fields so this binds at construction; (b) ship pre-bound exports next to the singleton. Current form is fine — the comment above each wrapper could shrink now that the identity-flip rationale is gone, but not worth a follow-up commit.

Merge gate

CI 7/7 green at HEAD
Codex review at HEAD: "Didn't find any major issues. Hooray!"
Devin review: 0 potential issues
Both Codex inline findings (P1 watchdog, P2 input guard) verified closed in code

Mergeable.

Vellum Constitution — Trust-seeking: by lifting the state machine out of React's commit-order choreography, the four classes of failure modes Codex caught in the predecessor PR become structurally impossible to express, and the machine is now reachable from unit tests with no time-travel reasoning about render order.

The comment referenced "actions registered in a passive effect" and the render-time-capture-vs-selector-subscription debate — all of which became moot when the state machine moved out of React. Updated to state the current invariants: stable identity across renders for prop boundaries, and `this` binding preserved on call (the underlying reason class methods need a wrapper).

Closes LUM-2067 — the test-coverage gap I'd been planning as a follow-up. Two new direct unit tests exercise the recovery code that was the last meaningful uncovered area in the service: 1. **Watchdog firing on a still-initializing assistant retires and re-hatches.** Verifies the timer fires, recovery captures the stuck assistant id, calls `retireAssistantById`, and follows up with `hatchAssistant`. The fix from `4a9aade767` that edge-triggers the watchdog is what makes this reachable — without it, redundant `initializing→initializing` poll transitions would reset the timer indefinitely. 2. **MAX_INITIALIZING_RECOVERIES failed recoveries surface as a terminal timeout error.** Drives four watchdog cycles where each recovery's rehatch returns initializing (the assistant stays stuck). After the budget exhausts, the next watchdog firing transitions to the `buildInitializingTimeoutError` shape. Confirms `MAX_INITIALIZING_RECOVERIES` actually bounds the loop (the kind of off-by-one that would manifest as users permanently stuck on the initializing screen). Pattern: `mock.module("@/assistant/lifecycle", ...)` overrides `INITIALIZING_TIMEOUT_MS` to 30ms while preserving the real helpers (`resolveAssistantLifecycleState`, `buildInitializingTimeoutError`, etc.). The watchdog runs in real time against the shrunk timeout, no fake-timer machinery, no time-travel reasoning. Left for the integration tests: the "force-refresh found active" and post-hatch active-landing branches both depend on the React-tree polling loop firing `applyServerResult` from the next `useAssistantQuery` data tick. The state machine alone can't reach those outcomes without a poll driver, so `onboarding-lifecycle-sync.test.tsx` carries that coverage. 181/181 tests pass.

vex-assistant-bot

✦ Re-APPROVE — reviewed at c9a8b15da0

Two commits since the prior APPROVE at 25a2dcbd4e, both targeted at observations from that review.

Test coverage extension (c9a8b15da0, closing LUM-2067):

12 direct unit tests now, +2 from the prior pass — both exercising the recovery code that was the last meaningful uncovered area in the service:

Watchdog → retire → re-hatch end-to-end. The watchdog is reachable from a test only because transition()'s edge-trigger fix (4a9aade767) stopped initializing → initializing poll repeats from indefinitely resetting the clock. The test captures that explicitly: a getAssistant → initializing projection arms the watchdog, the watchdog fires after TEST_INITIALIZING_TIMEOUT_MS (30ms), retireAssistantById is called with the captured initializingAssistantId, then hatchAssistant follows up. The structural fix made the test possible — which is exactly the right shape.
MAX_INITIALIZING_RECOVERIES budget. Drives four watchdog cycles where each recovery's re-hatch returns initializing (the assistant stays stuck). After the budget exhausts, the next firing transitions to buildInitializingTimeoutError. The kind of off-by-one that would surface as users permanently stuck on the initializing screen — now a regression test catches it.

The mocking pattern is the right call. mock.module("@/assistant/lifecycle", ...) overrides only INITIALIZING_TIMEOUT_MS, preserving the real resolveAssistantLifecycleState / buildInitializingTimeoutError / shouldRecoverFromHatchFailure helpers. The watchdog runs in real time against the shrunk timeout — no fake-timer machinery, no time-travel reasoning about effect commit order. Same testability headline as the original extraction, extended.

Integration boundary articulated. The commit message names the two branches deliberately left to integration coverage — "force-refresh found active" (recovery branch when initializingAssistantId was null at watchdog time) and post-hatch active landing — because both depend on the React-tree polling loop firing applyServerResult from the next useAssistantQuery data tick. The state machine alone can't reach those outcomes without a poll driver, so onboarding-lifecycle-sync.test.tsx carries them. That's a principled unit-vs-integration boundary, not a coverage gap to apologize for.

Comment cleanup (ad191fd7b9):

Picks up Observation #3 from the prior review. The old comment carried forward the action-registry / NOOP-defaults / selector-flip rationale that became moot when the state machine moved out of React. New comment states the current invariants: stable identity for prop boundaries, this binding preserved on call. Crisp.

Observation #2 (unused getLocalGatewayUrl mock at lifecycle-service.test.ts:63) — still present, still unused. Lowest-stakes of the three; non-blocking. Drop on the next pass or annotate as a future-coverage placeholder.

Observation #1 (wiring-effect dep stability) — already verified moot in the prior review: RootLayout passes useNavigate (stable hook output) and a module-imported resolveOnboardingRedirect (program-lifetime constant). The wiring effect doesn't re-fire on renders.

Merge gate

CI 7/7 green at HEAD
Codex review at HEAD: "Didn't find any major issues" (the lingering P1/P2 inline anchors are auto-migrated from older commits — both verified structurally closed in the prior review)
Devin review: 0 potential issues
181/181 tests pass per commit message
All three prior-review observations either addressed (#3) or verified non-issues (#1, #2)

Mergeable. Test additions tilt this from "ship and watch" to "ship with confidence" — the recovery code that runs on a genuinely-stuck cold mount now has end-to-end regression coverage.

ashleeradka marked this pull request as ready for review May 30, 2026 01:43

chatgpt-codex-connector Bot reviewed May 30, 2026

View reviewed changes

devin-ai-integration Bot reviewed May 30, 2026

View reviewed changes

claude added 2 commits May 30, 2026 01:50

chatgpt-codex-connector Bot reviewed May 30, 2026

View reviewed changes

claude added 2 commits May 30, 2026 02:04

vex-assistant-bot Bot previously approved these changes May 30, 2026

View reviewed changes

ashleeradka dismissed vex-assistant-bot[bot]’s stale review via ad191fd May 30, 2026 02:31

vex-assistant-bot Bot approved these changes May 30, 2026

View reviewed changes

vex-assistant-bot Bot merged commit 527ca02 into main May 30, 2026
7 checks passed

vex-assistant-bot Bot deleted the ashlee/lum-2062-lifecycle-service branch May 30, 2026 02:59

ashleeradka mentioned this pull request May 30, 2026

feat(web): add four ESLint rules for previously-vibes-checked conventions (LUM-2070) #32659

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(web): extract assistant lifecycle state machine into a non-React service#32654

refactor(web): extract assistant lifecycle state machine into a non-React service#32654
vex-assistant-bot[bot] merged 7 commits into
mainfrom
ashlee/lum-2062-lifecycle-service

ashleeradka commented May 30, 2026

Uh oh!

linear Bot commented May 30, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 30, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

ashleeradka commented May 30, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 30, 2026

Uh oh!

ashleeradka commented May 30, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 30, 2026

Uh oh!

vex-assistant-bot Bot left a comment

Uh oh!

vex-assistant-bot Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ashleeradka commented May 30, 2026

Why

What changed

Benefits

Why safe

References

What I considered and did NOT do, and why

Root cause analysis

How did the code get into this state?

What mistakes or decisions led to it?

Were there warning signs we missed?

What can we do to prevent this pattern from recurring?

AGENTS.md changes

Uh oh!

linear Bot commented May 30, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ashleeradka commented May 30, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

ashleeradka commented May 30, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 30, 2026

Uh oh!

vex-assistant-bot Bot left a comment

Choose a reason for hiding this comment

Codex P1 (watchdog reset on every poll) — closed at 4a9aade767

Codex P2 (pre-mount lifecycle call → null queryClient) — closed at 26521f2162

Test architecture

Observations (non-blocking)

Merge gate

Uh oh!

vex-assistant-bot Bot left a comment

Choose a reason for hiding this comment

Merge gate

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Codex P1 (watchdog reset on every poll) — closed at `4a9aade767`

Codex P2 (pre-mount lifecycle call → null queryClient) — closed at `26521f2162`