feat(server,chart): flip snapshot default + drop workspaces PVC (Phase 5)#871
Conversation
Phase 5 of the multi-replica rollout. Snapshot mode is now the default
session-store path; LOBU_SESSION_STORE=file is the explicit opt-out for
legacy single-replica / local-dev paths. The chart's workspaces PVC
defaults to disabled — workers fall back to pod-local ephemeral
{WORKSPACE_DIR}/{input,output,temp} (cleared at run start) and PG-backed
session.jsonl, with the per-conversation advisory lock pinning a
conversation's runs to one pod for the run lifetime.
- agent-worker: invert hydrate + cleanup gates to `!== "file"`
- server: invert acquireConversationLock + /agent-history fallback gates;
only forward `LOBU_SESSION_STORE=file` (the opt-out) to workers
- chart: `app.workspaces.enabled` defaults to false; comments updated
- test: `default-off` test now exercises the LOBU_SESSION_STORE=file
opt-out path
helm template charts/lobu (defaults) no longer renders app-pvc.yaml;
helm template charts/lobu --set app.workspaces.enabled=true still
renders the PVC + volume mount for back-compat.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR makes Postgres-backed session snapshots the Phase 5 default (opt-out via LOBU_SESSION_STORE=file). It updates Helm defaults/docs, inverts worker snapshot gating and messages, changes gateway orchestration/env forwarding, updates server routing to prefer DB snapshots unless in file-mode, adds a worker-scoped DELETE snapshot route, purges snapshots on session reset, and adds/updates integration tests. ChangesPhase 5 Snapshot Default Inversion
Sequence Diagram(s)sequenceDiagram
participant Worker
participant Gateway
participant Postgres
Worker->>Gateway: DELETE /worker/transcript/snapshot (worker JWT, org/agent/conv)
Gateway->>Postgres: DELETE FROM public.agent_transcript_snapshot WHERE org/agent/conv
Postgres-->>Gateway: deleted count
Gateway-->>Worker: { deleted: <count> }
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly Related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/agent-worker/src/openclaw/transcript-snapshot.ts (1)
16-20:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the module contract comment to match actual snapshot writes.
The header still says cleanup snapshots all terminal statuses, but runtime now only persists
completed. Please align this doc block to avoid incorrect assumptions in future changes.Also applies to: 26-28
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/agent-worker/src/openclaw/transcript-snapshot.ts` around lines 16 - 20, Update the module docblock in packages/agent-worker/src/openclaw/transcript-snapshot.ts to reflect the actual behavior: instead of claiming snapshots are taken on every terminal status in OpenClawWorker.cleanup(), state that the runtime persists snapshots only for 'completed' runs; adjust the note about hydrate filters to indicate the hydrate query targets terminal_status='completed' so failed/timeout/cancelled runs are not persisted, and change the same wording in the duplicate comment region around lines 26-28 to match this corrected contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 663-671: The gateway currently only sets
commonEnvVars.LOBU_SESSION_STORE="file" when snapshotModeEnabled is false, but
doesn't clear any forwarded WORKER_ENV_LOBU_SESSION_STORE when
snapshotModeEnabled is true, which can silently flip the worker to file-mode;
update the logic in embedded-deployment.ts (where snapshotModeEnabled and
commonEnvVars are handled) to explicitly remove/normalize the worker-facing key
when snapshotModeEnabled is true (e.g., delete or set
commonEnvVars.LOBU_SESSION_STORE = undefined) so the forwarded
WORKER_ENV_LOBU_SESSION_STORE cannot override the gateway's snapshot-default
behavior.
---
Outside diff comments:
In `@packages/agent-worker/src/openclaw/transcript-snapshot.ts`:
- Around line 16-20: Update the module docblock in
packages/agent-worker/src/openclaw/transcript-snapshot.ts to reflect the actual
behavior: instead of claiming snapshots are taken on every terminal status in
OpenClawWorker.cleanup(), state that the runtime persists snapshots only for
'completed' runs; adjust the note about hydrate filters to indicate the hydrate
query targets terminal_status='completed' so failed/timeout/cancelled runs are
not persisted, and change the same wording in the duplicate comment region
around lines 26-28 to match this corrected contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: da3aac61-62c4-4a6d-83cc-b9899eb26257
📒 Files selected for processing (8)
charts/lobu/templates/deployment.yamlcharts/lobu/values.yamlpackages/agent-worker/src/openclaw/transcript-snapshot.tspackages/agent-worker/src/openclaw/worker.tspackages/server/src/gateway/__tests__/agent-transcript-snapshot.test.tspackages/server/src/gateway/gateway/index.tspackages/server/src/gateway/orchestration/impl/embedded-deployment.tspackages/server/src/gateway/routes/public/agent-history.ts
| // Forward the snapshot-mode flag so workers know to hydrate from | ||
| // Postgres and write back on cleanup. Mirrors gateway-side | ||
| // process.env so the lock acquisition above and the worker's | ||
| // session-store selection stay in lockstep. | ||
| if (snapshotModeEnabled) { | ||
| commonEnvVars.LOBU_SESSION_STORE = "snapshot"; | ||
| // session-store selection stay in lockstep. Phase 5: snapshot is | ||
| // the default; only forward LOBU_SESSION_STORE=file (the opt-out) | ||
| // so the worker sees the same mode the gateway picked. | ||
| if (!snapshotModeEnabled) { | ||
| commonEnvVars.LOBU_SESSION_STORE = "file"; | ||
| } |
There was a problem hiding this comment.
Keep worker session-store mode deterministic in snapshot-default path.
In snapshot mode, this no longer overwrites LOBU_SESSION_STORE. A forwarded WORKER_ENV_LOBU_SESSION_STORE=file can now silently flip only the worker to file-mode while the gateway still behaves as snapshot-mode. Please explicitly clear/normalize this env key when snapshotModeEnabled is true.
Suggested fix
- if (!snapshotModeEnabled) {
- commonEnvVars.LOBU_SESSION_STORE = "file";
- }
+ if (snapshotModeEnabled) {
+ delete commonEnvVars.LOBU_SESSION_STORE;
+ } else {
+ commonEnvVars.LOBU_SESSION_STORE = "file";
+ }As per coding guidelines, "WORKER_ENV_* gateway vars are forwarded to workers with the prefix stripped (WORKER_ENV_FOO=bar → FOO=bar)."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Forward the snapshot-mode flag so workers know to hydrate from | |
| // Postgres and write back on cleanup. Mirrors gateway-side | |
| // process.env so the lock acquisition above and the worker's | |
| // session-store selection stay in lockstep. | |
| if (snapshotModeEnabled) { | |
| commonEnvVars.LOBU_SESSION_STORE = "snapshot"; | |
| // session-store selection stay in lockstep. Phase 5: snapshot is | |
| // the default; only forward LOBU_SESSION_STORE=file (the opt-out) | |
| // so the worker sees the same mode the gateway picked. | |
| if (!snapshotModeEnabled) { | |
| commonEnvVars.LOBU_SESSION_STORE = "file"; | |
| } | |
| // Forward the snapshot-mode flag so workers know to hydrate from | |
| // Postgres and write back on cleanup. Mirrors gateway-side | |
| // process.env so the lock acquisition above and the worker's | |
| // session-store selection stay in lockstep. Phase 5: snapshot is | |
| // the default; only forward LOBU_SESSION_STORE=file (the opt-out) | |
| // so the worker sees the same mode the gateway picked. | |
| if (snapshotModeEnabled) { | |
| delete commonEnvVars.LOBU_SESSION_STORE; | |
| } else { | |
| commonEnvVars.LOBU_SESSION_STORE = "file"; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts` around
lines 663 - 671, The gateway currently only sets
commonEnvVars.LOBU_SESSION_STORE="file" when snapshotModeEnabled is false, but
doesn't clear any forwarded WORKER_ENV_LOBU_SESSION_STORE when
snapshotModeEnabled is true, which can silently flip the worker to file-mode;
update the logic in embedded-deployment.ts (where snapshotModeEnabled and
commonEnvVars are handled) to explicitly remove/normalize the worker-facing key
when snapshotModeEnabled is true (e.g., delete or set
commonEnvVars.LOBU_SESSION_STORE = undefined) so the forwarded
WORKER_ENV_LOBU_SESSION_STORE cannot override the gateway's snapshot-default
behavior.
…eset Codex P1 on PR #871. With snapshot mode default-on, the `/new` reset path was only unlinking the local session.jsonl — the next worker boot would rehydrate from Postgres and the user-visible "Starting fresh" would silently resurrect the conversation we just flushed. Two new write paths covering both legs of the reset: - Worker → gateway: new `DELETE /worker/transcript/snapshot` route scoped to the JWT's (org, agent, conv); worker calls `clearSnapshots` from `transcript-snapshot.ts` after the local unlink in the session-reset branch. - Gateway bridge: `chat-response-bridge` runs a `DELETE … USING agents` to scope to the right org without re-deriving it from the payload (the bridge doesn't carry org). Belt-and-suspenders for the case where the worker exits before its own purge call lands. Both paths gate on `LOBU_SESSION_STORE !== "file"` so file-mode opt-outs see no behaviour change. Adds an integration test that seeds three snapshots in the reset conversation + one in a sibling conversation, calls DELETE, and verifies (a) all three reset rows go, (b) the sibling row survives, (c) a second DELETE is idempotent.
…ma (#874) PR #871 flipped LOBU_SESSION_STORE default to snapshot mode. PR #865 added a startup assertion that the worker throws if snapshot mode is on but WorkerConfig.runId is missing. Together those broke every Telegram chat in prod with: "Snapshot mode (LOBU_SESSION_STORE != 'file') but WorkerConfig.runId is missing — runs-queue dispatch did not stamp runId on the job payload" The gateway-side MessageConsumer correctly sets data.runId (line 149) and data.runJobToken (line 185) before dispatch. job-router writes the full payload to SSE. The worker reads payload.runId / payload.runJobToken in payloadToWorkerConfig (sse-client.ts:925-935). The missing link was JobEventSchema. Its inner payload object used plain z.object(...) which is strict-strip by default — runId and runJobToken were silently dropped at safeParse, so payload.runId was always undefined and the assertion fired on every message. Fix: declare runId + runJobToken explicitly on the schema, and add .passthrough() so future MessagePayload fields (mcpConfig, nixConfig, egressConfig, preApprovedTools, exec*, organizationId, networkConfig…) don't regress the same way. Tests: - new regression test feeds a job event with runId + runJobToken through handleEvent and asserts they reach handleThreadMessage (pre-fix: undefined; post-fix: preserved) - new test pins payloadToWorkerConfig's mapping of runId/runJobToken - new test confirms the legacy direct-enqueue path (no runId) still threads undefined cleanly
…ire + session-file utilities (#930) * wip(spike): delete dead worker module lifecycle wiring The worker's session loop invokes `initModuleWorkspace`, `onSessionStart`, and `collectModuleData` from `agent-worker/src/modules/lifecycle.ts`. Each call resolves `moduleRegistry.getWorkerModules()`, which by registration intent surfaces every module that has `onBeforeResponse` in its prototype. In practice, the only modules ever registered are `ModelProviderModule`s (Claude OAuth, ChatGPT OAuth, Bedrock, Gemini CLI, the API-key catalog module) — all inheriting from `BaseModule`, whose lifecycle hooks are no-ops. Three `for` loops over no-op methods run on every session start, plus three `await import("../modules/lifecycle")` dynamic imports that violate the static-import rule documented in CLAUDE.md. Delete the call sites in `openclaw/worker.ts` and the whole `agent-worker/src/modules/` directory. The `WorkerModule` / `BaseModule` interface surface in `@lobu/core` and `gateway/modules/module-system.ts` stays — it's public-API, and removing it is a separate, larger decision (see REPORT.md "Do not do"). Re-adding the call sites later, when an actual module needs them, is ~20 LOC. Validation: `make typecheck` clean from worktree root (server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`. * wip(spike): hoist MessagePayload + JobType into @lobu/core `MessagePayload` is the gateway↔worker wire contract: produced by `MessageConsumer` and `EmbeddedDeploymentManager.dispatch*`, consumed by `GatewayClient.handleThreadMessage` / `handleExecJob` over SSE. Both sides need to agree on the shape, but the type was declared twice and had already drifted: the worker's copy in `agent-worker/src/gateway/ types.ts` was missing `organizationId`, `networkConfig`, `egressConfig`, `mcpConfig`, `nixConfig`, and `preApprovedTools` (all present on the gateway side). The worker's zod schema was patched with `.passthrough()` to keep the extra fields from being stripped at parse time (PR #871 regression), but the static type silently lied — TS would have happily accepted reads of e.g. `payload.preApprovedTools` as `undefined` even when the gateway always populates it. Move both `MessagePayload` and `JobType` (plus the worker-side `QueuedMessage` envelope) into `packages/core/src/worker/wire.ts` and re-export them from `@lobu/core`. The worker and gateway both import from there; `queue-producer.ts` and `gateway/types.ts` keep a re-export so the existing `from "./types"` / `from "../infrastructure/queue/ queue-producer"` paths continue to resolve without churn. Type unification: - `agentOptions: AgentOptions` (was `Record<string, any>` on the gateway, `AgentOptions` on the worker). The worker reads `.model`, `.allowedTools`, `.disallowedTools`, `.timeoutMinutes` directly, so `AgentOptions` is the honest shape. `Record<string, any>` is assignable into it via the `[key: string]: unknown` index signature. - `platformMetadata: Record<string, unknown>` (was `Record<string, any>` on the gateway, a richer named interface on the worker). The unknown flavour forces three new `typeof === "string"` guards inside `base-deployment-manager.ts` where the env-var builder reads `originalMessageTs`, `botResponseTs`, and `teamId` off the bag. - `teamId?: string` (was `string` on the gateway, `string | undefined` on the worker). The worker SSE zod schema marks it `.optional()`, and the worker has explicit fallback logic (`payload.teamId ?? platformMetadata.teamId`). The gateway-side `buildMessagePayload` always sets a non-empty string, so concrete callers are unaffected. Validation: `make build-packages` clean. `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`. * wip(spike): hoist session.jsonl parser to @lobu/core Two HTTP surfaces parse the worker's `.openclaw/session.jsonl`: - the worker's own `/session/messages` and `/session/stats` (rooted at `WORKSPACE_DIR`), and - the gateway's `/session/messages` and `/session/stats` proxy fallback (rooted at `workspaces/<agentId>`, used when the worker is offline). The gateway proxies to the worker when the worker is online and falls back to its own copy otherwise — so the two parsers MUST agree on the JSONL line shape. They had drifted. The worker used `safeJsonParse` (a core util that debug-logs malformed lines); the gateway inlined a bare `try { JSON.parse } catch {}`. The worker carried two extra `SessionEntry` fields (`tokensBefore`, `firstKeptEntryId`) that no parser actually read. The display-projection logic in `entryToMessage` was duplicated character-for-character. Move the shared parts (`SessionEntry`, `ParsedMessage`, `parseSessionEntries`, `entryToMessage`) into `packages/core/src/utils/session-file.ts` and have both call sites import from `@lobu/core`. `safeJsonParse` wins as the canonical line parser — it's already in core, and the debug-log on malformed lines is strictly more useful than silently swallowing them. What does NOT move: `findSessionFile`. The two call sites have intentionally different path policies (worker scans its own `WORKSPACE_DIR` one level deep; gateway scans `workspaces/<agentId>` three levels deep with a `SAFE_AGENT_ID` regex guard against path traversal). Collapsing them would force one side to inherit the other's policy — not a behaviour change to make silently. Dropped from the canonical `SessionEntry`: `tokensBefore` and `firstKeptEntryId`. They were declared on the worker's local interface but nothing reads them — only `agent-worker/src/__tests__/ memory-flush-runtime.test.ts` mentions them in a fixture. Reintroduce when an actual consumer needs them. Validation: `make build-packages` clean. `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`. * wip(spike): delete wire-type re-export shims, import from @lobu/core directly Spike 8a5f686 hoisted `MessagePayload`/`JobType`/`QueuedMessage` into `@lobu/core` but left three re-export shims so existing import paths kept resolving. The user's global CLAUDE.md is explicit ("Do NOT add backwards- compatibility shims, deprecated annotations, or fallback aliases"), and codex flagged the leftover aliases on review. Delete them: - `packages/server/src/gateway/infrastructure/queue/queue-producer.ts` drop `export type { JobType, MessagePayload } from "@lobu/core"`. - `packages/server/src/gateway/orchestration/base-deployment-manager.ts` drop `import type { MessagePayload } from "../infrastructure/queue/ queue-producer.js"` + the trailing `export type { MessagePayload }` alias; pull the type from `@lobu/core` instead. - `packages/agent-worker/src/gateway/types.ts` shrinks to its only remaining concern: the worker-only `ResponseData = ThreadResponsePayload & { originalMessageId: string }`. The wire-contract types are gone. Every importer migrated to `import type { MessagePayload, ... } from "@lobu/core"`: - worker: `sse-client.ts`, `message-batcher.ts`, `__tests__/message- batcher.test.ts`. - gateway: `orchestration/{impl/embedded-deployment,message-consumer}.ts`, `services/platform-helpers.ts`, three orchestration tests (`embedded-deployment.test.ts`, `orchestration-harden.test.ts`, `base-deployment-grants.test.ts`). Validation: `make build-packages` clean. `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`. * wip(spike): delete dead module-lifecycle and dispatcher surfaces Spike 83a2cf4 deleted the worker call sites for `initModuleWorkspace`, `onSessionStart`, and `collectModuleData` but left the interfaces and no-op base implementations they targeted. Codex flagged the dead surfaces on review; the rule from CLAUDE.md is "remove the old code entirely instead of keeping it alongside the new code." Verified zero callers anywhere in the monorepo before deleting: for sym in onSessionStart onSessionEnd initWorkspace \ onBeforeResponse generateActionButtons handleAction \ ActionButton ModuleSessionContext DispatcherModule \ DispatcherContext getContainerAddress getWorkerModules; do grep -rn "\\b$sym\\b" packages/ scripts/ examples/ config/ done After this commit the only matches are `initWorkspaceProvider` in `packages/server/src/workspace/` (different concept, unrelated). The sibling `WorkerContext` symbol in `packages/server/src/gateway/routes/ internal/*.ts` is a Hono variable-context type; the deleted one was the worker-module hook context and is removed only from `packages/core/src/ modules.ts`. `BaseProviderModule` (the only `BaseModule` subclass — Claude / ChatGPT / Bedrock / Gemini / API-key catalog) was verified to NOT override any of the deleted methods (it overrides `isEnabled`, `buildEnvVars`, `getModelOptions`, and the provider-specific surface only), so trimming `BaseModule` is type-safe. Deletions: - `packages/core/src/modules.ts`: `WorkerContext`, `WorkerModule`, `ActionButton`, `ModuleSessionContext` interfaces; `IModuleRegistry.getWorkerModules` + `ModuleRegistry.getWorkerModules` method. - `packages/core/src/index.ts`: `ActionButton` and `ModuleSessionContext` type re-exports. - `packages/server/src/gateway/modules/module-system.ts`: `DispatcherContext`, `DispatcherModule` interfaces; the `WorkerModule<TModuleData>` and `DispatcherModule<TModuleData>` clauses on `BaseModule implements`; the `initWorkspace`, `onSessionStart`, `onSessionEnd`, `onBeforeResponse`, `generateActionButtons`, `handleAction`, `getContainerAddress` methods on `BaseModule`; and the `getContainerAddress(): string` member on the `OrchestratorModule` interface. Validation: `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`. `make build-packages` clean. * wip(spike): delete unused dynamic-import plugin path in moduleRegistry `ModuleRegistry.registerAvailableModules(modulePackages?: string[])` exists to dynamically `await import(packageName)` plugin packages at startup. Both call sites pass no args — the loop body is unreachable: packages/agent-worker/src/index.ts:42 → `registerAvailableModules()` packages/server/src/gateway/services/core-services.ts:844 → same Verified monorepo-wide: grep -rn "registerAvailableModules" packages/ scripts/ examples/ config/ Two hits, both no-arg. The method is also the only dynamic `import()` in `@lobu/core`, which conflicts with the project memory rule "No dynamic imports — always static `import`" (`feedback_no_dynamic_imports`). Deletions: - `packages/core/src/modules.ts`: `IModuleRegistry.registerAvailableModules` signature and `ModuleRegistry.registerAvailableModules` method (~40 lines including the `@example` block). - `packages/agent-worker/src/index.ts:42`: drop the no-arg call. - `packages/server/src/gateway/services/core-services.ts:844`: drop the no-arg call; the comment above the remaining `initAll()` is updated to read "Initialize all registered modules" (registration now happens at the explicit `moduleRegistry.register(...)` sites earlier in `core-services.ts`). Behaviour: zero change. Both code paths previously entered the function, iterated an empty array, and returned. Removing them removes the only dynamic-import in `@lobu/core` and drops one no-op step out of every gateway/worker boot. Validation: `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean in `packages/agent-worker`. `make build-packages` clean. * wip(spike): dedupe ModelOption and PrefillSkill; @lobu/core is the source Two type duplicates that codex flagged on review: 1. `ModelOption` was declared in BOTH `packages/core/src/api-types.ts:38` AND `packages/server/src/gateway/modules/module-system.ts:5`. Same shape (`{label: string; value: string}`), declared twice. Every provider module imported the server-local copy; the core export was carried only by the public SDK surface. 2. `PrefillSkill` was declared in BOTH `packages/core/src/api-types.ts:94` AND `packages/server/src/gateway/auth/settings/token-service.ts:4`. Same shape. The token-service file also carried a local `PrefillMcpServer` interface that was structurally identical to core's `PrefillMcp` except for a narrowed `type: "sse" | "streamable-http" | "stdio"` vs core's `type?: string`. The narrowed type is declaration- only and never narrowed at runtime, so widening to core's shape is behaviour-neutral; the token route emits whatever string the operator passed in either way. `@lobu/core` wins (it's the public API surface; deleting it would be a breaking change for SDK consumers). The server-local duplicates go. Migrated imports — every consumer now pulls `ModelOption` / `PrefillMcp` / `PrefillSkill` from `@lobu/core` directly: - `packages/server/src/gateway/auth/gemini/cli-module.ts` - `packages/server/src/gateway/auth/api-key-provider-module.ts` - `packages/server/src/gateway/auth/claude/oauth-module.ts` - `packages/server/src/gateway/auth/bedrock/provider-module.ts` - `packages/server/src/gateway/auth/chatgpt/chatgpt-oauth-module.ts` - `packages/server/src/gateway/auth/provider-model-options.ts` - `packages/server/src/gateway/auth/settings/token-service.ts` (`prefillMcpServers?: PrefillMcp[]` — field name stays, element type pulled from core). - `packages/server/src/gateway/routes/public/agent-config.ts` - `packages/server/src/gateway/modules/module-system.ts` (`ModelOption` now re-imported from `@lobu/core` for the `OrchestratorModule` / `ModelProviderModule` interfaces in this file). Verification: grep -rn "^export interface ModelOption\\b\\|^interface ModelOption\\b" \ "^export interface PrefillSkill\\b\\|^interface PrefillSkill\\b" \ "^export interface PrefillMcp\\b\\|^interface PrefillMcpServer\\b" \ packages/ 2>/dev/null | grep -v "/dist/" returns one match per symbol — all under `packages/core/src/api-types.ts`. Validation: `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`. `make build-packages` clean. * wip(spike): drop unused TModuleData generic; fix stale wire-path comment Two trivial leftovers from rounds 1–3: 1. The `<TModuleData>` type parameter on `ModuleInterface` (core), `OrchestratorModule` (server module-system), and `BaseModule` (server module-system) was a placeholder for the deleted `WorkerModule` surface (`onBeforeResponse(): Promise<TModuleData | null>`). With that surface gone in spike 9de939d the parameter has no semantic use — the core declaration already had it underscored (`_TModuleData`), and every instantiation site (`ModelProviderModule extends OrchestratorModule`, `OrchestratorModule[]`, `(m): m is OrchestratorModule` guard) was passing no arg. Removed the type parameter from all three declarations. 2. `packages/agent-worker/src/gateway/types.ts:5` carried a stale doc comment pointing imports at `@lobu/core/worker/wire`. That subpath isn't declared in `packages/core/package.json`'s `exports` field — only `.` is exported — so the path doesn't resolve. Every actual import in the tree already uses `@lobu/core` (the package root). Comment updated to match. No code import paths changed. Verification (`grep -rn`, excluding `/dist/`): - `BaseModule<` / `OrchestratorModule<` / `ModuleInterface<` → zero matches. - `TModuleData` → zero matches. - `@lobu/core/worker/wire` → zero matches. `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean in `packages/agent-worker`. `make build-packages` clean. * wip(spike): delete dead moduleData wire plumbing; fix stale subpath comment Two trivial leftovers from spike 9de939d (the module-lifecycle deletion) and spike 9a401f5 (the session-file hoist): 1. `moduleData` was wire-plumbing for the deleted `collectModuleData()` path. The collector was removed in 9de939d, so nothing calls `setModuleData` anymore — `this.moduleData` is always `undefined`, and the two `sendResponse` sites that read it serialize `undefined` into every response. Verified monorepo-wide: grep -rn "\\bmoduleData\\b\\|setModuleData" packages/ scripts/ \ examples/ config/ Five hits, all on the write side (declaration / setter / two reads of the always-undefined field / the wire-type slot). Zero readers on the gateway end. Deleted: - `packages/core/src/types.ts`: `moduleData?` field on `ThreadResponsePayload`. - `packages/core/src/worker/transport.ts`: `setModuleData` method on the `WorkerTransport` interface. - `packages/agent-worker/src/gateway/gateway-integration.ts`: the `private moduleData?` field, the `setModuleData` method, and the two `moduleData: this.moduleData` lines in `sendStreamDelta` / `signalCompletion`. 2. `packages/server/src/gateway/routes/public/agent-history.ts:75` pointed at `@lobu/core/utils/session-file`. That subpath isn't declared in `packages/core/package.json`'s `exports` — only `.` is exported. Every actual import in the tree uses `@lobu/core`. Comment updated to match reality. `make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean in `packages/agent-worker`. `make build-packages` clean. Monorepo grep for `moduleData` / `setModuleData` / `@lobu/core/utils/session-file` returns zero matches.
Summary
Phase 5 of the multi-replica rollout — completes the work started by #836, #848, #865, and #870.
LOBU_SESSION_STOREenv gate from opt-in (=== "snapshot") to opt-out (!== "file"). Snapshot mode is the default; settingLOBU_SESSION_STORE=filekeeps the legacy file-only path for single-replica / local-dev self-hosters.app.workspaces.enabledtofalseincharts/lobu/values.yaml. The PVC was the entire reason prod was stuck onreplicaCount: 1+strategy: Recreate. Self-hosters who want it back setapp.workspaces.enabled=true.LOBU_SESSION_STORE=filefor the opt-out branch.Pairs with
lobu-ai/owlettoPR (forthcoming) which removes thereplicaCount: 1override and thestrategy: Recreatepatch from the prod overlay.Sites updated
All previously-gated-on-
=== "snapshot"sites flipped to!== "file":packages/agent-worker/src/openclaw/worker.ts— hydrate, cleanup snapshot POST, runId/runJobToken assertionspackages/agent-worker/src/openclaw/transcript-snapshot.ts— doc commentpackages/server/src/gateway/orchestration/impl/embedded-deployment.ts—acquireConversationLockenablement gate + worker-env forwarding (now forwardsLOBU_SESSION_STORE=fileonly when explicitly opted out)packages/server/src/gateway/routes/public/agent-history.ts—/session/messages,/session/stats,/connectsnapshot-first fallbackpackages/server/src/gateway/gateway/index.ts— route mount commentpackages/server/src/gateway/__tests__/agent-transcript-snapshot.test.ts—default-offtest renamed and rewritten to setLOBU_SESSION_STORE=fileWhat workers write to
{WORKSPACE_DIR}(PVC-drop impact)Audited every
workspaceDir-rooted write path:.openclaw/session.jsonl— covered by snapshot mode (PG).openclaw/provider.json— per-pod cache for "did provider change since last turn"; rebuilt on next call, no cross-pod consistency required (the per-conversation advisory lock pins a conversation's runs to one pod)input/,output/,temp/— populated bydownloadInputFiles(refetched per run from gateway artifact URLs), andoutput/is wiped at everysetupIODirectoriescall. Pod-local ephemeral storage is correct hereloadPlugins(pluginsConfig, workspaceDir)— same per-run semantics as input/outputNo write paths require cross-pod persistence beyond
session.jsonl.Graceful shutdown / SIGTERM behaviour
Already wired in
packages/agent-worker/src/index.ts:shutdowncallsgatewayClient.stop()which (a) aborts the SSE stream, (b) awaitscurrentWorker.cleanup()— the same path that POSTs the snapshot whenterminalStatus === "completed".terminationGracePeriodSeconds: 45in the chart gives the worker time to flush.Mid-run SIGTERM = no snapshot written. This is intentional and unchanged from the design landed in #865:
cleanup()only POSTs on the"completed"branch because a snapshot ending mid-tool-use would corrupt the next hydrate. The trade-off documented intranscript-snapshot.ts:30still holds — a mid-run crash loses the in-flight transcript and the runs queue re-delivers from the previous user message on the next pod. No new behaviour introduced here.Phase 5 rollout — operator procedure
summaries-app-lobu-workerDeployment, then waiting forSELECT count(*) FROM runs WHERE status IN ('claimed','running') AND run_type = 'chat_message'to hit0).kubectl get deploy summaries-app-lobu-app -n summaries-prod→replicas=2, both ready,StrategyType: RollingUpdatekubectl get pvc -n summaries-prod→ no*-lobu-app-workspacesPVC (embeddings + worker PVCs remain)kubectl describe svc summaries-app-lobu-app -n summaries-prod→Session Affinity: ClientIP(from owletto chore(main): release 3.1.1 #175)kubectl get pods -n summaries-prod -l app.kubernetes.io/name=lobu→ both pods running, no crash loops,/health/ready200@clawdotfreebotor./scripts/test-bot.sh "@me phase5 smoke".Rollback if Phase 5 breaks prod
Revert both PRs in reverse order:
Flux re-reconciles. Snapshot rows written during the failed window are safe to keep — they're idempotent; on the next run the worker either hydrates from one (if completed) or skips it.
Verification
Test plan
Summary by CodeRabbit
Chores
New Features
Tests