Skip to content

feat(server,chart): flip snapshot default + drop workspaces PVC (Phase 5)#871

Merged
buremba merged 2 commits into
mainfrom
feat/phase5-multi-replica
May 18, 2026
Merged

feat(server,chart): flip snapshot default + drop workspaces PVC (Phase 5)#871
buremba merged 2 commits into
mainfrom
feat/phase5-multi-replica

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 18, 2026

Summary

Phase 5 of the multi-replica rollout — completes the work started by #836, #848, #865, and #870.

  • Flips the LOBU_SESSION_STORE env gate from opt-in (=== "snapshot") to opt-out (!== "file"). Snapshot mode is the default; setting LOBU_SESSION_STORE=file keeps the legacy file-only path for single-replica / local-dev self-hosters.
  • Defaults app.workspaces.enabled to false in charts/lobu/values.yaml. The PVC was the entire reason prod was stuck on replicaCount: 1 + strategy: Recreate. Self-hosters who want it back set app.workspaces.enabled=true.
  • Updates comments + the "default-off" test to use LOBU_SESSION_STORE=file for the opt-out branch.

Pairs with lobu-ai/owletto PR (forthcoming) which removes the replicaCount: 1 override and the strategy: Recreate patch from the prod overlay.

Sites updated

All previously-gated-on-=== "snapshot" sites flipped to !== "file":

  • packages/agent-worker/src/openclaw/worker.ts — hydrate, cleanup snapshot POST, runId/runJobToken assertions
  • packages/agent-worker/src/openclaw/transcript-snapshot.ts — doc comment
  • packages/server/src/gateway/orchestration/impl/embedded-deployment.tsacquireConversationLock enablement gate + worker-env forwarding (now forwards LOBU_SESSION_STORE=file only when explicitly opted out)
  • packages/server/src/gateway/routes/public/agent-history.ts/session/messages, /session/stats, /connect snapshot-first fallback
  • packages/server/src/gateway/gateway/index.ts — route mount comment
  • packages/server/src/gateway/__tests__/agent-transcript-snapshot.test.tsdefault-off test renamed and rewritten to set LOBU_SESSION_STORE=file

What workers write to {WORKSPACE_DIR} (PVC-drop impact)

Audited every workspaceDir-rooted write path:

  • .openclaw/session.jsonl — covered by snapshot mode (PG)
  • .openclaw/provider.json — per-pod cache for "did provider change since last turn"; rebuilt on next call, no cross-pod consistency required (the per-conversation advisory lock pins a conversation's runs to one pod)
  • input/, output/, temp/ — populated by downloadInputFiles (refetched per run from gateway artifact URLs), and output/ is wiped at every setupIODirectories call. Pod-local ephemeral storage is correct here
  • Plugin scratch via loadPlugins(pluginsConfig, workspaceDir) — same per-run semantics as input/output

No write paths require cross-pod persistence beyond session.jsonl.

Graceful shutdown / SIGTERM behaviour

Already wired in packages/agent-worker/src/index.ts:

process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));

shutdown calls gatewayClient.stop() which (a) aborts the SSE stream, (b) awaits currentWorker.cleanup() — the same path that POSTs the snapshot when terminalStatus === "completed". terminationGracePeriodSeconds: 45 in the chart gives the worker time to flush.

Mid-run SIGTERM = no snapshot written. This is intentional and unchanged from the design landed in #865: cleanup() only POSTs on the "completed" branch because a snapshot ending mid-tool-use would corrupt the next hydrate. The trade-off documented in transcript-snapshot.ts:30 still holds — a mid-run crash loses the in-flight transcript and the runs queue re-delivers from the previous user message on the next pod. No new behaviour introduced here.

Phase 5 rollout — operator procedure

  1. Wait for a quiet window (low chat traffic).
  2. Optional defensive drain: pause new chat-message dispatch (no built-in primitive yet — easiest is to scale the workforce to zero by killing the summaries-app-lobu-worker Deployment, then waiting for SELECT count(*) FROM runs WHERE status IN ('claimed','running') AND run_type = 'chat_message' to hit 0).
  3. Merge this PR. Merge the owletto PR. Flux reconciles automatically.
  4. Post-reconcile checks:
    • kubectl get deploy summaries-app-lobu-app -n summaries-prodreplicas=2, both ready, StrategyType: RollingUpdate
    • kubectl get pvc -n summaries-prod → no *-lobu-app-workspaces PVC (embeddings + worker PVCs remain)
    • kubectl describe svc summaries-app-lobu-app -n summaries-prodSession Affinity: ClientIP (from owletto chore(main): release 3.1.1 #175)
    • kubectl get pods -n summaries-prod -l app.kubernetes.io/name=lobu → both pods running, no crash loops, /health/ready 200
  5. Smoke test: send a test message via @clawdotfreebot or ./scripts/test-bot.sh "@me phase5 smoke".

Rollback if Phase 5 breaks prod

Revert both PRs in reverse order:

gh pr revert <owletto-PR>  --repo lobu-ai/owletto
gh pr revert <this-PR>     --repo lobu-ai/lobu

Flux re-reconciles. Snapshot rows written during the failed window are safe to keep — they're idempotent; on the next run the worker either hydrates from one (if completed) or skips it.

Verification

make build-packages   # all packages compile
make typecheck        # strict tsc --noEmit clean
bun test packages/server/src/gateway/__tests__/   # 862 pass / 0 fail / 22 snapshot tests included
helm template charts/lobu --validate                                         # defaults: no app-pvc.yaml
helm template charts/lobu --validate --set app.workspaces.enabled=true       # opt-in: app-pvc.yaml renders + volume mount restored

Test plan

  • CI green
  • Codex review ≥ 90% confidence
  • Operator runs through the rollout procedure above against prod once owletto PR also merged

Summary by CodeRabbit

  • Chores

    • Clarified deployment/config guidance and Phase 5 session-state defaults; pod-local scratch is now ephemeral by default and multi-replica remains opt-in.
    • Session snapshots are enabled by default; legacy/local-dev can opt out via a file-mode setting.
  • New Features

    • Workers can trigger deletion of per-run transcript snapshots via a new worker-facing endpoint.
  • Tests

    • Improved snapshot gating and session-reset integration tests and cleanup.

Review Change Stack

Phase 5 of the multi-replica rollout. Snapshot mode is now the default
session-store path; LOBU_SESSION_STORE=file is the explicit opt-out for
legacy single-replica / local-dev paths. The chart's workspaces PVC
defaults to disabled — workers fall back to pod-local ephemeral
{WORKSPACE_DIR}/{input,output,temp} (cleared at run start) and PG-backed
session.jsonl, with the per-conversation advisory lock pinning a
conversation's runs to one pod for the run lifetime.

- agent-worker: invert hydrate + cleanup gates to `!== "file"`
- server: invert acquireConversationLock + /agent-history fallback gates;
  only forward `LOBU_SESSION_STORE=file` (the opt-out) to workers
- chart: `app.workspaces.enabled` defaults to false; comments updated
- test: `default-off` test now exercises the LOBU_SESSION_STORE=file
  opt-out path

helm template charts/lobu (defaults) no longer renders app-pvc.yaml;
helm template charts/lobu --set app.workspaces.enabled=true still
renders the PVC + volume mount for back-compat.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 93c846d0-c3b8-4cda-ab2e-9bb44185176d

📥 Commits

Reviewing files that changed from the base of the PR and between 7648b38 and 02a08e1.

📒 Files selected for processing (5)
  • packages/agent-worker/src/openclaw/transcript-snapshot.ts
  • packages/agent-worker/src/openclaw/worker.ts
  • packages/server/src/gateway/__tests__/agent-transcript-snapshot.test.ts
  • packages/server/src/gateway/connections/chat-response-bridge.ts
  • packages/server/src/gateway/gateway/transcript-routes.ts

📝 Walkthrough

Walkthrough

This PR makes Postgres-backed session snapshots the Phase 5 default (opt-out via LOBU_SESSION_STORE=file). It updates Helm defaults/docs, inverts worker snapshot gating and messages, changes gateway orchestration/env forwarding, updates server routing to prefer DB snapshots unless in file-mode, adds a worker-scoped DELETE snapshot route, purges snapshots on session reset, and adds/updates integration tests.

Changes

Phase 5 Snapshot Default Inversion

Layer / File(s) Summary
Helm Configuration Defaults and Documentation
charts/lobu/values.yaml, charts/lobu/templates/deployment.yaml
app.workspaces.enabled now defaults to false; deployment and values comments updated to document Phase 5 snapshot-mode default, multi-replica RWX prerequisites, and preStopDelaySeconds / preStop hook semantics for RollingUpdate vs Recreate.
Worker: Snapshot Helpers and Runtime Logic
packages/agent-worker/src/openclaw/transcript-snapshot.ts, packages/agent-worker/src/openclaw/worker.ts
Module docs updated for Phase 5/6; added clearSnapshots() helper; worker gating now treats snapshot mode as default unless LOBU_SESSION_STORE === "file", updates hydrate/persist conditions and messages, and best-effort purges Postgres snapshots on session reset when not in file-mode.
Gateway Orchestration and Env Forwarding
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts, packages/server/src/gateway/gateway/index.ts
snapshotModeEnabled now defaults true unless LOBU_SESSION_STORE === "file"; gateway forwards LOBU_SESSION_STORE = "file" only when disabling snapshot mode; release-lock docs updated for file opt-out.
Routing: Snapshot Selection, Purge Flows, and DELETE Route
packages/server/src/gateway/routes/public/agent-history.ts, packages/server/src/gateway/connections/chat-response-bridge.ts, packages/server/src/gateway/gateway/transcript-routes.ts
Server routes attempt Postgres snapshot reads when LOBU_SESSION_STORE !== "file" with file fallback; added DELETE /snapshot worker endpoint to delete agent_transcript_snapshot rows for the token's (org,agent,conv); chat-response-bridge/session-reset flows perform best-effort DB purges when not in file-mode.
Tests and Validation
packages/server/src/gateway/__tests__/agent-transcript-snapshot.test.ts
Updated test to treat file-mode as an opt-out (sets LOBU_SESSION_STORE=file and asserts no snapshot rows/readable snapshot), fixes env cleanup behavior, and adds session-reset test validating DELETE purge and idempotency.

Sequence Diagram(s)

sequenceDiagram
  participant Worker
  participant Gateway
  participant Postgres
  Worker->>Gateway: DELETE /worker/transcript/snapshot (worker JWT, org/agent/conv)
  Gateway->>Postgres: DELETE FROM public.agent_transcript_snapshot WHERE org/agent/conv
  Postgres-->>Gateway: deleted count
  Gateway-->>Worker: { deleted: <count> }
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

  • lobu-ai/lobu#865: Related work on the agent_transcript_snapshot Postgres flow and transcript routes that this change builds upon.
  • lobu-ai/lobu#776: Prior Helm deployment/preStop semantics edits related to RollingUpdate/Recreate documentation this PR refines.

Poem

🐰 In Phase 5 fields we hop and smile,
Snapshots now default across each mile,
Workers ping gateway, DB rows take flight,
File-mode whispers "opt-out" in the night,
Hooray — consistent snapshots, snug and bright!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: flipping snapshot mode to default and dropping the workspaces PVC as part of Phase 5.
Description check ✅ Passed The description is comprehensive and well-structured, covering summary, test plan with checkboxes, detailed site updates, workspace audit, and operator rollout procedure.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/phase5-multi-replica

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/agent-worker/src/openclaw/transcript-snapshot.ts (1)

16-20: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the module contract comment to match actual snapshot writes.

The header still says cleanup snapshots all terminal statuses, but runtime now only persists completed. Please align this doc block to avoid incorrect assumptions in future changes.

Also applies to: 26-28

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/agent-worker/src/openclaw/transcript-snapshot.ts` around lines 16 -
20, Update the module docblock in
packages/agent-worker/src/openclaw/transcript-snapshot.ts to reflect the actual
behavior: instead of claiming snapshots are taken on every terminal status in
OpenClawWorker.cleanup(), state that the runtime persists snapshots only for
'completed' runs; adjust the note about hydrate filters to indicate the hydrate
query targets terminal_status='completed' so failed/timeout/cancelled runs are
not persisted, and change the same wording in the duplicate comment region
around lines 26-28 to match this corrected contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 663-671: The gateway currently only sets
commonEnvVars.LOBU_SESSION_STORE="file" when snapshotModeEnabled is false, but
doesn't clear any forwarded WORKER_ENV_LOBU_SESSION_STORE when
snapshotModeEnabled is true, which can silently flip the worker to file-mode;
update the logic in embedded-deployment.ts (where snapshotModeEnabled and
commonEnvVars are handled) to explicitly remove/normalize the worker-facing key
when snapshotModeEnabled is true (e.g., delete or set
commonEnvVars.LOBU_SESSION_STORE = undefined) so the forwarded
WORKER_ENV_LOBU_SESSION_STORE cannot override the gateway's snapshot-default
behavior.

---

Outside diff comments:
In `@packages/agent-worker/src/openclaw/transcript-snapshot.ts`:
- Around line 16-20: Update the module docblock in
packages/agent-worker/src/openclaw/transcript-snapshot.ts to reflect the actual
behavior: instead of claiming snapshots are taken on every terminal status in
OpenClawWorker.cleanup(), state that the runtime persists snapshots only for
'completed' runs; adjust the note about hydrate filters to indicate the hydrate
query targets terminal_status='completed' so failed/timeout/cancelled runs are
not persisted, and change the same wording in the duplicate comment region
around lines 26-28 to match this corrected contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: da3aac61-62c4-4a6d-83cc-b9899eb26257

📥 Commits

Reviewing files that changed from the base of the PR and between d4691a7 and 7648b38.

📒 Files selected for processing (8)
  • charts/lobu/templates/deployment.yaml
  • charts/lobu/values.yaml
  • packages/agent-worker/src/openclaw/transcript-snapshot.ts
  • packages/agent-worker/src/openclaw/worker.ts
  • packages/server/src/gateway/__tests__/agent-transcript-snapshot.test.ts
  • packages/server/src/gateway/gateway/index.ts
  • packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
  • packages/server/src/gateway/routes/public/agent-history.ts

Comment on lines 663 to 671
// Forward the snapshot-mode flag so workers know to hydrate from
// Postgres and write back on cleanup. Mirrors gateway-side
// process.env so the lock acquisition above and the worker's
// session-store selection stay in lockstep.
if (snapshotModeEnabled) {
commonEnvVars.LOBU_SESSION_STORE = "snapshot";
// session-store selection stay in lockstep. Phase 5: snapshot is
// the default; only forward LOBU_SESSION_STORE=file (the opt-out)
// so the worker sees the same mode the gateway picked.
if (!snapshotModeEnabled) {
commonEnvVars.LOBU_SESSION_STORE = "file";
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep worker session-store mode deterministic in snapshot-default path.

In snapshot mode, this no longer overwrites LOBU_SESSION_STORE. A forwarded WORKER_ENV_LOBU_SESSION_STORE=file can now silently flip only the worker to file-mode while the gateway still behaves as snapshot-mode. Please explicitly clear/normalize this env key when snapshotModeEnabled is true.

Suggested fix
-      if (!snapshotModeEnabled) {
-        commonEnvVars.LOBU_SESSION_STORE = "file";
-      }
+      if (snapshotModeEnabled) {
+        delete commonEnvVars.LOBU_SESSION_STORE;
+      } else {
+        commonEnvVars.LOBU_SESSION_STORE = "file";
+      }

As per coding guidelines, "WORKER_ENV_* gateway vars are forwarded to workers with the prefix stripped (WORKER_ENV_FOO=barFOO=bar)."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Forward the snapshot-mode flag so workers know to hydrate from
// Postgres and write back on cleanup. Mirrors gateway-side
// process.env so the lock acquisition above and the worker's
// session-store selection stay in lockstep.
if (snapshotModeEnabled) {
commonEnvVars.LOBU_SESSION_STORE = "snapshot";
// session-store selection stay in lockstep. Phase 5: snapshot is
// the default; only forward LOBU_SESSION_STORE=file (the opt-out)
// so the worker sees the same mode the gateway picked.
if (!snapshotModeEnabled) {
commonEnvVars.LOBU_SESSION_STORE = "file";
}
// Forward the snapshot-mode flag so workers know to hydrate from
// Postgres and write back on cleanup. Mirrors gateway-side
// process.env so the lock acquisition above and the worker's
// session-store selection stay in lockstep. Phase 5: snapshot is
// the default; only forward LOBU_SESSION_STORE=file (the opt-out)
// so the worker sees the same mode the gateway picked.
if (snapshotModeEnabled) {
delete commonEnvVars.LOBU_SESSION_STORE;
} else {
commonEnvVars.LOBU_SESSION_STORE = "file";
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts` around
lines 663 - 671, The gateway currently only sets
commonEnvVars.LOBU_SESSION_STORE="file" when snapshotModeEnabled is false, but
doesn't clear any forwarded WORKER_ENV_LOBU_SESSION_STORE when
snapshotModeEnabled is true, which can silently flip the worker to file-mode;
update the logic in embedded-deployment.ts (where snapshotModeEnabled and
commonEnvVars are handled) to explicitly remove/normalize the worker-facing key
when snapshotModeEnabled is true (e.g., delete or set
commonEnvVars.LOBU_SESSION_STORE = undefined) so the forwarded
WORKER_ENV_LOBU_SESSION_STORE cannot override the gateway's snapshot-default
behavior.

…eset

Codex P1 on PR #871. With snapshot mode default-on, the `/new` reset
path was only unlinking the local session.jsonl — the next worker boot
would rehydrate from Postgres and the user-visible "Starting fresh"
would silently resurrect the conversation we just flushed.

Two new write paths covering both legs of the reset:

- Worker → gateway: new `DELETE /worker/transcript/snapshot` route
  scoped to the JWT's (org, agent, conv); worker calls
  `clearSnapshots` from `transcript-snapshot.ts` after the local
  unlink in the session-reset branch.
- Gateway bridge: `chat-response-bridge` runs a `DELETE … USING
  agents` to scope to the right org without re-deriving it from the
  payload (the bridge doesn't carry org). Belt-and-suspenders for
  the case where the worker exits before its own purge call lands.

Both paths gate on `LOBU_SESSION_STORE !== "file"` so file-mode
opt-outs see no behaviour change. Adds an integration test that seeds
three snapshots in the reset conversation + one in a sibling
conversation, calls DELETE, and verifies (a) all three reset rows
go, (b) the sibling row survives, (c) a second DELETE is idempotent.
@buremba buremba merged commit 9484be3 into main May 18, 2026
20 of 22 checks passed
@buremba buremba deleted the feat/phase5-multi-replica branch May 18, 2026 13:58
buremba added a commit that referenced this pull request May 18, 2026
…ma (#874)

PR #871 flipped LOBU_SESSION_STORE default to snapshot mode. PR #865
added a startup assertion that the worker throws if snapshot mode is on
but WorkerConfig.runId is missing. Together those broke every Telegram
chat in prod with:

  "Snapshot mode (LOBU_SESSION_STORE != 'file') but WorkerConfig.runId
   is missing — runs-queue dispatch did not stamp runId on the job
   payload"

The gateway-side MessageConsumer correctly sets data.runId (line 149) and
data.runJobToken (line 185) before dispatch. job-router writes the full
payload to SSE. The worker reads payload.runId / payload.runJobToken in
payloadToWorkerConfig (sse-client.ts:925-935).

The missing link was JobEventSchema. Its inner payload object used plain
z.object(...) which is strict-strip by default — runId and runJobToken
were silently dropped at safeParse, so payload.runId was always
undefined and the assertion fired on every message.

Fix: declare runId + runJobToken explicitly on the schema, and add
.passthrough() so future MessagePayload fields (mcpConfig, nixConfig,
egressConfig, preApprovedTools, exec*, organizationId, networkConfig…)
don't regress the same way.

Tests:
- new regression test feeds a job event with runId + runJobToken through
  handleEvent and asserts they reach handleThreadMessage (pre-fix:
  undefined; post-fix: preserved)
- new test pins payloadToWorkerConfig's mapping of runId/runJobToken
- new test confirms the legacy direct-enqueue path (no runId) still
  threads undefined cleanly
buremba added a commit that referenced this pull request May 19, 2026
…ire + session-file utilities (#930)

* wip(spike): delete dead worker module lifecycle wiring

The worker's session loop invokes `initModuleWorkspace`, `onSessionStart`,
and `collectModuleData` from `agent-worker/src/modules/lifecycle.ts`. Each
call resolves `moduleRegistry.getWorkerModules()`, which by registration
intent surfaces every module that has `onBeforeResponse` in its prototype.

In practice, the only modules ever registered are `ModelProviderModule`s
(Claude OAuth, ChatGPT OAuth, Bedrock, Gemini CLI, the API-key catalog
module) — all inheriting from `BaseModule`, whose lifecycle hooks are
no-ops. Three `for` loops over no-op methods run on every session start,
plus three `await import("../modules/lifecycle")` dynamic imports that
violate the static-import rule documented in CLAUDE.md.

Delete the call sites in `openclaw/worker.ts` and the whole
`agent-worker/src/modules/` directory. The `WorkerModule` /
`BaseModule` interface surface in `@lobu/core` and
`gateway/modules/module-system.ts` stays — it's public-API, and removing
it is a separate, larger decision (see REPORT.md "Do not do"). Re-adding
the call sites later, when an actual module needs them, is ~20 LOC.

Validation: `make typecheck` clean from worktree root (server + owletto).
`bunx tsc --noEmit` clean inside `packages/agent-worker`.

* wip(spike): hoist MessagePayload + JobType into @lobu/core

`MessagePayload` is the gateway↔worker wire contract: produced by
`MessageConsumer` and `EmbeddedDeploymentManager.dispatch*`, consumed by
`GatewayClient.handleThreadMessage` / `handleExecJob` over SSE. Both
sides need to agree on the shape, but the type was declared twice and
had already drifted: the worker's copy in `agent-worker/src/gateway/
types.ts` was missing `organizationId`, `networkConfig`, `egressConfig`,
`mcpConfig`, `nixConfig`, and `preApprovedTools` (all present on the
gateway side). The worker's zod schema was patched with `.passthrough()`
to keep the extra fields from being stripped at parse time (PR #871
regression), but the static type silently lied — TS would have happily
accepted reads of e.g. `payload.preApprovedTools` as `undefined` even
when the gateway always populates it.

Move both `MessagePayload` and `JobType` (plus the worker-side
`QueuedMessage` envelope) into `packages/core/src/worker/wire.ts` and
re-export them from `@lobu/core`. The worker and gateway both import
from there; `queue-producer.ts` and `gateway/types.ts` keep a re-export
so the existing `from "./types"` / `from "../infrastructure/queue/
queue-producer"` paths continue to resolve without churn.

Type unification:
- `agentOptions: AgentOptions` (was `Record<string, any>` on the gateway,
  `AgentOptions` on the worker). The worker reads `.model`, `.allowedTools`,
  `.disallowedTools`, `.timeoutMinutes` directly, so `AgentOptions` is the
  honest shape. `Record<string, any>` is assignable into it via the
  `[key: string]: unknown` index signature.
- `platformMetadata: Record<string, unknown>` (was `Record<string, any>`
  on the gateway, a richer named interface on the worker). The unknown
  flavour forces three new `typeof === "string"` guards inside
  `base-deployment-manager.ts` where the env-var builder reads
  `originalMessageTs`, `botResponseTs`, and `teamId` off the bag.
- `teamId?: string` (was `string` on the gateway, `string | undefined`
  on the worker). The worker SSE zod schema marks it `.optional()`, and
  the worker has explicit fallback logic (`payload.teamId ??
  platformMetadata.teamId`). The gateway-side `buildMessagePayload`
  always sets a non-empty string, so concrete callers are unaffected.

Validation: `make build-packages` clean. `make typecheck` clean
(server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`.

* wip(spike): hoist session.jsonl parser to @lobu/core

Two HTTP surfaces parse the worker's `.openclaw/session.jsonl`:

- the worker's own `/session/messages` and `/session/stats` (rooted at
  `WORKSPACE_DIR`), and
- the gateway's `/session/messages` and `/session/stats` proxy fallback
  (rooted at `workspaces/<agentId>`, used when the worker is offline).

The gateway proxies to the worker when the worker is online and falls
back to its own copy otherwise — so the two parsers MUST agree on the
JSONL line shape. They had drifted. The worker used `safeJsonParse` (a
core util that debug-logs malformed lines); the gateway inlined a bare
`try { JSON.parse } catch {}`. The worker carried two extra
`SessionEntry` fields (`tokensBefore`, `firstKeptEntryId`) that no
parser actually read. The display-projection logic in `entryToMessage`
was duplicated character-for-character.

Move the shared parts (`SessionEntry`, `ParsedMessage`,
`parseSessionEntries`, `entryToMessage`) into
`packages/core/src/utils/session-file.ts` and have both call sites
import from `@lobu/core`. `safeJsonParse` wins as the canonical line
parser — it's already in core, and the debug-log on malformed lines is
strictly more useful than silently swallowing them.

What does NOT move: `findSessionFile`. The two call sites have
intentionally different path policies (worker scans its own
`WORKSPACE_DIR` one level deep; gateway scans `workspaces/<agentId>`
three levels deep with a `SAFE_AGENT_ID` regex guard against path
traversal). Collapsing them would force one side to inherit the other's
policy — not a behaviour change to make silently.

Dropped from the canonical `SessionEntry`: `tokensBefore` and
`firstKeptEntryId`. They were declared on the worker's local interface
but nothing reads them — only `agent-worker/src/__tests__/
memory-flush-runtime.test.ts` mentions them in a fixture. Reintroduce
when an actual consumer needs them.

Validation: `make build-packages` clean. `make typecheck` clean
(server + owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`.

* wip(spike): delete wire-type re-export shims, import from @lobu/core directly

Spike 8a5f686 hoisted `MessagePayload`/`JobType`/`QueuedMessage` into
`@lobu/core` but left three re-export shims so existing import paths kept
resolving. The user's global CLAUDE.md is explicit ("Do NOT add backwards-
compatibility shims, deprecated annotations, or fallback aliases"), and
codex flagged the leftover aliases on review. Delete them:

- `packages/server/src/gateway/infrastructure/queue/queue-producer.ts`
  drop `export type { JobType, MessagePayload } from "@lobu/core"`.
- `packages/server/src/gateway/orchestration/base-deployment-manager.ts`
  drop `import type { MessagePayload } from "../infrastructure/queue/
  queue-producer.js"` + the trailing `export type { MessagePayload }`
  alias; pull the type from `@lobu/core` instead.
- `packages/agent-worker/src/gateway/types.ts` shrinks to its only
  remaining concern: the worker-only `ResponseData = ThreadResponsePayload
  & { originalMessageId: string }`. The wire-contract types are gone.

Every importer migrated to `import type { MessagePayload, ... } from
"@lobu/core"`:
- worker: `sse-client.ts`, `message-batcher.ts`, `__tests__/message-
  batcher.test.ts`.
- gateway: `orchestration/{impl/embedded-deployment,message-consumer}.ts`,
  `services/platform-helpers.ts`, three orchestration tests
  (`embedded-deployment.test.ts`, `orchestration-harden.test.ts`,
  `base-deployment-grants.test.ts`).

Validation: `make build-packages` clean. `make typecheck` clean (server +
owletto). `bunx tsc --noEmit` clean inside `packages/agent-worker`.

* wip(spike): delete dead module-lifecycle and dispatcher surfaces

Spike 83a2cf4 deleted the worker call sites for `initModuleWorkspace`,
`onSessionStart`, and `collectModuleData` but left the interfaces and
no-op base implementations they targeted. Codex flagged the dead
surfaces on review; the rule from CLAUDE.md is "remove the old code
entirely instead of keeping it alongside the new code."

Verified zero callers anywhere in the monorepo before deleting:

  for sym in onSessionStart onSessionEnd initWorkspace \
             onBeforeResponse generateActionButtons handleAction \
             ActionButton ModuleSessionContext DispatcherModule \
             DispatcherContext getContainerAddress getWorkerModules; do
    grep -rn "\\b$sym\\b" packages/ scripts/ examples/ config/
  done

After this commit the only matches are `initWorkspaceProvider` in
`packages/server/src/workspace/` (different concept, unrelated). The
sibling `WorkerContext` symbol in `packages/server/src/gateway/routes/
internal/*.ts` is a Hono variable-context type; the deleted one was the
worker-module hook context and is removed only from `packages/core/src/
modules.ts`.

`BaseProviderModule` (the only `BaseModule` subclass — Claude / ChatGPT /
Bedrock / Gemini / API-key catalog) was verified to NOT override any of
the deleted methods (it overrides `isEnabled`, `buildEnvVars`,
`getModelOptions`, and the provider-specific surface only), so trimming
`BaseModule` is type-safe.

Deletions:
- `packages/core/src/modules.ts`: `WorkerContext`, `WorkerModule`,
  `ActionButton`, `ModuleSessionContext` interfaces;
  `IModuleRegistry.getWorkerModules` + `ModuleRegistry.getWorkerModules`
  method.
- `packages/core/src/index.ts`: `ActionButton` and `ModuleSessionContext`
  type re-exports.
- `packages/server/src/gateway/modules/module-system.ts`:
  `DispatcherContext`, `DispatcherModule` interfaces; the
  `WorkerModule<TModuleData>` and `DispatcherModule<TModuleData>` clauses
  on `BaseModule implements`; the `initWorkspace`, `onSessionStart`,
  `onSessionEnd`, `onBeforeResponse`, `generateActionButtons`,
  `handleAction`, `getContainerAddress` methods on `BaseModule`; and the
  `getContainerAddress(): string` member on the `OrchestratorModule`
  interface.

Validation: `make typecheck` clean (server + owletto). `bunx tsc
--noEmit` clean inside `packages/agent-worker`. `make build-packages`
clean.

* wip(spike): delete unused dynamic-import plugin path in moduleRegistry

`ModuleRegistry.registerAvailableModules(modulePackages?: string[])`
exists to dynamically `await import(packageName)` plugin packages at
startup. Both call sites pass no args — the loop body is unreachable:

  packages/agent-worker/src/index.ts:42  → `registerAvailableModules()`
  packages/server/src/gateway/services/core-services.ts:844 → same

Verified monorepo-wide:

  grep -rn "registerAvailableModules" packages/ scripts/ examples/ config/

Two hits, both no-arg. The method is also the only dynamic `import()` in
`@lobu/core`, which conflicts with the project memory rule "No dynamic
imports — always static `import`" (`feedback_no_dynamic_imports`).

Deletions:
- `packages/core/src/modules.ts`: `IModuleRegistry.registerAvailableModules`
  signature and `ModuleRegistry.registerAvailableModules` method (~40
  lines including the `@example` block).
- `packages/agent-worker/src/index.ts:42`: drop the no-arg call.
- `packages/server/src/gateway/services/core-services.ts:844`: drop the
  no-arg call; the comment above the remaining `initAll()` is updated to
  read "Initialize all registered modules" (registration now happens at
  the explicit `moduleRegistry.register(...)` sites earlier in
  `core-services.ts`).

Behaviour: zero change. Both code paths previously entered the function,
iterated an empty array, and returned. Removing them removes the only
dynamic-import in `@lobu/core` and drops one no-op step out of every
gateway/worker boot.

Validation: `make typecheck` clean (server + owletto). `bunx tsc --noEmit`
clean in `packages/agent-worker`. `make build-packages` clean.

* wip(spike): dedupe ModelOption and PrefillSkill; @lobu/core is the source

Two type duplicates that codex flagged on review:

1. `ModelOption` was declared in BOTH `packages/core/src/api-types.ts:38`
   AND `packages/server/src/gateway/modules/module-system.ts:5`. Same
   shape (`{label: string; value: string}`), declared twice. Every
   provider module imported the server-local copy; the core export was
   carried only by the public SDK surface.
2. `PrefillSkill` was declared in BOTH `packages/core/src/api-types.ts:94`
   AND `packages/server/src/gateway/auth/settings/token-service.ts:4`.
   Same shape. The token-service file also carried a local
   `PrefillMcpServer` interface that was structurally identical to core's
   `PrefillMcp` except for a narrowed `type: "sse" | "streamable-http" |
   "stdio"` vs core's `type?: string`. The narrowed type is declaration-
   only and never narrowed at runtime, so widening to core's shape is
   behaviour-neutral; the token route emits whatever string the operator
   passed in either way.

`@lobu/core` wins (it's the public API surface; deleting it would be a
breaking change for SDK consumers). The server-local duplicates go.

Migrated imports — every consumer now pulls `ModelOption` / `PrefillMcp`
/ `PrefillSkill` from `@lobu/core` directly:
- `packages/server/src/gateway/auth/gemini/cli-module.ts`
- `packages/server/src/gateway/auth/api-key-provider-module.ts`
- `packages/server/src/gateway/auth/claude/oauth-module.ts`
- `packages/server/src/gateway/auth/bedrock/provider-module.ts`
- `packages/server/src/gateway/auth/chatgpt/chatgpt-oauth-module.ts`
- `packages/server/src/gateway/auth/provider-model-options.ts`
- `packages/server/src/gateway/auth/settings/token-service.ts`
  (`prefillMcpServers?: PrefillMcp[]` — field name stays, element type
  pulled from core).
- `packages/server/src/gateway/routes/public/agent-config.ts`
- `packages/server/src/gateway/modules/module-system.ts`
  (`ModelOption` now re-imported from `@lobu/core` for the
  `OrchestratorModule` / `ModelProviderModule` interfaces in this file).

Verification:
  grep -rn "^export interface ModelOption\\b\\|^interface ModelOption\\b" \
         "^export interface PrefillSkill\\b\\|^interface PrefillSkill\\b" \
         "^export interface PrefillMcp\\b\\|^interface PrefillMcpServer\\b" \
       packages/ 2>/dev/null | grep -v "/dist/"
returns one match per symbol — all under `packages/core/src/api-types.ts`.

Validation: `make typecheck` clean (server + owletto). `bunx tsc --noEmit`
clean inside `packages/agent-worker`. `make build-packages` clean.

* wip(spike): drop unused TModuleData generic; fix stale wire-path comment

Two trivial leftovers from rounds 1–3:

1. The `<TModuleData>` type parameter on `ModuleInterface` (core),
   `OrchestratorModule` (server module-system), and `BaseModule` (server
   module-system) was a placeholder for the deleted `WorkerModule`
   surface (`onBeforeResponse(): Promise<TModuleData | null>`). With
   that surface gone in spike 9de939d the parameter has no semantic
   use — the core declaration already had it underscored
   (`_TModuleData`), and every instantiation site (`ModelProviderModule
   extends OrchestratorModule`, `OrchestratorModule[]`,
   `(m): m is OrchestratorModule` guard) was passing no arg. Removed
   the type parameter from all three declarations.

2. `packages/agent-worker/src/gateway/types.ts:5` carried a stale
   doc comment pointing imports at `@lobu/core/worker/wire`. That
   subpath isn't declared in `packages/core/package.json`'s `exports`
   field — only `.` is exported — so the path doesn't resolve.
   Every actual import in the tree already uses `@lobu/core` (the
   package root). Comment updated to match. No code import paths
   changed.

Verification (`grep -rn`, excluding `/dist/`):
  - `BaseModule<` / `OrchestratorModule<` / `ModuleInterface<` → zero
    matches.
  - `TModuleData` → zero matches.
  - `@lobu/core/worker/wire` → zero matches.

`make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean in
`packages/agent-worker`. `make build-packages` clean.

* wip(spike): delete dead moduleData wire plumbing; fix stale subpath comment

Two trivial leftovers from spike 9de939d (the module-lifecycle
deletion) and spike 9a401f5 (the session-file hoist):

1. `moduleData` was wire-plumbing for the deleted `collectModuleData()`
   path. The collector was removed in 9de939d, so nothing calls
   `setModuleData` anymore — `this.moduleData` is always `undefined`,
   and the two `sendResponse` sites that read it serialize `undefined`
   into every response. Verified monorepo-wide:

     grep -rn "\\bmoduleData\\b\\|setModuleData" packages/ scripts/ \
                                                 examples/ config/

   Five hits, all on the write side (declaration / setter / two reads
   of the always-undefined field / the wire-type slot). Zero readers
   on the gateway end. Deleted:
   - `packages/core/src/types.ts`: `moduleData?` field on
     `ThreadResponsePayload`.
   - `packages/core/src/worker/transport.ts`: `setModuleData` method
     on the `WorkerTransport` interface.
   - `packages/agent-worker/src/gateway/gateway-integration.ts`: the
     `private moduleData?` field, the `setModuleData` method, and the
     two `moduleData: this.moduleData` lines in `sendStreamDelta` /
     `signalCompletion`.

2. `packages/server/src/gateway/routes/public/agent-history.ts:75`
   pointed at `@lobu/core/utils/session-file`. That subpath isn't
   declared in `packages/core/package.json`'s `exports` — only `.` is
   exported. Every actual import in the tree uses `@lobu/core`. Comment
   updated to match reality.

`make typecheck` clean (server + owletto). `bunx tsc --noEmit` clean
in `packages/agent-worker`. `make build-packages` clean. Monorepo grep
for `moduleData` / `setModuleData` / `@lobu/core/utils/session-file`
returns zero matches.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants