lobu-ai · buremba · May 18, 2026 · May 18, 2026 · May 18, 2026
diff --git a/charts/lobu/templates/deployment.yaml b/charts/lobu/templates/deployment.yaml
@@ -15,9 +15,14 @@ spec:
        This is the operator opt-in path for true blue/green deploys.
     3. Else → Recreate (the safe default).
 
-  Why `allowMultiReplica` is an explicit flag, not auto-detected from
-  RWX: even with RWX storage, several in-memory components break with
-  >1 gateway replicas OR during the brief RollingUpdate overlap:
+  Phase 5 (workspaces PVC is OFF by default + LOBU_SESSION_STORE
+  defaults to snapshot mode) means most deploys naturally land on the
+  rolling-update branch as soon as `allowMultiReplica: true` is set.
+  The RWX check still applies for self-hosters who re-enable the PVC.
+
+  Why `allowMultiReplica` is an explicit flag, not auto-detected:
+  several in-memory components break with >1 gateway replicas OR
+  during the brief RollingUpdate overlap:
     * `SseManager` (gateway/services/sse-manager.ts) — SSE streams are
       pod-local; a job claimed by pod B broadcasts to no-one if the
       client is on pod A.
@@ -28,9 +33,10 @@ spec:
     * Telegram polling mode (gateway/connections/chat-instance-manager
       .ts:610-613) — every replica long-polls the same bot, causing
       conflicts.
-  RWX is necessary but not sufficient. The flag forces operators to
-  acknowledge "I have only webhook-mode Chat connections AND no
-  AskUser/SSE flows in flight" before opting in.
+  Snapshot-mode session state and the per-conversation advisory lock
+  are necessary but not sufficient. The flag forces operators to
+  acknowledge "I have only webhook-mode Chat connections AND accept
+  the SSE / AskUser handoff caveats" before opting in.
   */}}
   {{- $rwxConfigured := has "ReadWriteMany" (.Values.app.workspaces.accessModes | default (list)) }}
   {{- $rollSafe := and .Values.app.allowMultiReplica (or (not .Values.app.workspaces.enabled) $rwxConfigured) }}
@@ -139,12 +145,12 @@ spec:
           already serving, so deregistering the old pod via Service
           endpoint removal + giving downstream caches time to notice it
           shrinks the "old pod kept getting traffic during drain" window).
-          Under `Recreate` (the workspaces-RWO default), the new pod
-          doesn't start until the old one fully terminates — adding a
-          preStop sleep would EXTEND the no-available-server window by
-          its duration. We only emit the hook when preStopDelaySeconds
-          is explicitly > 0; ops repos using RollingUpdate set it,
-          Recreate-mode deploys leave it at the default 0.
+          Under `Recreate`, the new pod doesn't start until the old one
+          fully terminates — adding a preStop sleep would EXTEND the
+          no-available-server window by its duration. We only emit the
+          hook when preStopDelaySeconds is explicitly > 0; ops repos
+          using RollingUpdate set it, Recreate-mode deploys leave it at
+          the default 0.
           */ -}}
           {{- if gt (int (.Values.app.preStopDelaySeconds | default 0)) 0 }}
           lifecycle:

diff --git a/charts/lobu/values.yaml b/charts/lobu/values.yaml
@@ -40,12 +40,10 @@ app:
   # is going away, so in-flight requests during the deregistration lag
   # still hit a live process.
   #
-  # Default 0 (preStop hook NOT emitted). Set to ~15 only when running
-  # with `app.strategy.type: RollingUpdate` and replicaCount > 1.
-  # Under the default Recreate strategy (workspaces PVC is RWO), the
-  # new pod doesn't start until the old one terminates, so adding a
-  # preStop sleep would EXTEND the no-available-server window — the
-  # opposite of what we want.
+  # Default 0 (preStop hook NOT emitted). Set to ~15 when running with
+  # `app.strategy.type: RollingUpdate` and replicaCount > 1. The
+  # workspaces PVC is OFF by default in Phase 5 so the chart now picks
+  # RollingUpdate automatically when `allowMultiReplica: true`.
   preStopDelaySeconds: 0
   # Total time k8s waits for the pod to stop before SIGKILL. Must be
   # > preStopDelaySeconds + actual shutdown time. The gateway's graceful
@@ -67,17 +65,15 @@ app:
   # Opt-in to multi-replica / rolling deploys. DEFAULT FALSE — leaving
   # this off keeps the safe `strategy: Recreate` behavior. Setting it
   # true makes the chart pick `RollingUpdate` (maxSurge: 1,
-  # maxUnavailable: 0) when workspaces is RWX or disabled.
+  # maxUnavailable: 0) as long as `app.workspaces.enabled` is false (the
+  # Phase 5 default) or workspaces is RWX-configured.
   #
   # Prerequisites BEFORE setting true — chart cannot detect these:
-  #   1. Workspaces volume must be RWX-capable:
-  #      `app.workspaces.accessModes: [ReadWriteMany]` + a storage class
-  #      that backs it (NFS, EFS, CephFS, Longhorn-RWX, …).
-  #   2. NO active Chat connections in `mode: "polling"` (Telegram).
+  #   1. NO active Chat connections in `mode: "polling"` (Telegram).
   #      Multiple replicas long-polling the same bot conflict. Use
   #      webhook mode only — see
   #      gateway/connections/chat-instance-manager.ts:610.
-  #   3. Acknowledge that the gateway has in-memory state for SSE
+  #   2. Acknowledge that the gateway has in-memory state for SSE
   #      streams (gateway/services/sse-manager.ts) and AskUser
   #      questions (gateway/connections/interaction-bridge.ts:193).
   #      A request whose SSE stream / AskUser click lands on a
@@ -87,17 +83,28 @@ app:
   #      then, occasional dropped streams / button clicks are the cost
   #      of zero-downtime deploys on this configuration.
   #
+  # If you re-enable the workspaces PVC (legacy path), you must either
+  # provision RWX-capable storage (NFS / EFS / CephFS / Longhorn-RWX)
+  # or leave `allowMultiReplica: false` — RWO + multi-replica is
+  # rejected by the chart's strategy helper and falls back to Recreate.
+  #
   # Single-replica + RollingUpdate (replicaCount: 1, allowMultiReplica: true)
   # still creates a brief overlap window where both pods are running.
   # The same in-memory caveats apply during that window, just for a
   # shorter span (~5-15s).
   allowMultiReplica: false
 
-  # Persistent workspaces volume. Embedded agent workers store session and
-  # workspace state below /app/workspaces (watcher run scratch, agent panel
-  # sessions, etc.).
+  # Persistent workspaces volume. Embedded agent workers used to store
+  # session and workspace state below /app/workspaces (session.jsonl,
+  # input/output/temp scratch). Phase 5 moved session.jsonl to Postgres
+  # (`agent_transcript_snapshot`) and made input/output/temp pod-local
+  # ephemeral storage — a per-conversation advisory lock pins all turns
+  # of one conversation to a single pod for its run lifetime, so no PVC
+  # is required for correctness. Set `enabled: true` only if you have a
+  # specific reason to persist scratch files across pod restarts; the
+  # PVC blocks rolling deploys on the default RWO storage class.
   workspaces:
-    enabled: true
+    enabled: false
     size: 20Gi
     storageClass: ""
     accessModes:

diff --git a/packages/agent-worker/src/openclaw/transcript-snapshot.ts b/packages/agent-worker/src/openclaw/transcript-snapshot.ts
@@ -23,9 +23,9 @@
  *     hydrate, `POST /worker/transcript/snapshot` for write. (org, agent,
  *     conv) are pulled from the worker JWT on the gateway side, so the
  *     worker can't impersonate another conversation.
- *   - Selection is opt-in via `LOBU_SESSION_STORE=snapshot`. Default unset
- *     preserves today's file-only behavior. Phase 5 flips the default,
- *     Phase 6 drops the env var.
+ *   - Phase 5: snapshot mode is the default. `LOBU_SESSION_STORE=file`
+ *     opts out for legacy/local-dev single-replica deploys. Phase 6
+ *     drops the env var entirely.
  *
  * Trade-off accepted: a mid-run crash loses the partial transcript for that
  * run. The next attempt re-runs from the previous user message. Tools must
@@ -197,5 +197,42 @@ export async function writeSnapshot(
     logger.error(
       `Snapshot POST threw: ${err instanceof Error ? err.message : String(err)}`
     );
+    return;
+  }
+}
+
+/**
+ * Purge all snapshot rows for this worker's (org, agent, conv). Called
+ * by the session-reset path so the next boot doesn't rehydrate the
+ * conversation from Postgres after a `/new`. Idempotent — a 404 / empty
+ * result is treated as success.
+ *
+ * Failures are logged but not thrown — reset is best-effort; if the
+ * purge HTTP call fails the worst case is the next boot hydrates from
+ * the previous transcript (the legacy file-mode behaviour). The local
+ * session.jsonl unlink is the primary signal; this is the multi-replica
+ * complement to it.
+ */
+export async function clearSnapshots(
+  opts: Pick<TranscriptSnapshotOptions, "gatewayUrl" | "workerToken">
+): Promise<void> {
+  const url = `${opts.gatewayUrl}/worker/transcript/snapshot`;
+  try {
+    const res = await fetch(url, {
+      method: "DELETE",
+      headers: { Authorization: `Bearer ${opts.workerToken}` },
+      signal: AbortSignal.timeout(30_000),
+    });
+    if (!res.ok) {
+      logger.warn(
+        `Snapshot DELETE failed: ${res.status} ${res.statusText} — next boot may rehydrate stale history`
+      );
+      return;
+    }
+    logger.info("Purged conversation snapshots for session reset");
+  } catch (err) {
+    logger.warn(
+      `Snapshot DELETE threw: ${err instanceof Error ? err.message : String(err)} — next boot may rehydrate stale history`
+    );
   }
 }
diff --git a/packages/agent-worker/src/openclaw/worker.ts b/packages/agent-worker/src/openclaw/worker.ts
@@ -55,6 +55,7 @@ import {
 } from "./model-resolver";
 import { checkSandboxLeak } from "./sandbox-leak";
 import {
+  clearSnapshots,
   hydrateFromSnapshot,
   type TerminalStatus,
   writeSnapshot,
@@ -344,15 +345,19 @@ export class OpenClawWorker implements WorkerExecutor {
     // turns; throwing here surfaces it on the first turn and the runs
     // queue's retry path handles re-delivery. Codex round 2 quality
     // win D on PR #865.
-    if (process.env.LOBU_SESSION_STORE === "snapshot") {
+    //
+    // Phase 5: snapshot is the default; setting LOBU_SESSION_STORE=file
+    // opts out (legacy / local-dev path that keeps reading session.jsonl
+    // straight off disk without writing to Postgres).
+    if (process.env.LOBU_SESSION_STORE !== "file") {
       if (typeof this.config.runId !== "number") {
         throw new Error(
-          "LOBU_SESSION_STORE=snapshot but WorkerConfig.runId is missing — runs-queue dispatch did not stamp runId on the job payload"
+          "Snapshot mode (LOBU_SESSION_STORE != 'file') but WorkerConfig.runId is missing — runs-queue dispatch did not stamp runId on the job payload"
         );
       }
       if (!this.config.runJobToken) {
         throw new Error(
-          "LOBU_SESSION_STORE=snapshot but WorkerConfig.runJobToken is missing — MessageConsumer did not mint a per-run worker token"
+          "Snapshot mode (LOBU_SESSION_STORE != 'file') but WorkerConfig.runJobToken is missing — MessageConsumer did not mint a per-run worker token"
         );
       }
     }
@@ -607,8 +612,8 @@ export class OpenClawWorker implements WorkerExecutor {
     // (possibly on a different pod) can hydrate from it. Hydrate filters
     // `terminal_status='completed'`, so we ONLY POST on the success path
     // — writing `failed`/`timeout`/`cancelled` rows is pure network
-    // waste (codex round 2 quality win C on PR #865). Opt-in via
-    // LOBU_SESSION_STORE=snapshot.
+    // waste (codex round 2 quality win C on PR #865). Default-on in
+    // Phase 5; LOBU_SESSION_STORE=file opts out for legacy/local-dev.
     //
     // The runs queue has already moved this run to a terminal state by
     // the time cleanup() fires (sse-client.ts:865 finally block runs
@@ -617,7 +622,7 @@ export class OpenClawWorker implements WorkerExecutor {
     // released when the subprocess exits, so by the next claim's boot
     // this snapshot is the visible "latest" row.
     if (
-      process.env.LOBU_SESSION_STORE === "snapshot" &&
+      process.env.LOBU_SESSION_STORE !== "file" &&
       this.sessionFilePath &&
       this.terminalStatus === "completed"
     ) {
@@ -957,9 +962,9 @@ export class OpenClawWorker implements WorkerExecutor {
 
     const sessionFile = path.join(workspaceDir, ".openclaw", "session.jsonl");
     // Capture for cleanup() — it reads the file back to write the snapshot
-    // at terminal time. Set unconditionally so non-snapshot deployments
-    // still get a defined value (snapshot writer no-ops when LOBU_SESSION_STORE
-    // is unset).
+    // at terminal time. Set unconditionally so file-mode opt-outs
+    // still get a defined value (snapshot writer no-ops when
+    // LOBU_SESSION_STORE=file).
     this.sessionFilePath = sessionFile;
     const providerStateFile = path.join(
       workspaceDir,
@@ -968,9 +973,10 @@ export class OpenClawWorker implements WorkerExecutor {
     );
 
     // Hydrate from the latest completed Postgres snapshot BEFORE the
-    // provider-state check or SessionManager.open(). Opt-in via
-    // LOBU_SESSION_STORE=snapshot — default unset preserves today's
-    // file-only behavior. Phase 5 flips the default.
+    // provider-state check or SessionManager.open(). Phase 5: snapshot
+    // mode is the default; LOBU_SESSION_STORE=file opts out and keeps
+    // the legacy file-only behaviour for local-dev / single-replica
+    // self-hosters.
     //
     // Order matters: hydrate → provider check (may unlink) →
     // SessionManager.open(). The provider-change unlink at line ~925 still
@@ -980,7 +986,7 @@ export class OpenClawWorker implements WorkerExecutor {
     // PG rows remain readable without poisoning the new conversation
     // (hydrate would only resurrect them if a subsequent run completes
     // successfully and overwrites the latest pointer).
-    if (process.env.LOBU_SESSION_STORE === "snapshot") {
+    if (process.env.LOBU_SESSION_STORE !== "file") {
       const gatewayUrl = process.env.DISPATCHER_URL;
       const workerToken = process.env.WORKER_TOKEN;
       if (gatewayUrl && workerToken) {
@@ -1000,7 +1006,7 @@ export class OpenClawWorker implements WorkerExecutor {
         }
       } else {
         logger.warn(
-          "LOBU_SESSION_STORE=snapshot set but DISPATCHER_URL or WORKER_TOKEN missing; snapshot disabled"
+          "Snapshot mode active (LOBU_SESSION_STORE != 'file') but DISPATCHER_URL or WORKER_TOKEN missing; snapshot disabled"
         );
       }
     }
@@ -1545,6 +1551,21 @@ Use it when the user references past discussions or you need context.`);
           // File may not exist
         }
 
+        // Also purge the Postgres snapshots for this (org, agent, conv)
+        // — in snapshot mode (the Phase 5 default) the next worker boot
+        // would otherwise rehydrate from the now-flushed conversation
+        // and the user-visible "Starting fresh" would be a lie. Best-
+        // effort: a failure here is logged but doesn't block the reset
+        // since the local unlink already happened and the snapshot
+        // helper is a no-op in file mode.
+        if (process.env.LOBU_SESSION_STORE !== "file") {
+          const gatewayUrl = process.env.DISPATCHER_URL;
+          const workerToken = process.env.WORKER_TOKEN;
+          if (gatewayUrl && workerToken) {
+            await clearSnapshots({ gatewayUrl, workerToken });
+          }
+        }
+
         // Send visible confirmation to user
         await onProgress({
           type: "output",