Merged
30 changes: 18 additions & 12 deletions charts/lobu/templates/deployment.yaml
@@ -15,9 +15,14 @@ spec:
This is the operator opt-in path for true blue/green deploys.
3. Else → Recreate (the safe default).

- Why `allowMultiReplica` is an explicit flag, not auto-detected from
- RWX: even with RWX storage, several in-memory components break with
- >1 gateway replicas OR during the brief RollingUpdate overlap:
+ Phase 5 (workspaces PVC is OFF by default + LOBU_SESSION_STORE
+ defaults to snapshot mode) means most deploys naturally land on the
+ rolling-update branch as soon as `allowMultiReplica: true` is set.
+ The RWX check still applies for self-hosters who re-enable the PVC.
+
+ Why `allowMultiReplica` is an explicit flag, not auto-detected:
+ several in-memory components break with >1 gateway replicas OR
+ during the brief RollingUpdate overlap:
* `SseManager` (gateway/services/sse-manager.ts) — SSE streams are
pod-local; a job claimed by pod B broadcasts to no-one if the
client is on pod A.
@@ -28,9 +33,10 @@ spec:
* Telegram polling mode (gateway/connections/chat-instance-manager
.ts:610-613) — every replica long-polls the same bot, causing
conflicts.
- RWX is necessary but not sufficient. The flag forces operators to
- acknowledge "I have only webhook-mode Chat connections AND no
- AskUser/SSE flows in flight" before opting in.
+ Snapshot-mode session state and the per-conversation advisory lock
+ are necessary but not sufficient. The flag forces operators to
+ acknowledge "I have only webhook-mode Chat connections AND accept
+ the SSE / AskUser handoff caveats" before opting in.
*/}}
{{- $rwxConfigured := has "ReadWriteMany" (.Values.app.workspaces.accessModes | default (list)) }}
{{- $rollSafe := and .Values.app.allowMultiReplica (or (not .Values.app.workspaces.enabled) $rwxConfigured) }}
@@ -139,12 +145,12 @@ spec:
already serving, so deregistering the old pod via Service
endpoint removal + giving downstream caches time to notice it
shrinks the "old pod kept getting traffic during drain" window).
- Under `Recreate` (the workspaces-RWO default), the new pod
- doesn't start until the old one fully terminates — adding a
- preStop sleep would EXTEND the no-available-server window by
- its duration. We only emit the hook when preStopDelaySeconds
- is explicitly > 0; ops repos using RollingUpdate set it,
- Recreate-mode deploys leave it at the default 0.
+ Under `Recreate`, the new pod doesn't start until the old one
+ fully terminates — adding a preStop sleep would EXTEND the
+ no-available-server window by its duration. We only emit the
+ hook when preStopDelaySeconds is explicitly > 0; ops repos
+ using RollingUpdate set it, Recreate-mode deploys leave it at
+ the default 0.
*/ -}}
{{- if gt (int (.Values.app.preStopDelaySeconds | default 0)) 0 }}
lifecycle:
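
The `$rwxConfigured` / `$rollSafe` expressions in the first hunk can be read as a small boolean function. The following is a minimal TypeScript sketch of that reading, not the chart's code: `pickStrategy` and the `AppValues` shape are hypothetical names, and the sketch ignores any branch of the strategy helper that the hunk does not show (for example the first numbered case in the comment).

```typescript
// Hypothetical mirror of the template's $rollSafe logic. Field names follow
// the values referenced in the template; the function itself is illustrative.
interface AppValues {
  allowMultiReplica?: boolean;
  workspaces: { enabled: boolean; accessModes?: string[] };
}

type StrategyType = "RollingUpdate" | "Recreate";

function pickStrategy(app: AppValues): StrategyType {
  // Same idea as: has "ReadWriteMany" (accessModes | default (list))
  const rwxConfigured = (app.workspaces.accessModes ?? []).includes("ReadWriteMany");
  // Same idea as: and allowMultiReplica (or (not workspaces.enabled) rwxConfigured)
  const rollSafe =
    Boolean(app.allowMultiReplica) && (!app.workspaces.enabled || rwxConfigured);
  return rollSafe ? "RollingUpdate" : "Recreate";
}
```

With the Phase 5 default (`workspaces.enabled: false`), the only input that changes the outcome is `allowMultiReplica`; the access-mode check matters only when the PVC is re-enabled.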
39 changes: 23 additions & 16 deletions charts/lobu/values.yaml
@@ -40,12 +40,10 @@ app:
# is going away, so in-flight requests during the deregistration lag
# still hit a live process.
#
- # Default 0 (preStop hook NOT emitted). Set to ~15 only when running
- # with `app.strategy.type: RollingUpdate` and replicaCount > 1.
- # Under the default Recreate strategy (workspaces PVC is RWO), the
- # new pod doesn't start until the old one terminates, so adding a
- # preStop sleep would EXTEND the no-available-server window — the
- # opposite of what we want.
+ # Default 0 (preStop hook NOT emitted). Set to ~15 when running with
+ # `app.strategy.type: RollingUpdate` and replicaCount > 1. The
+ # workspaces PVC is OFF by default in Phase 5 so the chart now picks
+ # RollingUpdate automatically when `allowMultiReplica: true`.
preStopDelaySeconds: 0
# Total time k8s waits for the pod to stop before SIGKILL. Must be
# > preStopDelaySeconds + actual shutdown time. The gateway's graceful
@@ -67,17 +65,15 @@ app:
# Opt-in to multi-replica / rolling deploys. DEFAULT FALSE — leaving
# this off keeps the safe `strategy: Recreate` behavior. Setting it
# true makes the chart pick `RollingUpdate` (maxSurge: 1,
- # maxUnavailable: 0) when workspaces is RWX or disabled.
+ # maxUnavailable: 0) as long as `app.workspaces.enabled` is false (the
+ # Phase 5 default) or workspaces is RWX-configured.
#
# Prerequisites BEFORE setting true — chart cannot detect these:
- # 1. Workspaces volume must be RWX-capable:
- # `app.workspaces.accessModes: [ReadWriteMany]` + a storage class
- # that backs it (NFS, EFS, CephFS, Longhorn-RWX, …).
- # 2. NO active Chat connections in `mode: "polling"` (Telegram).
+ # 1. NO active Chat connections in `mode: "polling"` (Telegram).
# Multiple replicas long-polling the same bot conflict. Use
# webhook mode only — see
# gateway/connections/chat-instance-manager.ts:610.
- # 3. Acknowledge that the gateway has in-memory state for SSE
+ # 2. Acknowledge that the gateway has in-memory state for SSE
# streams (gateway/services/sse-manager.ts) and AskUser
# questions (gateway/connections/interaction-bridge.ts:193).
# A request whose SSE stream / AskUser click lands on a
@@ -87,17 +83,28 @@ app:
# then, occasional dropped streams / button clicks are the cost
# of zero-downtime deploys on this configuration.
#
+ # If you re-enable the workspaces PVC (legacy path), you must either
+ # provision RWX-capable storage (NFS / EFS / CephFS / Longhorn-RWX)
+ # or leave `allowMultiReplica: false` — RWO + multi-replica is
+ # rejected by the chart's strategy helper and falls back to Recreate.
+ #
# Single-replica + RollingUpdate (replicaCount: 1, allowMultiReplica: true)
# still creates a brief overlap window where both pods are running.
# The same in-memory caveats apply during that window, just for a
# shorter span (~5-15s).
allowMultiReplica: false

- # Persistent workspaces volume. Embedded agent workers store session and
- # workspace state below /app/workspaces (watcher run scratch, agent panel
- # sessions, etc.).
+ # Persistent workspaces volume. Embedded agent workers used to store
+ # session and workspace state below /app/workspaces (session.jsonl,
+ # input/output/temp scratch). Phase 5 moved session.jsonl to Postgres
+ # (`agent_transcript_snapshot`) and made input/output/temp pod-local
+ # ephemeral storage — a per-conversation advisory lock pins all turns
+ # of one conversation to a single pod for its run lifetime, so no PVC
+ # is required for correctness. Set `enabled: true` only if you have a
+ # specific reason to persist scratch files across pod restarts; the
+ # PVC blocks rolling deploys on the default RWO storage class.
workspaces:
- enabled: true
+ enabled: false
size: 20Gi
storageClass: ""
accessModes:
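
The values.yaml comments list prerequisites that the chart cannot detect on its own. A deploy pipeline could approximate them with a preflight step; the sketch below is one possible shape in TypeScript, under stated assumptions. `multiReplicaWarnings`, `ChatConnection`, and `MultiReplicaInputs` are hypothetical names, and how connection modes are actually listed is not shown in this diff.

```typescript
// Hypothetical preflight for `allowMultiReplica: true`. Nothing here exists
// in the chart or the gateway; the input shape is an assumption.
interface ChatConnection {
  name: string;
  mode: "polling" | "webhook";
}

interface MultiReplicaInputs {
  allowMultiReplica: boolean;
  workspacesEnabled: boolean;
  accessModes: string[];
  connections: ChatConnection[];
}

function multiReplicaWarnings(input: MultiReplicaInputs): string[] {
  // With the flag off the chart stays on Recreate, so there is nothing to check.
  if (!input.allowMultiReplica) return [];

  const warnings: string[] = [];
  for (const conn of input.connections) {
    if (conn.mode === "polling") {
      // Prerequisite 1: multiple replicas long-polling the same bot conflict.
      warnings.push(`connection "${conn.name}" uses polling mode; switch it to webhook mode`);
    }
  }
  if (input.workspacesEnabled && !input.accessModes.includes("ReadWriteMany")) {
    // Legacy PVC path: RWO plus the flag falls back to Recreate.
    warnings.push("workspaces PVC is enabled without ReadWriteMany; the chart will use Recreate");
  }
  return warnings;
}
```

The SSE and AskUser caveat is not checkable this way; it remains something the operator acknowledges when setting the flag.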
43 changes: 40 additions & 3 deletions packages/agent-worker/src/openclaw/transcript-snapshot.ts
@@ -23,9 +23,9 @@
* hydrate, `POST /worker/transcript/snapshot` for write. (org, agent,
* conv) are pulled from the worker JWT on the gateway side, so the
* worker can't impersonate another conversation.
- * - Selection is opt-in via `LOBU_SESSION_STORE=snapshot`. Default unset
- * preserves today's file-only behavior. Phase 5 flips the default,
- * Phase 6 drops the env var.
+ * - Phase 5: snapshot mode is the default. `LOBU_SESSION_STORE=file`
+ * opts out for legacy/local-dev single-replica deploys. Phase 6
+ * drops the env var entirely.
*
* Trade-off accepted: a mid-run crash loses the partial transcript for that
* run. The next attempt re-runs from the previous user message. Tools must
@@ -197,5 +197,42 @@ export async function writeSnapshot(
logger.error(
`Snapshot POST threw: ${err instanceof Error ? err.message : String(err)}`
);
+ return;
}
}
+
+ /**
+ * Purge all snapshot rows for this worker's (org, agent, conv). Called
+ * by the session-reset path so the next boot doesn't rehydrate the
+ * conversation from Postgres after a `/new`. Idempotent — a 404 / empty
+ * result is treated as success.
+ *
+ * Failures are logged but not thrown — reset is best-effort; if the
+ * purge HTTP call fails the worst case is the next boot hydrates from
+ * the previous transcript (the legacy file-mode behaviour). The local
+ * session.jsonl unlink is the primary signal; this is the multi-replica
+ * complement to it.
+ */
+ export async function clearSnapshots(
+ opts: Pick<TranscriptSnapshotOptions, "gatewayUrl" | "workerToken">
+ ): Promise<void> {
+ const url = `${opts.gatewayUrl}/worker/transcript/snapshot`;
+ try {
+ const res = await fetch(url, {
+ method: "DELETE",
+ headers: { Authorization: `Bearer ${opts.workerToken}` },
+ signal: AbortSignal.timeout(30_000),
+ });
+ if (!res.ok) {
+ logger.warn(
+ `Snapshot DELETE failed: ${res.status} ${res.statusText} — next boot may rehydrate stale history`
+ );
+ return;
+ }
+ logger.info("Purged conversation snapshots for session reset");
+ } catch (err) {
+ logger.warn(
+ `Snapshot DELETE threw: ${err instanceof Error ? err.message : String(err)} — next boot may rehydrate stale history`
+ );
+ }
+ }
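
The default flip described in the header comment changes the comparison from `=== "snapshot"` to `!== "file"`. A small sketch of what that means for each possible value of the variable follows. `resolveSessionStore` is a hypothetical helper, not something this PR adds; the worker inlines the comparison at each call site.

```typescript
// Hypothetical helper illustrating the Phase 5 semantics of LOBU_SESSION_STORE.
type SessionStore = "snapshot" | "file";

function resolveSessionStore(env: Record<string, string | undefined>): SessionStore {
  // Only the exact string "file" opts out. Unset, empty, "snapshot", or any
  // other value (including a typo such as "File") selects snapshot mode.
  return env.LOBU_SESSION_STORE === "file" ? "file" : "snapshot";
}
```

One consequence worth noting: before this change a misspelled value silently kept file mode, whereas after it a misspelled value silently selects snapshot mode.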
49 changes: 35 additions & 14 deletions packages/agent-worker/src/openclaw/worker.ts
@@ -55,6 +55,7 @@ import {
} from "./model-resolver";
import { checkSandboxLeak } from "./sandbox-leak";
import {
+ clearSnapshots,
hydrateFromSnapshot,
type TerminalStatus,
writeSnapshot,
@@ -344,15 +345,19 @@ export class OpenClawWorker implements WorkerExecutor {
// turns; throwing here surfaces it on the first turn and the runs
// queue's retry path handles re-delivery. Codex round 2 quality
// win D on PR #865.
- if (process.env.LOBU_SESSION_STORE === "snapshot") {
+ //
+ // Phase 5: snapshot is the default; setting LOBU_SESSION_STORE=file
+ // opts out (legacy / local-dev path that keeps reading session.jsonl
+ // straight off disk without writing to Postgres).
+ if (process.env.LOBU_SESSION_STORE !== "file") {
if (typeof this.config.runId !== "number") {
throw new Error(
- "LOBU_SESSION_STORE=snapshot but WorkerConfig.runId is missing — runs-queue dispatch did not stamp runId on the job payload"
+ "Snapshot mode (LOBU_SESSION_STORE != 'file') but WorkerConfig.runId is missing — runs-queue dispatch did not stamp runId on the job payload"
);
}
if (!this.config.runJobToken) {
throw new Error(
- "LOBU_SESSION_STORE=snapshot but WorkerConfig.runJobToken is missing — MessageConsumer did not mint a per-run worker token"
+ "Snapshot mode (LOBU_SESSION_STORE != 'file') but WorkerConfig.runJobToken is missing — MessageConsumer did not mint a per-run worker token"
);
}
}
@@ -607,8 +612,8 @@ export class OpenClawWorker implements WorkerExecutor {
// (possibly on a different pod) can hydrate from it. Hydrate filters
// `terminal_status='completed'`, so we ONLY POST on the success path
// — writing `failed`/`timeout`/`cancelled` rows is pure network
- // waste (codex round 2 quality win C on PR #865). Opt-in via
- // LOBU_SESSION_STORE=snapshot.
+ // waste (codex round 2 quality win C on PR #865). Default-on in
+ // Phase 5; LOBU_SESSION_STORE=file opts out for legacy/local-dev.
//
// The runs queue has already moved this run to a terminal state by
// the time cleanup() fires (sse-client.ts:865 finally block runs
@@ -617,7 +622,7 @@ export class OpenClawWorker implements WorkerExecutor {
// released when the subprocess exits, so by the next claim's boot
// this snapshot is the visible "latest" row.
if (
- process.env.LOBU_SESSION_STORE === "snapshot" &&
+ process.env.LOBU_SESSION_STORE !== "file" &&
this.sessionFilePath &&
this.terminalStatus === "completed"
) {
@@ -957,9 +962,9 @@ export class OpenClawWorker implements WorkerExecutor {

const sessionFile = path.join(workspaceDir, ".openclaw", "session.jsonl");
// Capture for cleanup() — it reads the file back to write the snapshot
- // at terminal time. Set unconditionally so non-snapshot deployments
- // still get a defined value (snapshot writer no-ops when LOBU_SESSION_STORE
- // is unset).
+ // at terminal time. Set unconditionally so file-mode opt-outs
+ // still get a defined value (snapshot writer no-ops when
+ // LOBU_SESSION_STORE=file).
this.sessionFilePath = sessionFile;
const providerStateFile = path.join(
workspaceDir,
@@ -968,9 +973,10 @@ export class OpenClawWorker implements WorkerExecutor {
);

// Hydrate from the latest completed Postgres snapshot BEFORE the
- // provider-state check or SessionManager.open(). Opt-in via
- // LOBU_SESSION_STORE=snapshot — default unset preserves today's
- // file-only behavior. Phase 5 flips the default.
+ // provider-state check or SessionManager.open(). Phase 5: snapshot
+ // mode is the default; LOBU_SESSION_STORE=file opts out and keeps
+ // the legacy file-only behaviour for local-dev / single-replica
+ // self-hosters.
//
// Order matters: hydrate → provider check (may unlink) →
// SessionManager.open(). The provider-change unlink at line ~925 still
@@ -980,7 +986,7 @@ export class OpenClawWorker implements WorkerExecutor {
// PG rows remain readable without poisoning the new conversation
// (hydrate would only resurrect them if a subsequent run completes
// successfully and overwrites the latest pointer).
- if (process.env.LOBU_SESSION_STORE === "snapshot") {
+ if (process.env.LOBU_SESSION_STORE !== "file") {
const gatewayUrl = process.env.DISPATCHER_URL;
const workerToken = process.env.WORKER_TOKEN;
if (gatewayUrl && workerToken) {
@@ -1000,7 +1006,7 @@ export class OpenClawWorker implements WorkerExecutor {
}
} else {
logger.warn(
- "LOBU_SESSION_STORE=snapshot set but DISPATCHER_URL or WORKER_TOKEN missing; snapshot disabled"
+ "Snapshot mode active (LOBU_SESSION_STORE != 'file') but DISPATCHER_URL or WORKER_TOKEN missing; snapshot disabled"
);
}
}
@@ -1545,6 +1551,21 @@ Use it when the user references past discussions or you need context.`);
// File may not exist
}

+ // Also purge the Postgres snapshots for this (org, agent, conv)
+ // — in snapshot mode (the Phase 5 default) the next worker boot
+ // would otherwise rehydrate from the now-flushed conversation
+ // and the user-visible "Starting fresh" would be a lie. Best-
+ // effort: a failure here is logged but doesn't block the reset
+ // since the local unlink already happened and the snapshot
+ // helper is a no-op in file mode.
+ if (process.env.LOBU_SESSION_STORE !== "file") {
+ const gatewayUrl = process.env.DISPATCHER_URL;
+ const workerToken = process.env.WORKER_TOKEN;
+ if (gatewayUrl && workerToken) {
+ await clearSnapshots({ gatewayUrl, workerToken });
+ }
+ }
+
// Send visible confirmation to user
await onProgress({
type: "output",
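
The reset path in the last hunk runs up to three steps in a fixed order: unlink the local session file, purge the Postgres snapshots when snapshot mode is active and the gateway credentials are present, then confirm to the user. The sketch below captures only that decision as a pure function. `planReset` and the step names are hypothetical and exist purely for illustration; the worker performs the steps inline.

```typescript
// Hypothetical planning function for the session-reset path. It decides which
// steps would run and in what order; it performs no I/O itself.
type ResetStep = "unlink-session-file" | "purge-snapshots" | "confirm-to-user";

function planReset(env: Record<string, string | undefined>): ResetStep[] {
  // The local unlink always comes first; it is the primary reset signal.
  const steps: ResetStep[] = ["unlink-session-file"];

  // The purge is skipped in file mode, and also when either credential is
  // missing, mirroring the guard around clearSnapshots in the hunk above.
  const snapshotMode = env.LOBU_SESSION_STORE !== "file";
  if (snapshotMode && env.DISPATCHER_URL && env.WORKER_TOKEN) {
    steps.push("purge-snapshots");
  }

  // The confirmation is sent regardless, because the purge is best-effort.
  steps.push("confirm-to-user");
  return steps;
}
```

Note that in snapshot mode with missing credentials the purge is skipped without a log line in the shown hunk, so the next boot could still rehydrate the old transcript if the hydrate path does have credentials.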