feat(chart): auto-pick RollingUpdate when workspaces is RWX by buremba · Pull Request #776 · lobu-ai/lobu

buremba · 2026-05-16T22:14:09Z

Why

Close the last structural gap from the post-incident audit (#775 review): under `strategy: Recreate` (the workspaces-RWO default), every rollout has a ~30s "no available server" window between old-pod-terminated and new-pod-ready. Cloudflare returns the same page the 2026-05-16 outage produced, just briefly.

The actual constraint: the workspaces PVC is RWO, so two pods can't mount it simultaneously and Recreate is mandatory there. But operators using RWX storage (NFS, EFS, CephFS, Longhorn-RWX, Portworx, …) can run multiple replicas safely; the chart just wasn't auto-detecting that.

What

The chart now resolves `strategy` as:

1. Explicit `app.strategy` override wins.
2. Else: workspaces enabled AND `accessModes` does NOT include `ReadWriteMany` → `Recreate` (backward compatible).
3. Else (workspaces RWX or disabled) → `RollingUpdate` with `maxSurge: 1, maxUnavailable: 0`. With `replicaCount > 1` this is true blue/green: the new pod must reach Ready before the old terminates, so there's zero window where Service has no endpoints.
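
To make the no-gap claim concrete, here is a minimal sketch of the pod-count bounds a Deployment keeps during a RollingUpdate. `rolloutBounds` and its types are hypothetical helpers, not part of the chart, and the sketch assumes integer `maxSurge` / `maxUnavailable` values (Kubernetes also accepts percentages, which are ignored here).

```typescript
// Hypothetical helper: pod-count bounds during a RollingUpdate (integer settings only).
interface RollingUpdateSettings {
  replicas: number;
  maxSurge: number;
  maxUnavailable: number;
}

interface RolloutBounds {
  minAvailable: number; // fewest available pods the controller tolerates mid-rollout
  maxTotal: number; // most pods (old + new) that may exist at once
}

function rolloutBounds(s: RollingUpdateSettings): RolloutBounds {
  return {
    minAvailable: Math.max(0, s.replicas - s.maxUnavailable),
    maxTotal: s.replicas + s.maxSurge,
  };
}

// The settings the chart emits, for one and for two replicas.
const singleReplica = rolloutBounds({ replicas: 1, maxSurge: 1, maxUnavailable: 0 });
const twoReplicas = rolloutBounds({ replicas: 2, maxSurge: 1, maxUnavailable: 0 });
```

With `maxUnavailable: 0` the lower bound equals the replica count, so even a single-replica rollout keeps one available pod while a second one starts. That brief overlap is exactly why the workspaces volume must tolerate two pods at once.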

Operator opt-in

```yaml
app:
  replicaCount: 2
  preStopDelaySeconds: 15 # already added in lobu#775
  workspaces:
    accessModes: [ReadWriteMany]
    storageClass: ""
```

That's the whole change on the ops side.

Safety: concurrent pods on RWX

`/app/workspaces` paths are keyed by (agent, run) tuples — checked the prod volume contents directly:

```
/app/workspaces/marketing/marketing_watcher_218_run_141460
/app/workspaces/marketing/marketing_web-66ebf172_17fefa2b-0ff
/app/workspaces/marketing/marketing_web-ca015184_agent-panel
```

Each watcher run / agent panel session has a unique directory. The runs queue uses `FOR UPDATE SKIP LOCKED` for claims so no two pods own the same run. Concurrent gateway pods on RWX are structurally safe.
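
The claim guarantee can be illustrated with a small in-memory model. This is only a sketch of the semantics of `FOR UPDATE SKIP LOCKED` (a claimer skips rows that another claimer already holds). The `Run` shape, `claimNextRun`, and the pod names are hypothetical and not taken from the lobu codebase, where the real claim would happen inside a Postgres transaction.

```typescript
// In-memory model of a SKIP LOCKED style claim. All names are hypothetical.
interface Run {
  id: number;
  claimedBy: string | null;
}

// Returns the first unclaimed run and marks it as owned by `podId`,
// skipping rows another pod already holds; null when nothing is left.
function claimNextRun(queue: Run[], podId: string): Run | null {
  for (const run of queue) {
    if (run.claimedBy === null) {
      run.claimedBy = podId;
      return run;
    }
  }
  return null;
}

const runQueue: Run[] = [
  { id: 141460, claimedBy: null },
  { id: 141461, claimedBy: null },
];

const claimedByPodA = claimNextRun(runQueue, "pod-a"); // takes the first run
const claimedByPodB = claimNextRun(runQueue, "pod-b"); // skips it, takes the second
const claimedByPodC = claimNextRun(runQueue, "pod-c"); // nothing left to claim
```

The model covers run ownership only: it shows that two pods never hold the same run, not that two different runs write to disjoint files.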

Validation

- `helm lint charts/lobu` clean
- Template tests:
  - Defaults (RWO) → `strategy: Recreate` ✓
  - `accessModes: [ReadWriteMany]` + `replicaCount: 2` → `strategy: RollingUpdate {maxSurge: 1, maxUnavailable: 0}` ✓
  - `workspaces.enabled: false` → `strategy: RollingUpdate` ✓
- Backward compatible: existing deployments using defaults keep their current Recreate behavior

Not in this PR

Ops-side rollout — switching prod's actual values file to enable RWX is a separate ops-repo PR (requires provisioning an RWX storage class, e.g. EFS on AWS or a Longhorn-RWX volume in your Hetzner cluster). The chart now supports it; the storage decision is yours.

Summary by CodeRabbit

Refactor
- Improved deployment rollout selection to safer defaults for multi-replica and workspace storage scenarios.
Documentation
- Expanded guidance on workspace storage requirements, the multi-replica opt-in behavior, and operational caveats (replica overlap, streaming/state impacts).
Chores
- Updated web component snapshot to a newer upstream revision.

coderabbitai · 2026-05-16T22:14:19Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between c113bd4 and da47ebe.

📒 Files selected for processing (3)

charts/lobu/templates/deployment.yaml
charts/lobu/values.yaml
packages/web

📝 Walkthrough

Helm chart changes adjust Deployment strategy selection to detect RWX-capable workspaces and gate RollingUpdate behind a new app.allowMultiReplica flag; values.yaml docs expanded and a packages/web submodule pointer advanced.

Changes

RWX Storage and Deployment Strategy

| Layer / File(s) | Summary |
| --- | --- |
| Strategy selection logic `charts/lobu/templates/deployment.yaml` | Replaces workspace-enabled-only branching with RWX detection via `$rwxConfigured` and an `app.allowMultiReplica`-gated `$rollSafe`; precedence: explicit `app.strategy` → `RollingUpdate` (when rollSafe, `maxSurge: 1`, `maxUnavailable: 0`) → `Recreate`. |
| Values and workspace documentation `charts/lobu/values.yaml` | Expands `app.allowMultiReplica` documentation describing when the chart switches to `RollingUpdate`, prerequisites for RWX volumes, Telegram `mode: "polling"` caution, and `/app/workspaces` state description. |

Misc updates

| Layer / File(s) | Summary |
| --- | --- |
| Web submodule bump `packages/web` | Advance Git submodule recorded commit SHA for `packages/web`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through charts at break of day,
Searched volumes that whisper "RWX, hooray!",
Flags set, replicas may briefly meet,
A careful rollout, surge kept neat,
Then back to fields — the cluster sleeps in hay.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | PR description covers the context, implementation approach, operator opt-in instructions, and validation. However, the template sections (Summary, Test plan, Notes) are not explicitly followed in the provided description. | Align description with repository template: add explicit 'Summary' section, check off completed test plan items, and add a 'Notes' section linking issue `#775` and mentioning the follow-up work. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title accurately describes the main change: introducing RollingUpdate strategy selection when workspaces use RWX storage, which is the core improvement. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad1068db40

chatgpt-codex-connector · 2026-05-16T22:15:47Z

+    type: RollingUpdate
+    rollingUpdate:
+      maxSurge: 1
+      maxUnavailable: 0


Keep Recreate until worker workspaces are isolated

When app.workspaces.accessModes contains ReadWriteMany and operators follow the new values guidance to set replicaCount: 2+, this branch allows two app pods to serve concurrently against the same mounted /app/workspaces PVC. The embedded worker is not isolated by run: EmbeddedDeploymentManager still uses workspaces/${agentId} as the worker cwd (packages/server/src/gateway/orchestration/impl/embedded-deployment.ts:324), and the worker writes shared files like .openclaw/session.jsonl and clears the shared output directory on startup (packages/agent-worker/src/openclaw/worker.ts:868, :1595). Two pods handling different runs for the same agent can therefore race and corrupt session/artifact state; the old Recreate path avoided that overlap even on RWX storage.
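
A minimal sketch of the collision described above. `sharedCwd` mirrors the `workspaces/${agentId}` layout quoted in the review; `perRunCwd` is a hypothetical alternative shown only for contrast and is not something this PR implements. The run identifiers are made up.

```typescript
// Two runs of the same agent, possibly handled by different pods.
interface RunRef {
  agentId: string;
  runId: string;
}

// Layout quoted in the review: the worker cwd depends on the agent only.
function sharedCwd(run: RunRef): string {
  return `workspaces/${run.agentId}`;
}

// Hypothetical per-run layout, for contrast.
function perRunCwd(run: RunRef): string {
  return `workspaces/${run.agentId}/${run.runId}`;
}

const runOnPodA: RunRef = { agentId: "marketing", runId: "run_1" };
const runOnPodB: RunRef = { agentId: "marketing", runId: "run_2" };

// Both runs resolve to the same directory, so files such as
// `.openclaw/session.jsonl` would be shared between them.
const sharedLayoutCollides = sharedCwd(runOnPodA) === sharedCwd(runOnPodB);
const perRunLayoutIsolated = perRunCwd(runOnPodA) !== perRunCwd(runOnPodB);
```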

codecov-commenter · 2026-05-16T22:16:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Pi review on #776 found three real correctness issues that would break under multi-replica or even brief RollingUpdate overlap:

* **SseManager is pod-local** (gateway/services/sse-manager.ts). If pod A holds the SSE stream and pod B claims the job, broadcast goes to no-one; the client sees the request hang.
* **AskUser question routing is pod-local** (gateway/connections/interaction-bridge.ts:193-214). Question posted by pod A; click webhook lands on pod B; pod B's local `claimQuestion(id)` returns undefined and the click is dropped.
* **Telegram polling mode is incompatible with multi-replica** (gateway/connections/chat-instance-manager.ts:610). Every replica starts its own long-poller on the same bot.

RWX storage is necessary but NOT sufficient. The previous version of this PR auto-switched to RollingUpdate when it detected RWX accessModes, which would silently introduce these bugs for any operator who configured shared storage.

Fix: gate the strategy switch behind an explicit `app.allowMultiReplica: false` (default off) flag. Setting it true documents that the operator has:

1. RWX workspaces storage
2. No Telegram polling-mode connections
3. Acknowledged the SSE / AskUser caveats (silently dropped responses during the overlap window).

Without those preconditions, Recreate is the safe default and stays that way for every existing deploy. Operators who flip the flag get RollingUpdate with maxSurge:1, maxUnavailable:0.

values.yaml documents each prerequisite inline so operators reading the comments before flipping the flag see the full picture.

Also bumps web submodule to current main (drift fix).

buremba · 2026-05-16T22:22:55Z

pi review — addressed (changed approach)

Pi caught three real correctness issues that would have made the auto-switch unsafe:

- SSE streams are pod-local (`gateway/services/sse-manager.ts`). Pod B claiming a job broadcasts to its empty SseManager when the client is connected to pod A → request hangs.
- AskUser question routing is pod-local (`gateway/connections/interaction-bridge.ts:193`). Button click webhook lands on a different pod than the one that posted the question → click dropped.
- Telegram polling mode is incompatible with multi-replica (`gateway/connections/chat-instance-manager.ts:610`). Every replica long-polls the same bot, causing conflicts.

RWX storage is necessary but not sufficient. The previous auto-switch on RWX detection would silently introduce these bugs for any operator who configured shared storage.
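
The first two failure modes come down to state held in one pod's memory. The toy model below sketches that under stated assumptions: `PodLocalSse` and `PodLocalQuestions` are hypothetical stand-ins that loosely follow the roles described above, not the real `SseManager` or interaction-bridge code.

```typescript
// Toy model of pod-local state. Class and method names are hypothetical.
class PodLocalSse {
  private streams = new Map<string, string[]>(); // clientId -> events delivered

  connect(clientId: string): void {
    this.streams.set(clientId, []);
  }

  // Delivers only to streams held by THIS pod; returns whether anything was delivered.
  broadcast(clientId: string, event: string): boolean {
    const stream = this.streams.get(clientId);
    if (stream === undefined) return false;
    stream.push(event);
    return true;
  }
}

class PodLocalQuestions {
  private pending = new Map<string, string>(); // questionId -> question text

  post(id: string, text: string): void {
    this.pending.set(id, text);
  }

  // Returns undefined when this pod never saw the question.
  claimQuestion(id: string): string | undefined {
    const text = this.pending.get(id);
    this.pending.delete(id);
    return text;
  }
}

const podA = { sse: new PodLocalSse(), questions: new PodLocalQuestions() };
const podB = { sse: new PodLocalSse(), questions: new PodLocalQuestions() };

podA.sse.connect("client-1"); // the client's stream is held by pod A
const deliveredFromB = podB.sse.broadcast("client-1", "job-done"); // pod B claimed the job

podA.questions.post("q-1", "Deploy now?"); // pod A posted the question
const claimedOnB = podB.questions.claimQuestion("q-1"); // click webhook lands on pod B
const claimedOnA = podA.questions.claimQuestion("q-1"); // same click on the posting pod
```

In this model the broadcast from pod B reaches nobody and the click routed to pod B finds no question, which matches the hang and the dropped click described above. Any fix would need the lookup to go through something both pods can see.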

Fix in c113bd4

Gated behind an explicit `app.allowMultiReplica: false` (default off) flag. Strategy now resolves as:

1. Explicit `app.strategy` override wins.
2. Else if `allowMultiReplica: true` AND workspaces is RWX (or disabled) → RollingUpdate.
3. Else → Recreate (safe default, unchanged for everyone).

`values.yaml` documents each prerequisite inline so operators reading the comments before flipping the flag see the full picture (RWX storage, no Telegram polling, accept the SSE/AskUser caveats).

Matrix tests:

| Configuration | Strategy |
| --- | --- |
| defaults | Recreate |
| allowMultiReplica=true, RWO workspaces | Recreate (safety override) |
| allowMultiReplica=true, RWX workspaces | RollingUpdate(maxSurge:1, maxUnavailable:0) |
| allowMultiReplica=true, workspaces disabled | RollingUpdate(maxSurge:1, maxUnavailable:0) |
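
The precedence and matrix above can be restated as a small pure function. This is a sketch of the decision logic only, not the Helm template itself: `resolveStrategy`, `AppValues`, and `makeValues` are hypothetical names, and the defaults are assumed to be workspaces enabled with `ReadWriteOnce`.

```typescript
// Sketch of the gated precedence; types and names are hypothetical.
type Strategy =
  | { type: "Recreate" }
  | { type: "RollingUpdate"; maxSurge: number; maxUnavailable: number };

interface AppValues {
  strategy?: Strategy; // explicit override
  allowMultiReplica: boolean;
  workspaces: { enabled: boolean; accessModes: string[] };
}

function resolveStrategy(app: AppValues): Strategy {
  if (app.strategy !== undefined) return app.strategy; // 1. explicit override wins
  const rwxOrDisabled =
    !app.workspaces.enabled ||
    app.workspaces.accessModes.indexOf("ReadWriteMany") !== -1;
  if (app.allowMultiReplica && rwxOrDisabled) {
    // 2. operator opted in and the volume tolerates two pods
    return { type: "RollingUpdate", maxSurge: 1, maxUnavailable: 0 };
  }
  return { type: "Recreate" }; // 3. safe default
}

// Small builder so each matrix row stays on one line.
function makeValues(allow: boolean, enabled: boolean, modes: string[]): AppValues {
  return { allowMultiReplica: allow, workspaces: { enabled, accessModes: modes } };
}

// Same four rows as the matrix, in the same order.
const matrixResults: string[] = [
  resolveStrategy(makeValues(false, true, ["ReadWriteOnce"])).type,
  resolveStrategy(makeValues(true, true, ["ReadWriteOnce"])).type,
  resolveStrategy(makeValues(true, true, ["ReadWriteMany"])).type,
  resolveStrategy(makeValues(true, false, ["ReadWriteOnce"])).type,
];
```

Under these assumptions the four rows resolve to Recreate, Recreate, RollingUpdate, RollingUpdate. The second row is the safety override: the flag alone is not enough when the volume is RWO.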

Pre-existing CI noise

- check-drift: was failing because web submodule was behind; bumped to current main in same commit.
- frontend / integration: failing on main too (pre-existing flakes: frontend missing /embedded smoke entry, integration hits Postgres connection exhaustion). Not introduced by this PR.

Followup tracked

Migrating SSE delivery, AskUser routing, and Telegram polling coordination to durable / distributed-safe implementations is the path to making multi-replica usable without the caveats. Out of scope here.

Close the structural deploy gap from lobu#775's review: under `strategy: Recreate` (the workspaces-RWO default), every rollout has a ~30s "no available server" window between old-pod-terminated and new-pod-ready. Cloudflare returns the same "no available server" page the 2026-05-16 outage produced, just briefly.

The chart now resolves strategy as:

1. Explicit `app.strategy` override wins.
2. Else: workspaces enabled AND accessModes does NOT include ReadWriteMany → Recreate (RWO PVC can't be mounted by two pods, RollingUpdate would deadlock at PVC attach). Backward compatible with current ops setups.
3. Else (workspaces RWX or disabled) → RollingUpdate with maxSurge: 1, maxUnavailable: 0. With replicaCount > 1 this is true blue/green: the new pod must reach Ready before the old terminates, so there's zero window where Service has no endpoints. With replicaCount: 1 there's still an overlap during which the new pod starts and becomes ready, but no gap.

To opt in to blue/green:

    app.workspaces.accessModes: [ReadWriteMany]
    app.workspaces.storageClass: "<rwx-class>"   # NFS / EFS / CephFS / Longhorn-RWX / ...
    app.replicaCount: 2
    app.preStopDelaySeconds: 15                  # already added in lobu#775; opt-in
                                                 # makes drain happen before SIGTERM

Concurrent gateway pods on RWX are safe: /app/workspaces paths are keyed by (agent, run) tuples and the runs queue uses FOR UPDATE SKIP LOCKED for claims, so no two pods own the same run.

`helm lint charts/lobu` and template-rendering tests for the three matrix combinations (default RWO, RWX, workspaces disabled) all produce the expected strategy and lifecycle blocks.

chatgpt-codex-connector Bot reviewed May 16, 2026

buremba force-pushed the feat/rolling-deploy-support branch from 44b31d2 to c113bd4 Compare May 16, 2026 22:22

buremba added 2 commits May 16, 2026 23:29

buremba force-pushed the feat/rolling-deploy-support branch from c113bd4 to da47ebe Compare May 16, 2026 22:29

buremba merged commit e98e1ea into main May 16, 2026
4 checks passed

buremba deleted the feat/rolling-deploy-support branch May 16, 2026 22:30

buremba mentioned this pull request May 16, 2026

chore(main): release lobu 7.1.0 #724

Merged

This was referenced May 18, 2026

feat(server,chart): flip snapshot default + drop workspaces PVC (Phase 5) #871

Merged

chore(chart): align lobu chart to owletto fork (pre-consolidation, byte-identical) #882

Merged
