feat(chart): auto-pick RollingUpdate when workspaces is RWX #776

Merged: buremba merged 2 commits into main from feat/rolling-deploy-support on May 16, 2026

@buremba buremba commented May 16, 2026

Why

Close the last structural gap from the post-incident audit (#775 review): under `strategy: Recreate` (the workspaces-RWO default), every rollout has a ~30s "no available server" window between old-pod-terminated and new-pod-ready. Cloudflare returns the same page the 2026-05-16 outage produced, just briefly.

The actual constraint: the workspaces PVC is RWO, so two pods can't mount it simultaneously and Recreate is mandatory there. But operators using RWX storage (NFS, EFS, CephFS, Longhorn-RWX, Portworx, …) can run multiple replicas safely; the chart just wasn't auto-detecting that.

What

The chart now resolves `strategy` as:

  1. Explicit `app.strategy` override wins.
  2. Else: workspaces enabled AND `accessModes` does NOT include `ReadWriteMany` → `Recreate` (backward compatible).
  3. Else (workspaces RWX or disabled) → `RollingUpdate` with `maxSurge: 1, maxUnavailable: 0`. With `replicaCount > 1` this is true blue/green: the new pod must reach Ready before the old terminates, so there's zero window where Service has no endpoints.

Operator opt-in

```yaml
app:
  replicaCount: 2
  preStopDelaySeconds: 15 # already added in lobu#775
  workspaces:
    accessModes: [ReadWriteMany]
    storageClass: "" # set to an RWX-capable class (NFS / EFS / CephFS / Longhorn-RWX / ...)
```

That's the whole change on the ops side.

Safety: concurrent pods on RWX

`/app/workspaces` paths are keyed by (agent, run) tuples — checked the prod volume contents directly:

```
/app/workspaces/marketing/marketing_watcher_218_run_141460
/app/workspaces/marketing/marketing_web-66ebf172_17fefa2b-0ff
/app/workspaces/marketing/marketing_web-ca015184_agent-panel
```

Each watcher run / agent panel session has a unique directory. The runs queue uses `FOR UPDATE SKIP LOCKED` for claims so no two pods own the same run. Concurrent gateway pods on RWX are structurally safe.
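The claim pattern mentioned above can be sketched roughly as follows. This is an illustration only: the `runs` table, its `status`, `claimed_by`, and `created_at` columns, and the `Queryable` / `claimNextRun` names are all assumptions, since the real schema and claim code are not shown in this PR.

```typescript
// Hypothetical schema: runs(id, status, claimed_by, created_at).
// SKIP LOCKED makes a concurrent claimer skip rows another transaction
// has already locked instead of blocking on them.
const CLAIM_RUN_SQL = `
  UPDATE runs
     SET status = 'running', claimed_by = $1
   WHERE id = (
     SELECT id
       FROM runs
      WHERE status = 'queued'
      ORDER BY created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
   )
  RETURNING id`;

// Minimal query interface (an assumption) so the sketch does not
// depend on any particular database client.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rows: Array<{ id: string }> }>;
}

// Returns the claimed run id, or null when nothing is claimable.
async function claimNextRun(db: Queryable, podName: string): Promise<string | null> {
  const result = await db.query(CLAIM_RUN_SQL, [podName]);
  const row = result.rows[0];
  return row ? row.id : null;
}
```

Note that this only guarantees that a single run has a single owner. It says nothing about two different runs writing into the same directory.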

Validation

  • `helm lint charts/lobu` clean
  • Template tests:
    • Defaults (RWO) → `strategy: Recreate` ✓
    • `accessModes: [ReadWriteMany]` + `replicaCount: 2` → `strategy: RollingUpdate {maxSurge: 1, maxUnavailable: 0}` ✓
    • `workspaces.enabled: false` → `strategy: RollingUpdate` ✓
  • Backward compatible: existing deployments using defaults keep their current Recreate behavior

Not in this PR

  • Ops-side rollout — switching prod's actual values file to enable RWX is a separate ops-repo PR (requires provisioning an RWX storage class, e.g. EFS on AWS or a Longhorn-RWX volume in your Hetzner cluster). The chart now supports it; the storage decision is yours.

Summary by CodeRabbit

  • Refactor

    • Improved deployment rollout selection to safer defaults for multi-replica and workspace storage scenarios.
  • Documentation

    • Expanded guidance on workspace storage requirements, the multi-replica opt-in behavior, and operational caveats (replica overlap, streaming/state impacts).
  • Chores

    • Updated web component snapshot to a newer upstream revision.

coderabbitai Bot commented May 16, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between c113bd4 and da47ebe.

📒 Files selected for processing (3)
  • charts/lobu/templates/deployment.yaml
  • charts/lobu/values.yaml
  • packages/web

📝 Walkthrough

Helm chart changes adjust Deployment strategy selection to detect RWX-capable workspaces and gate RollingUpdate behind a new app.allowMultiReplica flag; values.yaml docs expanded and a packages/web submodule pointer advanced.

Changes

RWX Storage and Deployment Strategy

| Layer / File(s) | Summary |
|---|---|
| Strategy selection logic (`charts/lobu/templates/deployment.yaml`) | Replaces workspace-enabled-only branching with RWX detection via `$rwxConfigured` and an `app.allowMultiReplica`-gated `$rollSafe`; precedence: explicit `app.strategy` → RollingUpdate (when `rollSafe`, `maxSurge: 1`, `maxUnavailable: 0`) → Recreate. |
| Values and workspace documentation (`charts/lobu/values.yaml`) | Expands `app.allowMultiReplica` documentation describing when the chart switches to RollingUpdate, prerequisites for RWX volumes, Telegram `mode: "polling"` caution, and `/app/workspaces` state description. |

Misc updates

| Layer / File(s) | Summary |
|---|---|
| Web submodule bump (`packages/web`) | Advance Git submodule recorded commit SHA for `packages/web`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through charts at break of day,
Searched volumes that whisper "RWX, hooray!",
Flags set, replicas may briefly meet,
A careful rollout, surge kept neat,
Then back to fields — the cluster sleeps in hay.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | PR description covers the context, implementation approach, operator opt-in instructions, and validation. However, the template sections (Summary, Test plan, Notes) are not explicitly followed in the provided description. | Align description with repository template: add explicit 'Summary' section, check off completed test plan items, and add a 'Notes' section linking issue #775 and mentioning the follow-up work. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | Title accurately describes the main change: introducing RollingUpdate strategy selection when workspaces use RWX storage, which is the core improvement. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad1068db40


Comment on lines +37 to +40
type: RollingUpdate
rollingUpdate:
  maxSurge: 1
  maxUnavailable: 0

P1: Keep Recreate until worker workspaces are isolated

When app.workspaces.accessModes contains ReadWriteMany and operators follow the new values guidance to set replicaCount: 2+, this branch allows two app pods to serve concurrently against the same mounted /app/workspaces PVC. The embedded worker is not isolated by run: EmbeddedDeploymentManager still uses workspaces/${agentId} as the worker cwd (packages/server/src/gateway/orchestration/impl/embedded-deployment.ts:324), and the worker writes shared files like .openclaw/session.jsonl and clears the shared output directory on startup (packages/agent-worker/src/openclaw/worker.ts:868, :1595). Two pods handling different runs for the same agent can therefore race and corrupt session/artifact state; the old Recreate path avoided that overlap even on RWX storage.
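The collision described above can be illustrated with a small sketch. `agentScopedCwd` mirrors the `workspaces/${agentId}` layout quoted in the review. `runScopedCwd` and `sessionFile` are hypothetical helpers added only for contrast, not code from the repository or a proposed fix.

```typescript
// Layout quoted in the review: one working directory per agent.
function agentScopedCwd(agentId: string): string {
  return `workspaces/${agentId}`;
}

// Hypothetical alternative for contrast: one directory per (agent, run).
function runScopedCwd(agentId: string, runId: string): string {
  return `workspaces/${agentId}/${runId}`;
}

// The worker's shared session file lives under its cwd.
function sessionFile(cwd: string): string {
  return `${cwd}/.openclaw/session.jsonl`;
}

// Two different runs of the same agent, possibly on two different pods.
const runOnPodA = sessionFile(agentScopedCwd("marketing"));
const runOnPodB = sessionFile(agentScopedCwd("marketing"));

// Same path on the shared RWX volume, so concurrent writes can interleave.
const collides = runOnPodA === runOnPodB;

// With a run-scoped layout the two paths would differ.
const isolated =
  sessionFile(runScopedCwd("marketing", "run-1")) !==
  sessionFile(runScopedCwd("marketing", "run-2"));
```

Under `Recreate` the two pods never overlap, which is why the agent-scoped layout was not a problem before this change.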


@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.


buremba added a commit that referenced this pull request May 16, 2026
Pi review on #776 found three real correctness issues that would break
under multi-replica or even brief RollingUpdate overlap:

* **SseManager is pod-local** (gateway/services/sse-manager.ts). If
  pod A holds the SSE stream and pod B claims the job, broadcast goes
  to no-one — client sees the request hang.
* **AskUser question routing is pod-local**
  (gateway/connections/interaction-bridge.ts:193-214). Question
  posted by pod A; click webhook lands on pod B; pod B's local
  `claimQuestion(id)` returns undefined and the click is dropped.
* **Telegram polling mode is incompatible with multi-replica**
  (gateway/connections/chat-instance-manager.ts:610). Every replica
  starts its own long-poller on the same bot.

RWX storage is necessary but NOT sufficient. The previous version of
this PR auto-switched to RollingUpdate when it detected RWX
accessModes, which would silently introduce these bugs for any
operator who configured shared storage.

Fix: gate the strategy switch behind an explicit
`app.allowMultiReplica: false` (default off) flag. Setting it true
documents that the operator has:
  1. RWX workspaces storage
  2. No Telegram polling-mode connections
  3. Acknowledged the SSE / AskUser caveats (silently dropped
     responses during the overlap window).

Without those preconditions, Recreate is the safe default and stays
that way for every existing deploy. Operators who flip the flag get
RollingUpdate with maxSurge:1, maxUnavailable:0.

values.yaml documents each prerequisite inline so operators reading
the comments before flipping the flag see the full picture.

Also bumps web submodule to current main (drift fix).
@buremba buremba force-pushed the feat/rolling-deploy-support branch from 44b31d2 to c113bd4 Compare May 16, 2026 22:22

buremba commented May 16, 2026

pi review — addressed (changed approach)

Pi caught three real correctness issues that would have made the auto-switch unsafe:

  1. SSE streams are pod-local (`gateway/services/sse-manager.ts`). Pod B claiming a job broadcasts to its empty SseManager when the client is connected to pod A → request hangs.
  2. AskUser question routing is pod-local (`gateway/connections/interaction-bridge.ts:193`). Button click webhook lands on a different pod than the one that posted the question → click dropped.
  3. Telegram polling mode is incompatible with multi-replica (`gateway/connections/chat-instance-manager.ts:610`). Every replica long-polls the same bot, causing conflicts.

RWX storage is necessary but not sufficient. The previous auto-switch on RWX detection would silently introduce these bugs for any operator who configured shared storage.
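To make the first issue concrete, here is a toy model of a pod-local stream registry. `LocalSseRegistry` is a hypothetical stand-in and not the real `SseManager`; it only assumes that each pod keeps its open streams in process memory.

```typescript
type Send = (event: string) => void;

// Hypothetical stand-in for a pod-local SSE registry: every pod has its
// own instance, and nothing is shared between instances.
class LocalSseRegistry {
  private streams: Record<string, Send[]> = {};

  subscribe(jobId: string, send: Send): void {
    const list = this.streams[jobId] ?? [];
    list.push(send);
    this.streams[jobId] = list;
  }

  // Returns how many connected clients received the event.
  broadcast(jobId: string, event: string): number {
    const list = this.streams[jobId] ?? [];
    for (const send of list) {
      send(event);
    }
    return list.length;
  }
}

const podA = new LocalSseRegistry();
const podB = new LocalSseRegistry();

// The client's SSE connection landed on pod A.
const received: string[] = [];
podA.subscribe("job-1", (event) => {
  received.push(event);
});

// Pod B claimed the job, so it is pod B that broadcasts the result.
const deliveredByB = podB.broadcast("job-1", "done"); // 0: nobody is listening on pod B
const deliveredByA = podA.broadcast("job-1", "done"); // 1: only pod A knows the client
```

The AskUser case has the same shape: the pending question is registered on one pod and the click arrives on another.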

Fix in c113bd4

Gated behind an explicit `app.allowMultiReplica` flag (default `false`). Strategy now resolves as:

  1. Explicit `app.strategy` override wins.
  2. Else if `allowMultiReplica: true` AND workspaces is RWX (or disabled) → RollingUpdate.
  3. Else → Recreate (safe default, unchanged for everyone).

`values.yaml` documents each prerequisite inline so operators reading the comments before flipping the flag see the full picture (RWX storage, no Telegram polling, accept the SSE/AskUser caveats).

Matrix tests:

| Configuration | Strategy |
|---|---|
| defaults | Recreate |
| allowMultiReplica=true, RWO workspaces | Recreate (safety override) |
| allowMultiReplica=true, RWX workspaces | RollingUpdate(maxSurge:1, maxUnavailable:0) |
| allowMultiReplica=true, workspaces disabled | RollingUpdate(maxSurge:1, maxUnavailable:0) |
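The precedence and matrix above can be modelled in a few lines. This is a TypeScript sketch of the decision only, not the chart's actual Helm template. The `AppValues` shape and the `resolveStrategy` name are assumptions, and `rwxConfigured` / `rollSafe` borrow the variable names from the CodeRabbit walkthrough without claiming to match their exact definitions.

```typescript
type Strategy =
  | { type: "Recreate" }
  | { type: "RollingUpdate"; rollingUpdate: { maxSurge: number; maxUnavailable: number } };

// Assumed shape of the relevant `app.*` values.
interface AppValues {
  strategy?: Strategy; // explicit app.strategy override
  allowMultiReplica?: boolean; // default off
  workspaces: { enabled: boolean; accessModes: string[] };
}

function resolveStrategy(app: AppValues): Strategy {
  // 1. An explicit override always wins.
  if (app.strategy) {
    return app.strategy;
  }
  // 2. RollingUpdate only when the operator opted in AND storage allows overlap.
  const rwxConfigured =
    app.workspaces.enabled && app.workspaces.accessModes.indexOf("ReadWriteMany") !== -1;
  const rollSafe =
    app.allowMultiReplica === true && (rwxConfigured || !app.workspaces.enabled);
  if (rollSafe) {
    return { type: "RollingUpdate", rollingUpdate: { maxSurge: 1, maxUnavailable: 0 } };
  }
  // 3. Everything else keeps the safe default.
  return { type: "Recreate" };
}
```

Calling `resolveStrategy` with `allowMultiReplica: true` and RWO workspaces still yields `Recreate`, which corresponds to the "safety override" row.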

Pre-existing CI noise

  • check-drift: was failing because web submodule was behind; bumped to current main in same commit.
  • frontend / integration: failing on main too (pre-existing flakes — frontend missing /embedded smoke entry, integration hits Postgres connection exhaustion). Not introduced by this PR.

Followup tracked

Migrating SSE delivery, AskUser routing, and Telegram polling coordination to durable / distributed-safe implementations is the path to making multi-replica usable without the caveats. Out of scope here.

buremba added 2 commits May 16, 2026 23:29
Close the structural deploy gap from lobu#775's review: under
`strategy: Recreate` (the workspaces-RWO default), every rollout has
a ~30s "no available server" window between old-pod-terminated and
new-pod-ready. Cloudflare returns the same "no available server" page
the 2026-05-16 outage produced, just briefly.

The chart now resolves strategy as:
  1. Explicit `app.strategy` override wins.
  2. Else: workspaces enabled AND accessModes does NOT include
     ReadWriteMany → Recreate (RWO PVC can't be mounted by two pods,
     RollingUpdate would deadlock at PVC attach). Backward compatible
     with current ops setups.
  3. Else (workspaces RWX or disabled) → RollingUpdate with
     maxSurge: 1, maxUnavailable: 0. With replicaCount > 1 this is
     true blue/green: the new pod must reach Ready before the old
     terminates, so there's zero window where Service has no
     endpoints. With replicaCount: 1 there's still an overlap
     during which the new pod starts and becomes ready, but no gap.

To opt in to blue/green:
  app.workspaces.accessModes: [ReadWriteMany]
  app.workspaces.storageClass: "<rwx-class>"   # NFS / EFS / CephFS / Longhorn-RWX / ...
  app.replicaCount: 2
  app.preStopDelaySeconds: 15   # already added in lobu#775; opt-in
                                # makes drain happen before SIGTERM

Concurrent gateway pods on RWX are safe: /app/workspaces paths are
keyed by (agent, run) tuples and the runs queue uses
FOR UPDATE SKIP LOCKED for claims, so no two pods own the same run.

`helm lint charts/lobu` and template-rendering tests for the three
matrix combinations (default RWO, RWX, workspaces disabled) all
produce the expected strategy and lifecycle blocks.
@buremba buremba force-pushed the feat/rolling-deploy-support branch from c113bd4 to da47ebe Compare May 16, 2026 22:29
@buremba buremba merged commit e98e1ea into main May 16, 2026
4 checks passed
@buremba buremba deleted the feat/rolling-deploy-support branch May 16, 2026 22:30