feat(chart): auto-pick RollingUpdate when workspaces is RWX#776
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughHelm chart changes adjust Deployment strategy selection to detect RWX-capable workspaces and gate RollingUpdate behind a new app.allowMultiReplica flag; values.yaml docs expanded and a ChangesRWX Storage and Deployment Strategy
Misc updates
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad1068db40
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| type: RollingUpdate | ||
| rollingUpdate: | ||
| maxSurge: 1 | ||
| maxUnavailable: 0 |
There was a problem hiding this comment.
Keep Recreate until worker workspaces are isolated
When app.workspaces.accessModes contains ReadWriteMany and operators follow the new values guidance to set replicaCount: 2+, this branch allows two app pods to serve concurrently against the same mounted /app/workspaces PVC. The embedded worker is not isolated by run: EmbeddedDeploymentManager still uses workspaces/${agentId} as the worker cwd (packages/server/src/gateway/orchestration/impl/embedded-deployment.ts:324), and the worker writes shared files like .openclaw/session.jsonl and clears the shared output directory on startup (packages/agent-worker/src/openclaw/worker.ts:868, :1595). Two pods handling different runs for the same agent can therefore race and corrupt session/artifact state; the old Recreate path avoided that overlap even on RWX storage.
Useful? React with 👍 / 👎.
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Pi review on #776 found three real correctness issues that would break under multi-replica or even brief RollingUpdate overlap: * **SseManager is pod-local** (gateway/services/sse-manager.ts). If pod A holds the SSE stream and pod B claims the job, broadcast goes to no-one — client sees the request hang. * **AskUser question routing is pod-local** (gateway/connections/interaction-bridge.ts:193-214). Question posted by pod A; click webhook lands on pod B; pod B's local `claimQuestion(id)` returns undefined and the click is dropped. * **Telegram polling mode is incompatible with multi-replica** (gateway/connections/chat-instance-manager.ts:610). Every replica starts its own long-poller on the same bot. RWX storage is necessary but NOT sufficient. The previous version of this PR auto-switched to RollingUpdate when it detected RWX accessModes, which would silently introduce these bugs for any operator who configured shared storage. Fix: gate the strategy switch behind an explicit `app.allowMultiReplica: false` (default off) flag. Setting it true documents that the operator has: 1. RWX workspaces storage 2. No Telegram polling-mode connections 3. Acknowledged the SSE / AskUser caveats (silently dropped responses during the overlap window). Without those preconditions, Recreate is the safe default and stays that way for every existing deploy. Operators who flip the flag get RollingUpdate with maxSurge:1, maxUnavailable:0. values.yaml documents each prerequisite inline so operators reading the comments before flipping the flag see the full picture. Also bumps web submodule to current main (drift fix).
Pi review on #776 found three real correctness issues that would break under multi-replica or even brief RollingUpdate overlap: * **SseManager is pod-local** (gateway/services/sse-manager.ts). If pod A holds the SSE stream and pod B claims the job, broadcast goes to no-one — client sees the request hang. * **AskUser question routing is pod-local** (gateway/connections/interaction-bridge.ts:193-214). Question posted by pod A; click webhook lands on pod B; pod B's local `claimQuestion(id)` returns undefined and the click is dropped. * **Telegram polling mode is incompatible with multi-replica** (gateway/connections/chat-instance-manager.ts:610). Every replica starts its own long-poller on the same bot. RWX storage is necessary but NOT sufficient. The previous version of this PR auto-switched to RollingUpdate when it detected RWX accessModes, which would silently introduce these bugs for any operator who configured shared storage. Fix: gate the strategy switch behind an explicit `app.allowMultiReplica: false` (default off) flag. Setting it true documents that the operator has: 1. RWX workspaces storage 2. No Telegram polling-mode connections 3. Acknowledged the SSE / AskUser caveats (silently dropped responses during the overlap window). Without those preconditions, Recreate is the safe default and stays that way for every existing deploy. Operators who flip the flag get RollingUpdate with maxSurge:1, maxUnavailable:0. values.yaml documents each prerequisite inline so operators reading the comments before flipping the flag see the full picture. Also bumps web submodule to current main (drift fix).
44b31d2 to
c113bd4
Compare
pi review — addressed (changed approach)Pi caught three real correctness issues that would have made the auto-switch unsafe:
RWX storage is necessary but not sufficient. The previous auto-switch on RWX detection would silently introduce these bugs for any operator who configured shared storage. Fix in c113bd4Gated behind an explicit `app.allowMultiReplica: false` (default off) flag. Strategy now resolves as:
`values.yaml` documents each prerequisite inline so operators reading the comments before flipping the flag see the full picture (RWX storage, no Telegram polling, accept the SSE/AskUser caveats). Matrix tests:
Pre-existing CI noise
Followup trackedMigrating SSE delivery, AskUser routing, and Telegram polling coordination to durable / distributed-safe implementations is the path to making multi-replica usable without the caveats. Out of scope here. |
Close the structural deploy gap from lobu#775's review: under
`strategy: Recreate` (the workspaces-RWO default), every rollout has
a ~30s "no available server" window between old-pod-terminated and
new-pod-ready. Cloudflare returns the same "no available server" page
the 2026-05-16 outage produced, just briefly.
The chart now resolves strategy as:
1. Explicit `app.strategy` override wins.
2. Else: workspaces enabled AND accessModes does NOT include
ReadWriteMany → Recreate (RWO PVC can't be mounted by two pods,
RollingUpdate would deadlock at PVC attach). Backward compatible
with current ops setups.
3. Else (workspaces RWX or disabled) → RollingUpdate with
maxSurge: 1, maxUnavailable: 0. With replicaCount > 1 this is
true blue/green: the new pod must reach Ready before the old
terminates, so there's zero window where Service has no
endpoints. With replicaCount: 1 there's still an overlap
during which the new pod starts and becomes ready, but no gap.
To opt in to blue/green:
app.workspaces.accessModes: [ReadWriteMany]
app.workspaces.storageClass: "<rwx-class>" # NFS / EFS / CephFS / Longhorn-RWX / ...
app.replicaCount: 2
app.preStopDelaySeconds: 15 # already added in lobu#775; opt-in
# makes drain happen before SIGTERM
Concurrent gateway pods on RWX are safe: /app/workspaces paths are
keyed by (agent, run) tuples and the runs queue uses
FOR UPDATE SKIP LOCKED for claims, so no two pods own the same run.
`helm lint charts/lobu` and template-rendering tests for the three
matrix combinations (default RWO, RWX, workspaces disabled) all
produce the expected strategy and lifecycle blocks.
Pi review on #776 found three real correctness issues that would break under multi-replica or even brief RollingUpdate overlap: * **SseManager is pod-local** (gateway/services/sse-manager.ts). If pod A holds the SSE stream and pod B claims the job, broadcast goes to no-one — client sees the request hang. * **AskUser question routing is pod-local** (gateway/connections/interaction-bridge.ts:193-214). Question posted by pod A; click webhook lands on pod B; pod B's local `claimQuestion(id)` returns undefined and the click is dropped. * **Telegram polling mode is incompatible with multi-replica** (gateway/connections/chat-instance-manager.ts:610). Every replica starts its own long-poller on the same bot. RWX storage is necessary but NOT sufficient. The previous version of this PR auto-switched to RollingUpdate when it detected RWX accessModes, which would silently introduce these bugs for any operator who configured shared storage. Fix: gate the strategy switch behind an explicit `app.allowMultiReplica: false` (default off) flag. Setting it true documents that the operator has: 1. RWX workspaces storage 2. No Telegram polling-mode connections 3. Acknowledged the SSE / AskUser caveats (silently dropped responses during the overlap window). Without those preconditions, Recreate is the safe default and stays that way for every existing deploy. Operators who flip the flag get RollingUpdate with maxSurge:1, maxUnavailable:0. values.yaml documents each prerequisite inline so operators reading the comments before flipping the flag see the full picture. Also bumps web submodule to current main (drift fix).
c113bd4 to
da47ebe
Compare
Why
Close the last structural gap from the post-incident audit (#775 review): under `strategy: Recreate` (the workspaces-RWO default), every rollout has a ~30s "no available server" window between old-pod-terminated and new-pod-ready. Cloudflare returns the same page the 2026-05-16 outage produced, just briefly.
The actual constraint: the workspaces PVC is RWO so two pods can't mount it simultaneously, so Recreate is mandatory there. But operators using RWX storage (NFS, EFS, CephFS, Longhorn-RWX, Portworx, …) can run multiple replicas safely — the chart just wasn't auto-detecting that.
What
The chart now resolves `strategy` as:
Operator opt-in
```yaml
app:
replicaCount: 2
preStopDelaySeconds: 15 # already added in lobu#775
workspaces:
accessModes: [ReadWriteMany]
storageClass: ""
```
That's the whole change on the ops side.
Safety: concurrent pods on RWX
`/app/workspaces` paths are keyed by (agent, run) tuples — checked the prod volume contents directly:
```
/app/workspaces/marketing/marketing_watcher_218_run_141460
/app/workspaces/marketing/marketing_web-66ebf172_17fefa2b-0ff
/app/workspaces/marketing/marketing_web-ca015184_agent-panel
```
Each watcher run / agent panel session has a unique directory. The runs queue uses `FOR UPDATE SKIP LOCKED` for claims so no two pods own the same run. Concurrent gateway pods on RWX are structurally safe.
Validation
Not in this PR
Summary by CodeRabbit
Refactor
Documentation
Chores