infra: add holodeck provisioning and deploy scripts #159

Merged: cmeans merged 5 commits into main from infra/zero-downtime-scripts on Apr 7, 2026

cmeans-claude-dev (bot, Contributor) commented Apr 5, 2026

Summary

  • Add holodeck LXC provisioning scripts (create-app-ct.sh, create-ct200.sh)
  • Add zero-downtime deploy script (deploy.sh) with hot and maintenance modes
  • Update deployment plan: mark Tasks 1–10 complete, add HAProxy stick store-response for MCP session affinity
  • Update maintenance guide to reference deploy script instead of manual steps
  • Update topology diagram and design spec for HAProxy + app pool architecture (CT 203/210/211)

QA

Prerequisites

  • bash -n scripts/holodeck/deploy.sh — syntax check
  • bash -n scripts/holodeck/create-app-ct.sh — syntax check

Manual tests

    • Verify deploy.sh syntax: bash -n scripts/holodeck/deploy.sh
    • Verify create-app-ct.sh syntax: bash -n scripts/holodeck/create-app-ct.sh
    • Verify create-ct200.sh syntax: bash -n scripts/holodeck/create-ct200.sh
    • Review that the maintenance deploy's sudo -u awareness bash -c invocation preserves env vars
    • Review that the maintenance deploy updates only the remaining nodes (the first node is not updated twice)
    • Verify plan checkboxes match actual completion state
    • Verify topology diagram reflects HAProxy + app pool architecture
    • Verify maintenance guide references deploy script
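
The three syntax checks listed above can be run in one pass. This is a minimal sketch, assuming it is run from the repository root; `check_syntax` is a hypothetical helper name and is not part of this PR.

```shell
#!/usr/bin/env bash
# check_syntax FILE...: run `bash -n` (parse only, no execution) on each file.
# Returns non-zero if any file fails to parse.
check_syntax() {
  local rc=0 f
  for f in "$@"; do
    if bash -n "$f" 2>/dev/null; then
      echo "ok   $f"
    else
      echo "FAIL $f"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (paths taken from the PR description):
#   check_syntax scripts/holodeck/deploy.sh \
#                scripts/holodeck/create-app-ct.sh \
#                scripts/holodeck/create-ct200.sh
```

Note that `bash -n` only catches parse errors; it would not have caught the environment issue raised later in review.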

🤖 Generated with Claude Code

Add create-app-ct.sh for automated LXC provisioning and deploy.sh
for zero-downtime rolling deploys (hot) and migration deploys
(maintenance). Update deployment plan to reflect Tasks 1-6 completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions (bot) added the "Awaiting CI" label (Dev complete, waiting for CI/Codecov to pass before QA) on Apr 5, 2026.

codecov (bot) commented Apr 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions (bot) added the "Ready for QA" label (Dev work complete; QA can begin review) and removed "Awaiting CI" on Apr 5, 2026.

Fix sudo in deploy.sh (git/pip must run as awareness user, not root),
fix nameserver in create-ct200.sh (.1 → .10), add AGPL preamble to
create-ct200.sh, and mark tasks 7-8 complete in deployment plan
(tunnel cutover, resource pool, boot order, vzdump backups verified).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions (bot) added "Awaiting CI" and "Ready for QA" and removed "Ready for QA" and "Awaiting CI" on Apr 7, 2026.

Update maintenance guide to reference deploy script instead of manual steps.
Update topology diagram and design spec to reflect HAProxy + app pool
architecture (CT 203/210/211 replacing CT 201). Add stick store-response
to HAProxy config for proper MCP session affinity. Add password prompt
to provisioning script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cmeans-claude-dev (bot) added the "Dev Active" label (Developer is actively working on this PR; QA should not start) on Apr 7, 2026.
github-actions (bot) added "Awaiting CI" and removed "Ready for QA" on Apr 7, 2026.
cmeans-claude-dev (bot) removed "Dev Active" on Apr 7, 2026.
github-actions (bot) added "Ready for QA" and removed "Awaiting CI" on Apr 7, 2026.
cmeans added the "QA Active" label (QA is actively reviewing; Dev should not push changes) and removed "Ready for QA" on Apr 7, 2026.

cmeans (Owner) commented Apr 7, 2026

Adding QA Active — beginning review of infra scripts and doc changes.

cmeans (Owner) left a comment

QA Review — Round 1

Findings

1. [Substantive] maintenance_deploy: sudo drops environment variables (deploy.sh:133)

ssh "root@${first_ip}" 'cd /opt/mcp-awareness && set -a && source /etc/awareness/env && set +a && sudo -u awareness venv/bin/mcp-awareness-migrate upgrade head'

set -a && source /etc/awareness/env loads and exports variables in the root shell, but sudo -u awareness resets the environment by default. AWARENESS_DATABASE_URL and other vars from the env file won't reach mcp-awareness-migrate. The migration will fail.

Fix options:

  • sudo -E -u awareness to preserve environment
  • sudo -u awareness env $(cat /etc/awareness/env | xargs) venv/bin/mcp-awareness-migrate upgrade head
  • Source the env within a sudo -u awareness bash -c '...' subshell
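
The failure mode and the third fix option can be reproduced without root. The sketch below is an illustration only: `env -i` stands in for sudo's default environment reset, and the env file contents and URL value are made up for the demo.

```shell
#!/usr/bin/env bash
# Illustration only: `env -i` simulates a boundary that clears the environment,
# as sudo does by default. The file and value below are placeholders.
bash_bin="$(command -v bash)"
env_file="$(mktemp)"
echo 'AWARENESS_DATABASE_URL=postgresql://example/awareness' > "$env_file"

# Broken pattern: export in the outer shell, then cross the boundary.
# The child process never sees the variable.
broken="$(
  set -a; . "$env_file"; set +a
  env -i "$bash_bin" -c 'echo "${AWARENESS_DATABASE_URL:-unset}"'
)"

# Fixed pattern: source the env file inside the child shell itself.
fixed="$(
  env -i ENV_FILE="$env_file" "$bash_bin" -c \
    'set -a; . "$ENV_FILE"; set +a; echo "${AWARENESS_DATABASE_URL:-unset}"'
)"

echo "broken: $broken"
echo "fixed:  $fixed"
rm -f "$env_file"
```

Applied to the real command, the fixed pattern corresponds to sourcing /etc/awareness/env inside a `sudo -u awareness bash -c '...'` subshell, which assumes the awareness user can read that file.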

2. [Observation] maintenance_deploy double-updates the first node

Step 2 runs git pull + pip install on the first node for migration. Step 3 loops over ALL nodes and calls update_node which runs git pull + pip install + systemctl restart again on the same node. The second pull/install is a no-op but wasteful. Consider either skipping the first node in step 3's update (but still restart it), or restructuring so step 2 only runs the migration after step 3's update.
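
One way to avoid the double update is to update the first node once, then loop over a slice of the node array that skips index 0. This is a sketch under stated assumptions: the node names are borrowed from the hostnames mentioned elsewhere in this PR, and the function bodies are stand-ins rather than the contents of deploy.sh.

```shell
#!/usr/bin/env bash
# Sketch only: function bodies are placeholders for the real remote commands.
APP_NODES=("app-a" "app-b")
UPDATED=()

update_node() {
  # Stand-in for git pull + pip install + systemctl restart on node $1.
  UPDATED+=("$1")
}

run_migration() {
  # Stand-in for running the migration on node $1.
  echo "migrating on $1"
}

maintenance_deploy() {
  local first="${APP_NODES[0]}" node
  update_node "$first"      # step 2: update the first node exactly once
  run_migration "$first"    # then run the migration from it
  for node in "${APP_NODES[@]:1}"; do   # step 3: slice skips index 0
    update_node "$node"
  done
}

maintenance_deploy
printf 'updated: %s\n' "${UPDATED[@]}"
```

The `${APP_NODES[@]:1}` expansion yields every element from index 1 onward, so each node is updated once regardless of pool size.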

3. [Observation] PR summary is stale

PR body says "mark Tasks 1–6 as complete" — this reflects only commit 1. The third commit updates tasks 7–12 (except Task 11 decommission). Not a blocker, but worth updating the PR description to reflect the full scope.

4. [Nit] create-ct200.sh missing usage comment

create-app-ct.sh has a usage header (# Usage: create-app-ct.sh <ct-id> <ip-suffix> <hostname>), but create-ct200.sh has only a minimal header comment. Since it's hardcoded (no args), this is fine, but adding "# No arguments — creates CT 200 with fixed parameters" would make it self-documenting.

QA Steps

| # | Step | Result |
|---|------|--------|
| 1 | bash -n scripts/holodeck/deploy.sh (syntax check) | ✅ Pass |
| 2 | bash -n scripts/holodeck/create-app-ct.sh (syntax check) | ✅ Pass |
| 3 | Run hot deploy end-to-end | ⏭️ Requires infra access, skipped |
| 4 | Verify service healthy after deploy | ⏭️ Requires infra access, skipped |
| 5 | Verify plan checkboxes reflect actual state | ✅ 49 checked, 4 unchecked (Task 11 decommission + format note): correct |

Additional checks

  • All 3 scripts: executable (+x), proper shebangs (#!/usr/bin/env bash), set -euo pipefail
  • AGPL license preamble on all scripts ✅ (matches project convention — full preamble, not SPDX one-liner)
  • CI: lint ✅, typecheck ✅, test 3.10/3.11/3.12 ✅, codecov/patch ✅
  • Topology diagram updated correctly ✅
  • Design spec reflects new architecture ✅
  • Maintenance guide references deploy script instead of manual steps ✅
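
The shebang and strict-mode checks from the first bullet above can be automated. This is a minimal sketch; `check_script_conventions` is a hypothetical helper, and the executable bit is left out because it is simpler to verify separately with `[ -x FILE ]`.

```shell
#!/usr/bin/env bash
# check_script_conventions FILE: verify the shebang and strict-mode line
# that the review expects on every script. Returns non-zero on any miss.
check_script_conventions() {
  local f="$1" rc=0
  if [ "$(head -n 1 "$f")" != '#!/usr/bin/env bash' ]; then
    echo "$f: unexpected shebang"
    rc=1
  fi
  if ! grep -qE '^set -euo pipefail[[:space:]]*$' "$f"; then
    echo "$f: missing 'set -euo pipefail'"
    rc=1
  fi
  return "$rc"
}

# Usage:
#   for f in scripts/holodeck/*.sh; do check_script_conventions "$f"; done
```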

Verdict

Finding #1 is a blocker — the maintenance deploy migration will fail because sudo drops environment variables. Findings #2–4 are non-blocking.

cmeans added the "QA Failed" label (QA found issues; needs dev attention) and removed "QA Active" on Apr 7, 2026.

cmeans (Owner) commented Apr 7, 2026

Applying QA Failed — finding #1 (sudo drops env vars in maintenance deploy migration) is a blocker. See review for details and fix options.

- Fix maintenance deploy: sudo drops env vars, use sudo -u awareness
  bash -c with sourced env to preserve AWARENESS_DATABASE_URL (#1 blocker)
- Fix double-update of first node in maintenance deploy: update_node
  runs once, remaining nodes loop skips first entry (#2)
- Add usage comment to create-ct200.sh (#4)
- Update deployment plan to match fixed script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cmeans-claude-dev (bot) added "Dev Active" on Apr 7, 2026.
github-actions (bot) added "Awaiting CI" and removed "QA Failed" on Apr 7, 2026.
cmeans-claude-dev (bot) added "Ready for QA" and removed "Dev Active" on Apr 7, 2026.

cmeans-claude-dev (bot, Contributor, Author) commented

Dev Note — Requesting Round 3 QA

QA Round 2 reviewed the fix commit (35b0f1c) but may not have fully reviewed commit 474ed1f, which includes significant doc changes:

474ed1f — infra: update docs and HAProxy config for zero-downtime deployment

  • docs/maintenance/holodeck/update-mcp-awareness.md — complete rewrite: replaces manual CT 201 steps with deploy script reference, new architecture table
  • docs/superpowers/specs/2026-04-01-holodeck-deployment-design.md — topology diagram updated (HAProxy + app pool replacing CT 201), CT 201 section replaced with CT 203 + CT 210/211, tunnel config updated to point at HAProxy, load profiling and network summary tables updated
  • docs/superpowers/plans/2026-04-02-zero-downtime-deployment.md — HAProxy config adds stick store-response line, Task 9/10/12 checkboxes marked done, maintenance deploy code block updated to match fixed script
  • scripts/holodeck/create-app-ct.sh — added password prompt message

Please review these doc changes in addition to confirming the Round 1 fixes.

cmeans added "QA Active" and removed "Ready for QA" on Apr 7, 2026.

cmeans (Owner) commented Apr 7, 2026

Adding QA Active — Round 3 thorough re-review, checkbox updates, and feedback check.

cmeans (Owner) left a comment

QA Review — Round 3 (thorough re-review)

Full file-level review of all 6 changed files across all 4 commits. Confirming Round 1 fixes and checking for anything missed.

Round 1 findings — verified fixed

| # | Finding | Status |
|---|---------|--------|
| 1 | sudo drops env vars in maintenance migration | ✅ Fixed: sudo -u awareness bash -c "set -a && source ... && ..." (deploy.sh:133) |
| 2 | Double-update of first node | ✅ Fixed: update_node in step 2, ${APP_NODES[@]:1} in step 3, added wait_healthy |
| 3 | Stale PR summary | ✅ Fixed: PR body updated |
| 4 | Missing usage comment on create-ct200.sh | ✅ Fixed: comment added |
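
The review does not show how wait_healthy is implemented. As a rough illustration of the general idea (poll a check until it passes or a limit is reached), here is a generic retry helper; `retry_until` and its parameters are hypothetical and not taken from deploy.sh.

```shell
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping
# DELAY seconds between tries. Returns 1 if all attempts fail.
retry_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Usage (HEALTH_URL is a placeholder, not an endpoint named in this PR):
#   retry_until 30 2 curl -fsS "$HEALTH_URL" >/dev/null
```

Bounding the number of attempts matters in a rolling deploy: a node that never recovers should stop the rollout rather than hang it.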

New findings

5. [Substantive] create-ct200.sh:19 — comment says "1GB RAM" but script allocates 2GB

# No arguments — creates CT 200 with fixed parameters (IP .100, 2 cores, 1GB RAM).

Actual: --memory 2048 (2GB). The design spec correctly says "2 CPU cores, 2GB RAM, 20GB disk". The comment introduced in the fix commit is wrong.
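
A mismatch like this can be detected mechanically by comparing the `--memory` value with the figure quoted in the header comment. The sketch below assumes the comment uses the "NGB RAM" wording shown above and that memory is given in MiB; both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# memory_gb_from_script FILE: print the first --memory value (MiB) as whole GB.
memory_gb_from_script() {
  local mib
  mib="$(grep -oE -- '--memory[ =][0-9]+' "$1" | head -n 1 | grep -oE '[0-9]+')"
  [ -n "$mib" ] || return 1
  echo $(( mib / 1024 ))
}

# comment_gb_from_script FILE: print N from the first "NGB RAM" in a comment line.
comment_gb_from_script() {
  local gb
  gb="$(grep -E '^#' "$1" | grep -oE '[0-9]+GB RAM' | head -n 1 | grep -oE '^[0-9]+')"
  [ -n "$gb" ] || return 1
  echo "$gb"
}

# Usage:
#   [ "$(memory_gb_from_script f.sh)" = "$(comment_gb_from_script f.sh)" ] \
#     || echo "comment and --memory disagree"
```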

6. [Observation] deploy.sh:19 — "Run from holodeck host" is misleading

The script uses ssh root@IP to connect to nodes, and the maintenance guide says to run it from the workstation with SSH aliases. The script works from either location, but "Run from holodeck host" may confuse operators. Consider: "Run from any host with SSH access to app nodes and HAProxy."

7. [Observation] create-app-ct.sh:117 — stale CT 201 reference

Next steps output says (paste contents from CT 201 or KeePass). CT 201 is decommissioned. Should reference an existing node or just KeePass.

8. [Observation] Design spec — historical sections still reference CT 201

The "Decisions made" table (line 21: "3 LXCs"), Data Migration (line 246: "Start CT 201"), and Implementation Order (lines 311-312) still reference the pre-HA architecture. These sections describe completed historical steps so they're not wrong per se, but could confuse future readers. CT 201 is properly marked decommissioned in the topology (line 73). Non-blocking.

Cross-reference verification

| Check | Result |
|-------|--------|
| IPs consistent across all 6 files | |
| Hostnames match (awareness-lb, app-a, app-b, pg, tunnel) | |
| Plan embedded code blocks match actual scripts | |
| Maintenance doc architecture table matches design spec | |
| Maintenance doc cross-references correct spec file | ✅ (2026-04-02-zero-downtime-deployment-design.md exists) |
| AGPL preamble on all 3 scripts | |
| All scripts executable (+x), set -euo pipefail | |
| bash -n syntax check on all 3 scripts | |
| CI: lint, typecheck, test 3.10/3.11/3.12, codecov | ✅ All pass |

PR checkbox status

| # | Step | Result |
|---|------|--------|
| 1 | deploy.sh syntax check | ✅ Pass |
| 2 | create-app-ct.sh syntax check | ✅ Pass |
| 3 | Hot deploy end-to-end | ⏭️ Requires infra |
| 4 | Service healthy after deploy | ⏭️ Requires infra |
| 5 | Plan checkboxes reflect actual state | ✅ 49 checked, 3 unchecked (Task 11 decommission) |

Verdict

Finding #5 is substantive — the comment is factually wrong (1GB vs 2GB) and could mislead an operator reprovisioning CT 200. Findings #6–8 are observations, non-blocking.

No blockers this round. Up to you whether to fix #5 before signoff or treat it as a follow-up.

cmeans added the "Ready for QA Signoff" label (QA passed; ready for maintainer final review and merge) and removed "QA Active" on Apr 7, 2026.

cmeans (Owner) commented Apr 7, 2026

Applying Ready for QA Signoff — thorough Round 3 complete. All Round 1 blockers verified fixed. One new substantive finding (#5: create-ct200.sh comment says 1GB but script allocates 2GB) — non-blocking, script behavior is correct. Three observations (#6-8). CI green.

cmeans added "QA Failed" and removed "Ready for QA Signoff" on Apr 7, 2026.

cmeans (Owner) commented Apr 7, 2026

Applying QA Failed — findings #5–7 need fixes before signoff. All are trivial:

#5 create-ct200.sh:19 — comment says "1GB RAM", should be "2GB RAM" (--memory 2048)
#6 deploy.sh:19 — "Run from holodeck host" → "Run from any host with SSH access to app nodes and HAProxy"
#7 create-app-ct.sh:117 — next steps reference "CT 201 or KeePass" → CT 201 is decommissioned, drop the reference

cmeans-claude-dev (bot) added "Dev Active" on Apr 7, 2026.

- create-ct200.sh: fix RAM comment (1GB → 2GB) to match --memory 2048
- deploy.sh: clarify runs from any SSH-capable host, not just holodeck
- create-app-ct.sh: replace CT 201 reference with generic "existing app node"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cmeans-claude-dev (bot, Contributor, Author) commented

Dev Response — Round 3 Findings

All three findings fixed in 2f1f8da:

#5 create-ct200.sh:19 — Fixed: "1GB RAM" → "2GB RAM" to match --memory 2048
#6 deploy.sh:19 — Fixed: "Run from holodeck host" → "Run from any host with SSH access to app nodes and HAProxy"
#7 create-app-ct.sh:117 — Fixed: "CT 201 or KeePass" → "an existing app node or KeePass"

CI running.

github-actions (bot) added "Awaiting CI" and removed "QA Failed" on Apr 7, 2026.
cmeans-claude-dev (bot) added "Ready for QA" and removed "Dev Active" on Apr 7, 2026.
github-actions (bot) removed "Awaiting CI" on Apr 7, 2026.
cmeans added "QA Active" and removed "Ready for QA" on Apr 7, 2026.

cmeans (Owner) left a comment

QA Review — Round 4

All Round 3 findings verified fixed in 2f1f8da:

| # | Finding | Fix |
|---|---------|-----|
| 5 | create-ct200.sh comment: "1GB" → "2GB" | |
| 6 | deploy.sh comment: "holodeck host" → "any host with SSH access" | |
| 7 | create-app-ct.sh output: "CT 201" → "an existing app node" | |

Zero new findings. CI green. All 8 findings across 4 rounds resolved.

Verdict: Pass — ready for signoff.

cmeans added "Ready for QA Signoff" and removed "QA Active" on Apr 7, 2026.

cmeans (Owner) commented Apr 7, 2026

Applying Ready for QA Signoff — Round 4 complete. All 8 findings across 4 rounds resolved, zero outstanding. CI green.

cmeans (Owner) left a comment

LGTM

cmeans added the "QA Approved" label (Manual QA testing completed and passed) and removed "Ready for QA Signoff" on Apr 7, 2026.
cmeans merged commit 52a9c20 into main on Apr 7, 2026. 43 checks passed.
cmeans deleted the infra/zero-downtime-scripts branch on April 7, 2026 at 16:02.