infra: add holodeck provisioning and deploy scripts #159
Add create-app-ct.sh for automated LXC provisioning and deploy.sh for zero-downtime rolling deploys (hot) and migration deploys (maintenance). Update deployment plan to reflect Tasks 1-6 completion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Fix sudo in deploy.sh (git/pip must run as awareness user, not root), fix nameserver in create-ct200.sh (.1 → .10), add AGPL preamble to create-ct200.sh, and mark tasks 7-8 complete in deployment plan (tunnel cutover, resource pool, boot order, vzdump backups verified). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update maintenance guide to reference deploy script instead of manual steps. Update topology diagram and design spec to reflect HAProxy + app pool architecture (CT 203/210/211 replacing CT 201). Add stick store-response to HAProxy config for proper MCP session affinity. Add password prompt to provisioning script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adding QA Active — beginning review of infra scripts and doc changes.
cmeans left a comment
QA Review — Round 1
Findings
1. [Substantive] maintenance_deploy — sudo drops environment variables (deploy.sh:133)
`ssh "root@${first_ip}" 'cd /opt/mcp-awareness && set -a && source /etc/awareness/env && set +a && sudo -u awareness venv/bin/mcp-awareness-migrate upgrade head'`

`set -a && source /etc/awareness/env` loads and exports variables in the root shell, but `sudo -u awareness` resets the environment by default. AWARENESS_DATABASE_URL and other vars from the env file won't reach mcp-awareness-migrate. The migration will fail.
Fix options:
- `sudo -E -u awareness` to preserve environment
- `sudo -u awareness env $(cat /etc/awareness/env | xargs) venv/bin/mcp-awareness-migrate upgrade head`
- Source the env within a `sudo -u awareness bash -c '...'` subshell
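The difference between exporting before the boundary and sourcing inside it can be reproduced without root. This is a minimal sketch, not the deploy script itself: `env -i` stands in for sudo's default environment reset, and the temp file and its URL value are hypothetical stand-ins for /etc/awareness/env.

```shell
#!/usr/bin/env bash
set -euo pipefail

bash_bin="$(command -v bash)"

# Hypothetical env file standing in for /etc/awareness/env.
env_file="$(mktemp)"
printf 'AWARENESS_DATABASE_URL=postgresql://example/awareness\n' > "$env_file"

# Pattern from the finding: export in the outer shell, then cross a boundary
# that clears the environment. `env -i` is an assumption used here to imitate
# sudo's default reset so the example runs unprivileged.
outer_result="$(
  set -a && source "$env_file" && set +a
  env -i "$bash_bin" -c 'echo "${AWARENESS_DATABASE_URL:-unset}"'
)"

# Third fix option: source the file inside the inner shell, after the reset.
inner_result="$(
  env -i "$bash_bin" -c \
    'set -a && source "$1" && set +a && echo "${AWARENESS_DATABASE_URL:-unset}"' \
    _ "$env_file"
)"

rm -f "$env_file"

echo "exported outside the boundary: $outer_result"
echo "sourced inside the boundary:   $inner_result"
```

The third option has the practical advantage that it does not depend on the sudoers policy allowing `-E`, and it avoids word-splitting the env file through `xargs`.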
2. [Observation] maintenance_deploy double-updates the first node
Step 2 runs git pull + pip install on the first node for migration. Step 3 loops over ALL nodes and calls update_node which runs git pull + pip install + systemctl restart again on the same node. The second pull/install is a no-op but wasteful. Consider either skipping the first node in step 3's update (but still restart it), or restructuring so step 2 only runs the migration after step 3's update.
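One possible restructuring, sketched with stubs: update the first node once, migrate there, then loop over the remaining nodes with an array slice. `update_node`, `wait_healthy`, and `APP_NODES` are names from the script; `run_migration`, the call log, and the 192.0.2.x addresses are hypothetical and exist only to make the sketch runnable.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical node list; the real addresses are not shown in this review.
APP_NODES=("192.0.2.10" "192.0.2.11")
log=()

# Stubs that only record the call order instead of touching any host.
update_node()   { log+=("update:$1"); }
run_migration() { log+=("migrate:$1"); }
wait_healthy()  { log+=("healthy:$1"); }

maintenance_deploy() {
  local first="${APP_NODES[0]}"

  # The first node is updated exactly once, then used to run the migration.
  update_node "$first"
  run_migration "$first"
  wait_healthy "$first"

  # "${APP_NODES[@]:1}" expands to every element except the first.
  local node
  for node in "${APP_NODES[@]:1}"; do
    update_node "$node"
    wait_healthy "$node"
  done
}

maintenance_deploy
printf '%s\n' "${log[@]}"
```

Each node appears in exactly one `update` entry, which is the property this finding asks for.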
3. [Observation] PR summary is stale
PR body says "mark Tasks 1–6 as complete" — this reflects only commit 1. The third commit updates tasks 7–12 (except Task 11 decommission). Not a blocker, but worth updating the PR description to reflect the full scope.
4. [Nit] create-ct200.sh missing usage comment
create-app-ct.sh has a usage header (# Usage: create-app-ct.sh <ct-id> <ip-suffix> <hostname>), but create-ct200.sh has only a minimal header comment. Since it's hardcoded (no args), this is fine, but adding "# No arguments — creates CT 200 with fixed parameters" would make it self-documenting.
QA Steps
| # | Step | Result |
|---|---|---|
| 1 | bash -n scripts/holodeck/deploy.sh — syntax check | ✅ Pass |
| 2 | bash -n scripts/holodeck/create-app-ct.sh — syntax check | ✅ Pass |
| 3 | Run hot deploy end-to-end | ⏭️ Requires infra access — skipped |
| 4 | Verify service healthy after deploy | ⏭️ Requires infra access — skipped |
| 5 | Verify plan checkboxes reflect actual state | ✅ 49 checked, 4 unchecked (Task 11 decommission + format note) — correct |
Additional checks
- All 3 scripts: executable (+x), proper shebangs (`#!/usr/bin/env bash`), `set -euo pipefail` ✅
- AGPL license preamble on all scripts ✅ (matches project convention — full preamble, not SPDX one-liner)
- CI: lint ✅, typecheck ✅, test 3.10/3.11/3.12 ✅, codecov/patch ✅
- Topology diagram updated correctly ✅
- Design spec reflects new architecture ✅
- Maintenance guide references deploy script instead of manual steps ✅
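The script-level checks above (executable bit, shebang, strict mode, `bash -n`) could be bundled into one helper for future rounds. A rough sketch follows: `check_script` is a hypothetical function, not part of this PR, and it runs against two throwaway files rather than the real scripts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints "ok" or the first failed check for one script.
check_script() {
  local path="$1"
  [[ -x "$path" ]] || { echo "not executable"; return 1; }
  [[ "$(head -n 1 "$path")" == '#!/usr/bin/env bash' ]] || { echo "bad shebang"; return 1; }
  grep -q '^set -euo pipefail' "$path" || { echo "missing strict mode"; return 1; }
  # bash -n parses the file without executing it.
  bash -n "$path" 2>/dev/null || { echo "syntax error"; return 1; }
  echo "ok"
}

# Throwaway fixtures: one valid script, one with an unterminated `if`.
workdir="$(mktemp -d)"
printf '#!/usr/bin/env bash\nset -euo pipefail\necho hi\n' > "$workdir/good.sh"
printf '#!/usr/bin/env bash\nset -euo pipefail\nif true; then\n' > "$workdir/bad.sh"
chmod +x "$workdir/good.sh" "$workdir/bad.sh"

good_result="$(check_script "$workdir/good.sh")"
bad_result="$(check_script "$workdir/bad.sh" || true)"
rm -rf "$workdir"

echo "good.sh: $good_result"
echo "bad.sh:  $bad_result"
```

Against the repository it would be called as `check_script scripts/holodeck/deploy.sh`, and likewise for the two provisioning scripts.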
Verdict
Finding #1 is a blocker — the maintenance deploy migration will fail because sudo drops environment variables. Findings #2–4 are non-blocking.
Applying QA Failed — finding #1 (sudo drops env vars in maintenance deploy migration) is a blocker. See review for details and fix options.
- Fix maintenance deploy: sudo drops env vars, use sudo -u awareness bash -c with sourced env to preserve AWARENESS_DATABASE_URL (#1 blocker)
- Fix double-update of first node in maintenance deploy: update_node runs once, remaining nodes loop skips first entry (#2)
- Add usage comment to create-ct200.sh (#4)
- Update deployment plan to match fixed script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev Note — Requesting Round 3 QA
QA Round 2 reviewed the fix commit (35b0f1c) but may not have fully reviewed commit 474ed1f, which includes significant doc changes:
474ed1f — infra: update docs and HAProxy config for zero-downtime deployment
Please review these doc changes in addition to confirming the Round 1 fixes.
Adding QA Active — Round 3 thorough re-review, checkbox updates, and feedback check.
cmeans left a comment
QA Review — Round 3 (thorough re-review)
Full file-level review of all 6 changed files across all 4 commits. Confirming Round 1 fixes and checking for anything missed.
Round 1 findings — verified fixed
| # | Finding | Status |
|---|---|---|
| 1 | sudo drops env vars in maintenance migration | ✅ Fixed — sudo -u awareness bash -c "set -a && source ... && ..." (deploy.sh:133) |
| 2 | Double-update of first node | ✅ Fixed — update_node in step 2, ${APP_NODES[@]:1} in step 3, added wait_healthy |
| 3 | Stale PR summary | ✅ Fixed — PR body updated |
| 4 | Missing usage comment on create-ct200.sh | ✅ Fixed — comment added |
New findings
5. [Substantive] create-ct200.sh:19 — comment says "1GB RAM" but script allocates 2GB
# No arguments — creates CT 200 with fixed parameters (IP .100, 2 cores, 1GB RAM).
Actual: --memory 2048 (2GB). The design spec correctly says "2 CPU cores, 2GB RAM, 20GB disk". The comment introduced in the fix commit is wrong.
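One way to keep a description like this from drifting away from the allocation is to derive the human-readable summary from the same value that is passed to `--memory`. This is a minimal sketch: the variable names are hypothetical, and only the figures (2048 MB, 2 cores, IP .100) come from the script and spec under review.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical variable names; the values match the script and design spec.
memory_mb=2048
cores=2

# --memory takes MB, so 2048 / 1024 gives the 2GB the spec describes.
memory_gb=$(( memory_mb / 1024 ))

summary="CT 200: IP .100, ${cores} cores, ${memory_gb}GB RAM"
echo "$summary"
```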
6. [Observation] deploy.sh:19 — "Run from holodeck host" is misleading
The script uses ssh root@IP to connect to nodes, and the maintenance guide says to run it from the workstation with SSH aliases. The script works from either location, but "Run from holodeck host" may confuse operators. Consider: "Run from any host with SSH access to app nodes and HAProxy."
7. [Observation] create-app-ct.sh:117 — stale CT 201 reference
Next steps output says (paste contents from CT 201 or KeePass). CT 201 is decommissioned. Should reference an existing node or just KeePass.
8. [Observation] Design spec — historical sections still reference CT 201
The "Decisions made" table (line 21: "3 LXCs"), Data Migration (line 246: "Start CT 201"), and Implementation Order (lines 311-312) still reference the pre-HA architecture. These sections describe completed historical steps so they're not wrong per se, but could confuse future readers. CT 201 is properly marked decommissioned in the topology (line 73). Non-blocking.
Cross-reference verification
| Check | Result |
|---|---|
| IPs consistent across all 6 files | ✅ |
| Hostnames match (awareness-lb, app-a, app-b, pg, tunnel) | ✅ |
| Plan embedded code blocks match actual scripts | ✅ |
| Maintenance doc architecture table matches design spec | ✅ |
| Maintenance doc cross-references correct spec file | ✅ (2026-04-02-zero-downtime-deployment-design.md exists) |
| AGPL preamble on all 3 scripts | ✅ |
| All scripts executable (+x), set -euo pipefail | ✅ |
| bash -n syntax check on all 3 scripts | ✅ |
| CI: lint, typecheck, test 3.10/3.11/3.12, codecov | ✅ All pass |
PR checkbox status
| # | Step | Result |
|---|---|---|
| 1 | deploy.sh syntax check | ✅ Pass |
| 2 | create-app-ct.sh syntax check | ✅ Pass |
| 3 | Hot deploy end-to-end | ⏭️ Requires infra |
| 4 | Service healthy after deploy | ⏭️ Requires infra |
| 5 | Plan checkboxes reflect actual state | ✅ 49 checked, 3 unchecked (Task 11 decommission) |
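The checked/unchecked counts in step 5 can be reproduced mechanically. The sketch below assumes the plan uses GitHub task-list syntax (`- [x]` and `- [ ]`) at the start of a line, which this review does not show; the sample file and its three entries are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the deployment plan file.
plan="$(mktemp)"
cat > "$plan" <<'EOF'
- [x] Task 1: provision app containers
- [x] Task 2: add deploy script
- [ ] Task 11: decommission old container
EOF

# grep -c exits non-zero when the count is 0, so `|| true` keeps strict mode happy.
checked="$(grep -c '^- \[x\]' "$plan" || true)"
unchecked="$(grep -c '^- \[ \]' "$plan" || true)"
rm -f "$plan"

echo "${checked} checked, ${unchecked} unchecked"
```

Indented or nested checkboxes would need a looser pattern, such as allowing leading whitespace before the dash.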
Verdict
Finding #5 is substantive — the comment is factually wrong (1GB vs 2GB) and could mislead an operator reprovisioning CT 200. Findings #6–8 are observations, non-blocking.
No blockers this round. Up to you whether to fix #5 before signoff or treat it as a follow-up.
Applying QA Failed — findings #5–7 need fixes before signoff. All are trivial: #5
- create-ct200.sh: fix RAM comment (1GB → 2GB) to match --memory 2048
- deploy.sh: clarify runs from any SSH-capable host, not just holodeck
- create-app-ct.sh: replace CT 201 reference with generic "existing app node"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev Response — Round 3 Findings
All three findings fixed in 2f1f8da: #5 CI running.
cmeans left a comment
QA Review — Round 4
All Round 3 findings verified fixed in 2f1f8da:
| # | Finding | Fix |
|---|---|---|
| 5 | create-ct200.sh comment: "1GB" → "2GB" | ✅ |
| 6 | deploy.sh comment: "holodeck host" → "any host with SSH access" | ✅ |
| 7 | create-app-ct.sh output: "CT 201" → "an existing app node" | ✅ |
Zero new findings. CI green. All 8 findings across 4 rounds resolved.
Verdict: Pass — ready for signoff.
Applying Ready for QA Signoff — Round 4 complete. All 8 findings across 4 rounds resolved, zero outstanding. CI green.
Summary
- create-app-ct.sh, create-ct200.sh
- deploy.sh with hot and maintenance modes
- stick store-response for MCP session affinity

QA
Prerequisites
- bash -n scripts/holodeck/deploy.sh — syntax check
- bash -n scripts/holodeck/create-app-ct.sh — syntax check

Manual tests
- bash -n scripts/holodeck/deploy.sh
- bash -n scripts/holodeck/create-app-ct.sh
- bash -n scripts/holodeck/create-ct200.sh
- sudo -u awareness bash -c preserves env vars

🤖 Generated with Claude Code