diff --git a/docs/maintenance/holodeck/update-mcp-awareness.md b/docs/maintenance/holodeck/update-mcp-awareness.md
index 51c9e628..94ad491d 100644
--- a/docs/maintenance/holodeck/update-mcp-awareness.md
+++ b/docs/maintenance/holodeck/update-mcp-awareness.md
@@ -1,53 +1,65 @@
 # Update MCP Awareness on Holodeck
 
-Manual deployment steps for updating the mcp-awareness service on the holodeck Proxmox host (CT 201 — `awareness-app`).
+The mcp-awareness service runs on two app nodes (CT 210, CT 211) behind an HAProxy load balancer (CT 203). Updates are deployed using the zero-downtime deploy script.
 
 ## Prerequisites
 
-- SSH access to holodeck (`192.168.200.70`)
-- Root access on CT 201 (`awareness-app`, `192.168.200.101`)
+- SSH access to holodeck and all CTs (via `~/.ssh/config` aliases)
+- The deploy script at `scripts/holodeck/deploy.sh`
 
-## Steps
+## Deploying Updates
 
-### 1. SSH into the container
+### Code-only updates (zero-downtime)
 
 ```bash
-ssh root@192.168.200.101
+scripts/holodeck/deploy.sh hot
 ```
 
-### 2. Pull latest code
+This performs a rolling update: drains each node from HAProxy, pulls latest code, installs, restarts the service, waits for health check, then re-enables. One node is always serving traffic.
 
-```bash
-git config --global --add safe.directory /opt/mcp-awareness
-cd /opt/mcp-awareness
-git pull origin main
-```
+**Note:** Active MCP sessions on the restarting node will get "Session terminated" errors. Clients need to reconnect. See issues #161–#163 for planned improvements.
 
-### 3. Install updated package
+### Updates with migrations or config changes
 
 ```bash
-/opt/mcp-awareness/venv/bin/pip install -e .
+scripts/holodeck/deploy.sh maintenance
 ```
 
-### 4. Add any new environment variables
+This drains all nodes, runs Alembic migrations on the first node, then updates and restarts all nodes. There is a brief service interruption during migration.
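The rolling order described for `deploy.sh hot` can be pictured with a small dry-run sketch. This is not the deploy script itself: `rolling_plan` and the step labels are hypothetical names chosen for illustration, and nothing here touches HAProxy or the app nodes.

```shell
#!/usr/bin/env bash
# Dry-run sketch (hypothetical helper, not part of deploy.sh):
# prints the per-node order of operations for a rolling update.
# Each node finishes all five steps before the next node starts,
# which is why one node can keep serving traffic throughout.

rolling_plan() {
  local node
  for node in "$@"; do
    echo "drain ${node}"         # stop sending new requests to this node
    echo "update ${node}"        # pull code and reinstall
    echo "restart ${node}"       # restart the service
    echo "wait-healthy ${node}"  # poll the health endpoint
    echo "enable ${node}"        # put the node back into rotation
  done
}

# Print the plan for the two app nodes named in this guide.
rolling_plan app-a app-b
```

The maintenance mode differs in that every node is drained before any node is updated, which is where the brief interruption comes from.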
+
+### Adding new environment variables
 
-If the release includes new env vars, append them to the env file:
+If a release requires new env vars, update the env file on both app nodes before deploying:
 
 ```bash
-nano /etc/awareness/env
+ssh awareness-app-a 'nano /etc/awareness/env'
+ssh awareness-app-b 'nano /etc/awareness/env'
 ```
 
-### 5. Restart the service
+## Verification
+
+After deploy, verify via HAProxy:
 
 ```bash
-systemctl restart mcp-awareness
+curl -s http://192.168.200.103:8420/health | python3 -m json.tool
 ```
 
-### 6. Verify
+Or check both backends directly:
 
 ```bash
-curl -s localhost:8420/health | python3 -m json.tool
+curl -s http://192.168.200.110:8420/health | python3 -m json.tool
+curl -s http://192.168.200.111:8420/health | python3 -m json.tool
 ```
 
-Confirm `status: ok` and expected uptime (should be a few seconds).
+## Architecture
+
+See `docs/superpowers/specs/2026-04-02-zero-downtime-deployment-design.md` for the full design spec.
+
+| Component | Host | IP |
+|-----------|------|----|
+| HAProxy (load balancer) | CT 203 `awareness-lb` | 192.168.200.103 |
+| App node A | CT 210 `awareness-app-a` | 192.168.200.110 |
+| App node B | CT 211 `awareness-app-b` | 192.168.200.111 |
+| Postgres | CT 200 `awareness-pg` | 192.168.200.100 |
+| Cloudflare tunnel | CT 202 `awareness-tunnel` | 192.168.200.102 |
diff --git a/docs/superpowers/plans/2026-04-02-zero-downtime-deployment.md b/docs/superpowers/plans/2026-04-02-zero-downtime-deployment.md
index 54d20e96..61de0a2f 100644
--- a/docs/superpowers/plans/2026-04-02-zero-downtime-deployment.md
+++ b/docs/superpowers/plans/2026-04-02-zero-downtime-deployment.md
@@ -71,7 +71,7 @@ each CT. The provisioning script (Task 2) handles this for new CTs.
**Where:** `[holodeck]` -- [ ] **Step 1: Identify the Debian 12 template** +- [x] **Step 1: Identify the Debian 12 template** ```bash pveam list local | grep debian-12 @@ -79,39 +79,39 @@ pveam list local | grep debian-12 Expected: Shows a `debian-12-standard_*.tar.zst` template. Note the exact filename. -- [ ] **Step 2: Create the LXC** +- [x] **Step 2: Create the LXC** ```bash -pct create 203 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst --hostname awareness-lb --cores 1 --memory 256 --swap 128 --rootfs local-lvm:4 --net0 name=eth0,bridge=vmbr0,ip=192.168.200.103/24,gw=192.168.200.1 --nameserver 192.168.200.1 --unprivileged 1 --features nesting=0 --start 0 --password +pct create 203 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst --hostname awareness-lb --cores 1 --memory 256 --swap 128 --rootfs local-lvm:4 --net0 name=eth0,bridge=vmbr0,ip=192.168.200.103/24,gw=192.168.200.1 --nameserver 192.168.200.10 --unprivileged 1 --features nesting=0 --start 0 --password ``` **[USER]** Set root password, store in KeePass. Adjust the template filename if it differs from step 1. -- [ ] **Step 3: Start and enter CT 203** +- [x] **Step 3: Start and enter CT 203** ```bash pct start 203 pct enter 203 ``` -- [ ] **Step 4: Update base system** +- [x] **Step 4: Update base system** ```bash apt update && apt upgrade -y ``` -- [ ] **Step 5: Install HAProxy and socat** +- [x] **Step 5: Install HAProxy, socat, and curl** ```bash -apt install -y haproxy socat +apt install -y haproxy socat curl haproxy -v ``` Expected: HAProxy version 2.6+ (Debian 12 ships 2.6.x). 
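The "2.6+" expectation in Step 5 could be checked mechanically rather than by eye. The sketch below is one possible approach, not part of the plan: `haproxy_version_ok` is a hypothetical helper, it relies on `sort -V` (version sort) being available, and the commented usage line assumes the version number is the third word on the first line of `haproxy -v` output.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): succeed when a version string is at least 2.6.
# Uses version-aware sorting: if "2.6" sorts first (or ties), the given
# version is 2.6 or newer.

haproxy_version_ok() {
  local ver="$1" lowest
  lowest=$(printf '%s\n' "2.6" "$ver" | sort -V | sed -n '1p')
  [ "$lowest" = "2.6" ]
}

# Possible usage on CT 203 (assumes the version is the third field of line 1):
#   haproxy_version_ok "$(haproxy -v | awk 'NR==1 {print $3}')" && echo "version ok"
```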
-- [ ] **Step 6: Install openssh-server and push SSH key** +- [x] **Step 6: Install openssh-server and push SSH key** ```bash apt install -y openssh-server @@ -131,7 +131,7 @@ ssh root@192.168.200.103 hostname Expected: `awareness-lb` -- [ ] **Step 7: Configure HAProxy** +- [x] **Step 7: Configure HAProxy** Create `/etc/haproxy/haproxy.cfg`: @@ -167,6 +167,7 @@ backend awareness-backend http-check expect status 200 stick-table type string len 64 size 10k expire 30m stick on req.hdr(mcp-session-id) if { req.hdr(mcp-session-id) -m found } + stick store-response res.hdr(mcp-session-id) if { res.hdr(mcp-session-id) -m found } server app-a 192.168.200.110:8420 check inter 5s fall 3 rise 2 server app-b 192.168.200.111:8420 check inter 5s fall 3 rise 2 @@ -181,14 +182,14 @@ EOF **[USER]** Change the stats password (`admin:haproxy-stats`) to something from KeePass. -- [ ] **Step 8: Create runtime socket directory** +- [x] **Step 8: Create runtime socket directory** ```bash mkdir -p /var/run/haproxy chown haproxy:haproxy /var/run/haproxy ``` -- [ ] **Step 9: Validate config and restart** +- [x] **Step 9: Validate config and restart** ```bash haproxy -c -f /etc/haproxy/haproxy.cfg @@ -199,7 +200,7 @@ systemctl status haproxy Expected: Config valid, service active. Backends will show as DOWN until app LXCs are provisioned. -- [ ] **Step 10: Verify stats page** +- [x] **Step 10: Verify stats page** From workstation: ```bash @@ -216,7 +217,7 @@ Expected: Non-zero (stats page is serving, shows backend names). This script automates creating new app LXCs with all operational fixes applied. 
-- [ ] **Step 1: Create the provisioning script** +- [x] **Step 1: Create the provisioning script** Create `scripts/holodeck/create-app-ct.sh`: @@ -273,7 +274,7 @@ pct create "$CT_ID" "$TEMPLATE" \ --swap 256 \ --rootfs local-lvm:8 \ --net0 "name=eth0,bridge=vmbr0,ip=${IP}/24,gw=192.168.200.1" \ - --nameserver 192.168.200.1 \ + --nameserver 192.168.200.10 \ --unprivileged 1 \ --features nesting=0 \ --start 1 \ @@ -283,7 +284,7 @@ echo "Waiting for container to start..." sleep 5 echo "Installing base packages..." -pct exec "$CT_ID" -- bash -c "apt update -qq && apt install -y -qq openssh-server python3 python3-pip python3-venv python3-dev git build-essential libpq-dev > /dev/null 2>&1" +pct exec "$CT_ID" -- bash -c "apt update -qq && apt install -y -qq openssh-server sudo python3 python3-pip python3-venv python3-dev git build-essential libpq-dev curl > /dev/null 2>&1" echo "Configuring SSH..." pct exec "$CT_ID" -- bash -c "mkdir -p /root/.ssh && chmod 700 /root/.ssh" @@ -341,13 +342,13 @@ echo " 2. Start service: pct exec ${CT_ID} -- systemctl start mcp-awareness" echo " 3. 
Verify health: curl -s http://${IP}:8420/health | python3 -m json.tool" ``` -- [ ] **Step 2: Make executable** +- [x] **Step 2: Make executable** ```bash chmod +x scripts/holodeck/create-app-ct.sh ``` -- [ ] **Step 3: Commit** +- [x] **Step 3: Commit** ```bash git add scripts/holodeck/create-app-ct.sh @@ -360,21 +361,21 @@ git commit -m "infra: add app LXC provisioning script for holodeck" **Where:** `[holodeck]` -- [ ] **Step 1: Copy SSH key to holodeck** +- [x] **Step 1: Copy SSH key to holodeck** From workstation: ```bash scp ~/.ssh/id_ed25519.pub root@192.168.200.70:/tmp/awareness-ssh-key.pub ``` -- [ ] **Step 2: Copy provisioning script to holodeck** +- [x] **Step 2: Copy provisioning script to holodeck** From workstation: ```bash scp scripts/holodeck/create-app-ct.sh root@192.168.200.70:/tmp/create-app-ct.sh ``` -- [ ] **Step 3: Run provisioning script** +- [x] **Step 3: Run provisioning script** From holodeck: ```bash @@ -385,14 +386,14 @@ bash /tmp/create-app-ct.sh 210 110 awareness-app-a Expected: Script completes with "CT 210 (awareness-app-a) provisioned at 192.168.200.110." -- [ ] **Step 4: Copy env file from CT 201** +- [x] **Step 4: Copy env file from CT 201** From holodeck: ```bash pct exec 201 -- cat /etc/awareness/env | pct exec 210 -- bash -c 'cat > /etc/awareness/env && chmod 600 /etc/awareness/env' ``` -- [ ] **Step 5: Start the service** +- [x] **Step 5: Start the service** ```bash pct exec 210 -- systemctl start mcp-awareness @@ -401,7 +402,7 @@ pct exec 210 -- systemctl status mcp-awareness Expected: Active (running). 
-- [ ] **Step 6: Verify health** +- [x] **Step 6: Verify health** From holodeck: ```bash @@ -410,7 +411,7 @@ curl -s http://192.168.200.110:8420/health | python3 -m json.tool Expected: `{"status": "ok", ...}` -- [ ] **Step 7: Verify SSH from workstation** +- [x] **Step 7: Verify SSH from workstation** From workstation: ```bash @@ -419,7 +420,7 @@ ssh root@192.168.200.110 hostname Expected: `awareness-app-a` -- [ ] **Step 8: Verify CLI tools** +- [x] **Step 8: Verify CLI tools** ```bash ssh root@192.168.200.110 mcp-awareness-user list @@ -435,7 +436,7 @@ Expected: Shows user list (may fail if env isn't sourced — the CLI tools need Repeat Task 3 with different parameters. -- [ ] **Step 1: Run provisioning script** +- [x] **Step 1: Run provisioning script** From holodeck: ```bash @@ -444,13 +445,13 @@ bash /tmp/create-app-ct.sh 211 111 awareness-app-b **[USER]** Set root password when prompted, store in KeePass. -- [ ] **Step 2: Copy env file from CT 201** +- [x] **Step 2: Copy env file from CT 201** ```bash pct exec 201 -- cat /etc/awareness/env | pct exec 211 -- bash -c 'cat > /etc/awareness/env && chmod 600 /etc/awareness/env' ``` -- [ ] **Step 3: Start and verify** +- [x] **Step 3: Start and verify** ```bash pct exec 211 -- systemctl start mcp-awareness @@ -459,7 +460,7 @@ curl -s http://192.168.200.111:8420/health | python3 -m json.tool Expected: `{"status": "ok", ...}` -- [ ] **Step 4: Verify SSH from workstation** +- [x] **Step 4: Verify SSH from workstation** ```bash ssh root@192.168.200.111 hostname @@ -475,7 +476,7 @@ Expected: `awareness-app-b` Both app nodes should now be visible and healthy in HAProxy. -- [ ] **Step 1: Check HAProxy stats** +- [x] **Step 1: Check HAProxy stats** ```bash curl -s -u admin:haproxy-stats http://192.168.200.103:8421/\;csv | grep -E "app-a|app-b" | cut -d, -f1,2,18 @@ -483,7 +484,7 @@ curl -s -u admin:haproxy-stats http://192.168.200.103:8421/\;csv | grep -E "app- Expected: Both `app-a` and `app-b` show status `UP`. 
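The stats check above reads column 18 of HAProxy's CSV output, which is the status field. A small filter can turn that into a pass/fail signal. This is a sketch under assumptions: `servers_not_up` is a hypothetical name, and it only looks at the `app-a` and `app-b` server rows used in this plan.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): read HAProxy stats CSV on stdin and print
# "name status" for every app server whose status is not UP.
# Column 2 is the server name and column 18 is the status, matching the
# `cut -d, -f1,2,18` call in Step 1.

servers_not_up() {
  awk -F, '($2 == "app-a" || $2 == "app-b") && $18 != "UP" { print $2, $18 }'
}

# Possible usage from the workstation (same URL and credentials as Step 1):
#   curl -s -u admin:haproxy-stats 'http://192.168.200.103:8421/;csv' | servers_not_up
```

Empty output means both servers report `UP`. A drained node shows up in the output with its drain status, which is expected in the middle of a deploy.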
-- [ ] **Step 2: Test traffic routing** +- [x] **Step 2: Test traffic routing** Send a request through HAProxy and verify it reaches an app node: @@ -493,7 +494,7 @@ curl -s http://192.168.200.103:8420/health | python3 -m json.tool Expected: `{"status": "ok", ...}` — response came from one of the app nodes via HAProxy. -- [ ] **Step 3: Test session stickiness** +- [x] **Step 3: Test session stickiness** Initialize an MCP session through HAProxy and verify subsequent requests go to the same backend: @@ -511,7 +512,7 @@ curl -s http://192.168.200.103:8420/mcp -X POST -H "Content-Type: application/js Expected: Returns data (the session was routed to the same backend that created it). -- [ ] **Step 4: Test connection draining** +- [x] **Step 4: Test connection draining** Set app-a to drain and verify new requests go to app-b: @@ -537,7 +538,7 @@ ssh root@192.168.200.103 'echo "set server awareness-backend/app-a state ready" **Where:** `[laptop]` -- [ ] **Step 1: Create the deploy script** +- [x] **Step 1: Create the deploy script** Create `scripts/holodeck/deploy.sh`: @@ -669,16 +670,17 @@ maintenance_deploy() { done echo "" - echo "Step 2: All nodes drained. Running migration..." + echo "Step 2: Updating first node and running migration..." local first_ip first_ip=$(node_ip "${APP_NODES[0]}") - ssh "root@${first_ip}" 'cd /opt/mcp-awareness && git pull origin main && venv/bin/pip install -e . -q' - ssh "root@${first_ip}" 'cd /opt/mcp-awareness && set -a && source /etc/awareness/env && set +a && venv/bin/mcp-awareness-migrate upgrade head' + update_node "$first_ip" + ssh "root@${first_ip}" 'cd /opt/mcp-awareness && sudo -u awareness bash -c "set -a && source /etc/awareness/env && set +a && /opt/mcp-awareness/venv/bin/mcp-awareness-migrate upgrade head"' echo " Migration complete on ${first_ip}" + wait_healthy "$first_ip" || echo " WARNING: ${first_ip} not healthy after migration" echo "" - echo "Step 3: Updating and restarting all nodes..." 
- for entry in "${APP_NODES[@]}"; do + echo "Step 3: Updating remaining nodes..." + for entry in "${APP_NODES[@]:1}"; do local ip ip=$(node_ip "$entry") update_node "$ip" @@ -714,13 +716,13 @@ case "$MODE" in esac ``` -- [ ] **Step 2: Make executable** +- [x] **Step 2: Make executable** ```bash chmod +x scripts/holodeck/deploy.sh ``` -- [ ] **Step 3: Commit** +- [x] **Step 3: Commit** ```bash git add scripts/holodeck/deploy.sh @@ -735,7 +737,7 @@ git commit -m "infra: add zero-downtime deploy script (hot + maintenance modes)" This is the cutover — traffic starts flowing through HAProxy. -- [ ] **Step 1: Verify CT 201 is still serving (fallback ready)** +- [x] **Step 1: Verify CT 201 is still serving (fallback ready)** ```bash curl -s http://192.168.200.101:8420/health | python3 -m json.tool @@ -743,7 +745,7 @@ curl -s http://192.168.200.101:8420/health | python3 -m json.tool Expected: `{"status": "ok", ...}` -- [ ] **Step 2: Update tunnel config** +- [x] **Step 2: Update tunnel config** SSH to CT 202 and update the cloudflared config: @@ -765,7 +767,7 @@ ingress: - service: http_status:404 ``` -- [ ] **Step 3: Restart cloudflared** +- [x] **Step 3: Restart cloudflared** ```bash systemctl restart cloudflared @@ -775,7 +777,7 @@ journalctl -u cloudflared -n 10 --no-pager Expected: Active, "Connection established", "Registered tunnel connection". -- [ ] **Step 4: Verify end-to-end** +- [x] **Step 4: Verify end-to-end** From workstation, test via the public URL: @@ -796,14 +798,14 @@ Expected: Health response (or auth challenge, depending on mount path config). 
 
 **Where:** `[holodeck]`
 
-- [ ] **Step 1: Create resource pool**
+- [x] **Step 1: Create resource pool**
 
 ```bash
 pvesh create /pools --poolid awareness
 pvesh set /pools/awareness --vms 200,202,203,210,211
 ```
 
-- [ ] **Step 2: Set boot order for new containers**
+- [x] **Step 2: Set boot order for new containers**
 
 ```bash
 pct set 203 --onboot 1 --startup order=2,up=5
@@ -813,7 +815,7 @@ pct set 211 --onboot 1 --startup order=3,up=15
 
 Boot order: CT 200 (Postgres, order=1) → CT 203 (HAProxy, order=2, 5s delay) → CT 210+211 (apps, order=3, 15s delay for Postgres) → CT 202 (tunnel, order=4).
 
-- [ ] **Step 2b: Verify CT 202 boot order**
+- [x] **Step 2b: Verify CT 202 boot order**
 
 CT 202 (tunnel) must boot after HAProxy (CT 203) and the app nodes, otherwise cloudflared starts before its upstream is available.
 
@@ -827,27 +829,29 @@ If order is not set or is lower than 4, fix it:
 pct set 202 --onboot 1 --startup order=4,up=5
 ```
 
-- [ ] **Step 3: Update snapshot script**
-
-Edit `/usr/local/bin/awareness-snapshots.sh` on holodeck:
+- [x] **Step 3: Create backup script**
 
-Change:
-```bash
-for ct in 200 201 202; do
-```
+The plan originally called for `pct snapshot`, but local-lvm storage doesn't support
+container snapshots. Created `/usr/local/bin/awareness-snapshots.sh` using `vzdump`
+instead:
 
-To:
 ```bash
+#!/usr/bin/env bash
+set -euo pipefail
 for ct in 200 202 203 210 211; do
+  echo "Backing up CT ${ct}..."
+  vzdump "$ct" --mode snapshot --compress zstd --storage local
+done
+echo "Done."
 ```
 
-- [ ] **Step 4: Verify snapshot script**
+- [x] **Step 4: Verify backup script**
 
 ```bash
 bash /usr/local/bin/awareness-snapshots.sh
 ```
 
-Expected: Creates snapshots for all 5 containers.
+Expected: Creates vzdump backups for all 5 containers. Verified — ~1.9GB total (600MB pg, 214MB tunnel, 199MB lb, 436MB app-a, 435MB app-b).
 
 ---
 
@@ -855,7 +859,7 @@ Expected: Creates snapshots for all 5 containers.
**Where:** `[laptop]` -- [ ] **Step 1: Update SSH config** +- [x] **Step 1: Update SSH config** Replace the existing `holodeck` entry and add the new CT aliases in `~/.ssh/config`. Use the config from the **Remote access** section in Conventions above — it uses @@ -866,7 +870,7 @@ Key changes: - `holodeck` IdentityFile changes from `id_ed25519_github` to `id_ed25519` - New entries for `awareness-lb`, `awareness-app-a`, `awareness-app-b` with ProxyJump -- [ ] **Step 2: Verify from workstation** +- [x] **Step 2: Verify from workstation** ```bash ssh holodeck hostname @@ -883,7 +887,7 @@ Expected: `holodeck`, `awareness-lb`, `awareness-app-a`, `awareness-app-b` **Where:** `[laptop]` -- [ ] **Step 1: Run a hot deploy** +- [x] **Step 1: Run a hot deploy** ```bash scripts/holodeck/deploy.sh hot @@ -910,7 +914,7 @@ Expected output: === Hot deploy complete === ``` -- [ ] **Step 2: Verify service is healthy after deploy** +- [x] **Step 2: Verify service is healthy after deploy** ```bash curl -s http://192.168.200.103:8420/health | python3 -m json.tool @@ -918,7 +922,7 @@ curl -s http://192.168.200.103:8420/health | python3 -m json.tool Expected: `{"status": "ok", ...}` -- [ ] **Step 3: Verify from Claude Desktop** +- [x] **Step 3: Verify from Claude Desktop** **[USER]** Call `get_briefing` from Claude Desktop. Should work without reconnecting — existing sessions should have survived (they were drained, not killed). @@ -960,7 +964,7 @@ Remove CT 201 from any snapshot scripts or resource pools if it was added. **Where:** `[laptop]` -- [ ] **Step 1: Update maintenance guide** +- [x] **Step 1: Update maintenance guide** Update `docs/maintenance/holodeck/update-mcp-awareness.md` to reference the deploy script instead of manual steps: @@ -984,11 +988,11 @@ scripts/holodeck/deploy.sh maintenance See `docs/superpowers/specs/2026-04-02-zero-downtime-deployment-design.md` for details. 
``` -- [ ] **Step 2: Update deployment design spec topology diagram** +- [x] **Step 2: Update deployment design spec topology diagram** Update `docs/superpowers/specs/2026-04-01-holodeck-deployment-design.md` topology section to reflect the new architecture (HAProxy + app pool instead of single CT 201). -- [ ] **Step 3: Commit** +- [x] **Step 3: Commit** ```bash git add docs/ diff --git a/docs/superpowers/specs/2026-04-01-holodeck-deployment-design.md b/docs/superpowers/specs/2026-04-01-holodeck-deployment-design.md index 6c0d7d1e..2ff350a1 100644 --- a/docs/superpowers/specs/2026-04-01-holodeck-deployment-design.md +++ b/docs/superpowers/specs/2026-04-01-holodeck-deployment-design.md @@ -31,31 +31,46 @@ Holodeck is a Proxmox VE host with 40 Xeon threads, 128GB RAM, 2x Quadro P4000 G ## Topology ``` -┌──────────────────────────────────────────────────────────────┐ -│ holodeck (192.168.200.70) · Proxmox VE 8.4.1 │ -│ │ -│ ┌──────────────────┐ ┌──────────────────┐ ┌────────────┐ │ -│ │ CT 200 │ │ CT 201 │ │ CT 202 │ │ -│ │ awareness-pg │ │ awareness-app │ │ awareness- │ │ -│ │ 192.168.200.100 │ │ 192.168.200.101 │ │ tunnel │ │ -│ │ │ │ │ │ 192.168. 
│ │ -│ │ Postgres 17 │ │ mcp-awareness │ │ 200.102 │ │ -│ │ pgvector │ │ (pip from git) │ │ │ │ -│ │ PostGIS │ │ systemd service │ │ cloudflared│ │ -│ │ pg_stat_stmts │ │ │ │ systemd │ │ -│ │ │ │ OAuth enabled │ │ │ │ -│ │ :5432 LAN │ │ :8420 │ │ → CF tunnel│ │ -│ └──────────────────┘ └──────────────────┘ └────────────┘ │ -│ ↑ ↑ ↑ │ │ -│ │ pg connect │ │ embed │ │ -│ └────────────────────┘ │ │ │ -│ │ │ │ -│ Ollama (bare metal, 2x P4000) ◄─────┘ │ │ -│ :11434 │ │ -│ │ │ -│ Internet → Cloudflare → staging.mcpawareness.com ──┘ │ -│ → CT 202 tunnel → CT 201 awareness │ -└──────────────────────────────────────────────────────────────┘ +┌─────────────────────────────────────────────────────────────────────────┐ +│ holodeck (192.168.200.70) · Proxmox VE 8.4.1 │ +│ │ +│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ +│ │ CT 200 │ │ CT 203 │ │ CT 202 │ │ +│ │ awareness-pg │ │ awareness-lb │ │ awareness-tunnel │ │ +│ │ 192.168.200.100 │ │ 192.168.200.103 │ │ 192.168.200.102 │ │ +│ │ │ │ │ │ │ │ +│ │ Postgres 17 │ │ HAProxy 2.6 │ │ cloudflared │ │ +│ │ pgvector │ │ session sticky │ │ systemd │ │ +│ │ PostGIS │ │ :8420 frontend │ │ → CF tunnel │ │ +│ │ pg_stat_stmts │ │ :8421 stats │ │ │ │ +│ │ :5432 LAN │ │ │ │ │ │ +│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ +│ ↑ │ │ │ │ +│ │ pg connect │ │ │ │ +│ │ ┌─────────┘ └─────────┐ │ │ +│ │ ↓ ↓ │ │ +│ ┌──────────────────┐ ┌──────────────────┐ │ +│ │ CT 210 │ │ CT 211 │ │ +│ │ awareness-app-a │ │ awareness-app-b │ │ +│ │ 192.168.200.110 │ │ 192.168.200.111 │ │ +│ │ │ │ │ │ +│ │ mcp-awareness │ │ mcp-awareness │ │ +│ │ (pip from git) │ │ (pip from git) │ │ +│ │ systemd service │ │ systemd service │ │ +│ │ OAuth enabled │ │ OAuth enabled │ │ +│ │ :8420 │ │ :8420 │ │ +│ └──────────────────┘ └──────────────────┘ │ +│ ↑ ↑ ↑ ↑ │ +│ │ │ embed │ │ embed │ +│ │ │ │ │ │ +│ Ollama (bare metal, 2x P4000) ◄─────────┘────────┘ │ +│ :11434 │ +│ │ +│ Internet → Cloudflare → staging.mcpawareness.com ──→ CT 202 
tunnel │ +│ → CT 203 HAProxy → CT 210/211 (round-robin, sticky) │ +└─────────────────────────────────────────────────────────────────────────┘ + +CT 201 (awareness-app, 192.168.200.101) — decommissioned, replaced by CT 210/211. Synology NAS "Seska" (192.168.200.52) └─ /volume1/awareness-backups (encrypted, NFS, 10GB quota) @@ -93,20 +108,40 @@ Synology NAS "Seska" (192.168.200.52) 192.168.200.52:/volume1/awareness-backups /mnt/backup nfs rw,hard,intr 0 0 ``` -## CT 201 — Awareness App (`awareness-app`) +## CT 203 — HAProxy Load Balancer (`awareness-lb`) ### Provisioning -- CT ID: 201 -- Hostname: `awareness-app` +- CT ID: 203 +- Hostname: `awareness-lb` - Template: Debian 12 -- Static IP: `192.168.200.101/24`, gateway `192.168.200.1` -- Resources: 1 CPU core, 512MB RAM, 8GB disk (`local-lvm`) +- Static IP: `192.168.200.103/24`, gateway `192.168.200.1` +- Resources: 1 CPU core, 256MB RAM, 4GB disk (`local-lvm`) + +### Software +- HAProxy 2.6 (Debian 12 repos), socat, curl + +### Configuration +- Frontend: `:8420` → backend pool +- Backend: round-robin with `mcp-session-id` header stickiness (request + response) +- Health checks: `GET /health` every 5s +- Stats: `:8421` (admin-only) +- Session stick table: captures `mcp-session-id` from both request headers and server response headers to maintain MCP session affinity + +## CT 210/211 — Awareness App Pool (`awareness-app-a`, `awareness-app-b`) + +### Provisioning +- CT IDs: 210, 211 +- Hostnames: `awareness-app-a`, `awareness-app-b` +- Template: Debian 12 +- Static IPs: `192.168.200.110/24`, `192.168.200.111/24`, gateway `192.168.200.1` +- Resources: 1 CPU core, 512MB RAM, 8GB disk (`local-lvm`) each +- Provisioned via `scripts/holodeck/create-app-ct.sh` ### Software -- Python 3.12 (Debian repos or deadsnakes PPA) +- Python 3.11 (Debian repos) - Clone `cmeans/mcp-awareness` from GitHub - `pip install -e .` (editable install from source) -- Alembic migrations run on first deploy +- Alembic migrations run on first 
deploy (from one node only) ### Runtime — systemd service (`mcp-awareness.service`) ```ini @@ -148,12 +183,9 @@ AWARENESS_OLLAMA_URL=http://192.168.200.70:11434 ``` ### Updates -```bash -cd /opt/mcp-awareness -git pull -pip install -e . -sudo systemctl restart mcp-awareness -``` +Deployed via `scripts/holodeck/deploy.sh`: +- `deploy.sh hot` — rolling zero-downtime update (drain → update → health check → re-enable, one node at a time) +- `deploy.sh maintenance` — full stop, migrate, restart (brief downtime for schema changes) ### User provisioning Pre-provision Chris's user before first OAuth login: @@ -176,17 +208,16 @@ This tests the flow where OAuth login finds an existing user by email match. ### Runtime - Install as system service: `cloudflared service install` -- Tunnel config points to `http://192.168.200.101:8420` +- Tunnel config points to `http://192.168.200.103:8420` (HAProxy) - Credentials file copied from laptop (`~/.cloudflared/staging-config.yml` and tunnel JSON) -### Tunnel config update -The existing staging tunnel config references `http://awareness-oauth:8421` (Docker network). Must update to: +### Tunnel config ```yaml tunnel: credentials-file: /etc/cloudflared/credentials.json ingress: - hostname: staging.mcpawareness.com - service: http://192.168.200.101:8420 + service: http://192.168.200.103:8420 - service: http_status:404 ``` @@ -237,7 +268,8 @@ Each LXC has cgroup-enforced resource limits that Proxmox tracks. 
These metrics | LXC | Metric | Cloud equivalent | |-----|--------|-----------------| | CT 200 (Postgres) | CPU, RAM, disk I/O | RDS/Cloud SQL instance tier | -| CT 201 (Awareness) | CPU, RAM, request rate | Cloud Run instance sizing | +| CT 203 (HAProxy) | CPU, RAM, connections | Cloud LB (ALB/Cloud LB) | +| CT 210/211 (Awareness) | CPU, RAM, request rate | Cloud Run instance sizing | | CT 202 (Tunnel) | CPU, RAM, bandwidth | Cloudflare handles this in cloud (free) | | Host (Ollama) | GPU util, VRAM, latency | GPU instance or API costs | @@ -264,8 +296,10 @@ The Docker image, Postgres config, and load profiling data all carry forward at | holodeck | 192.168.200.70 | Proxmox host, Ollama bare metal | | Seska (Synology) | 192.168.200.52 | NAS, backup storage | | CT 200 | 192.168.200.100 | Postgres | -| CT 201 | 192.168.200.101 | Awareness app | | CT 202 | 192.168.200.102 | Cloudflare tunnel | +| CT 203 | 192.168.200.103 | HAProxy load balancer | +| CT 210 | 192.168.200.110 | Awareness app-a | +| CT 211 | 192.168.200.111 | Awareness app-b | | Laptop | (DHCP) | Fallback production stack | ## Implementation Order diff --git a/scripts/holodeck/create-app-ct.sh b/scripts/holodeck/create-app-ct.sh new file mode 100755 index 00000000..11597142 --- /dev/null +++ b/scripts/holodeck/create-app-ct.sh @@ -0,0 +1,119 @@ +#!/usr/bin/env bash +# mcp-awareness — ambient system awareness for AI agents +# Copyright (C) 2026 Chris Means +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU Affero General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU Affero General Public License for more details. 
+# +# You should have received a copy of the GNU Affero General Public License +# along with this program. If not, see . + +# Provision an awareness app LXC on holodeck. +# Usage: create-app-ct.sh +# Example: create-app-ct.sh 210 110 awareness-app-a +# +# Run from holodeck host. Requires: pct, a Debian 12 template, and the +# workstation SSH public key at /tmp/awareness-ssh-key.pub on holodeck. +set -euo pipefail + +CT_ID="${1:?Usage: create-app-ct.sh }" +IP_SUFFIX="${2:?Usage: create-app-ct.sh }" +HOSTNAME="${3:?Usage: create-app-ct.sh }" +IP="192.168.200.${IP_SUFFIX}" + +TEMPLATE=$(pveam list local | grep "debian-12-standard" | awk '{print $1}' | head -1) +if [[ -z "$TEMPLATE" ]]; then + echo "Error: No Debian 12 template found. Run: pveam download local debian-12-standard_12.12-1_amd64.tar.zst" >&2 + exit 1 +fi + +SSH_KEY="/tmp/awareness-ssh-key.pub" +if [[ ! -f "$SSH_KEY" ]]; then + echo "Error: SSH public key not found at $SSH_KEY" >&2 + echo "Copy your workstation key: scp ~/.ssh/id_ed25519.pub holodeck:/tmp/awareness-ssh-key.pub" >&2 + exit 1 +fi + +echo "Creating CT ${CT_ID} (${HOSTNAME}) at ${IP}..." + +echo "You will be prompted to set a root password for the container." +pct create "$CT_ID" "$TEMPLATE" \ + --hostname "$HOSTNAME" \ + --cores 1 \ + --memory 512 \ + --swap 256 \ + --rootfs local-lvm:8 \ + --net0 "name=eth0,bridge=vmbr0,ip=${IP}/24,gw=192.168.200.1" \ + --nameserver 192.168.200.10 \ + --unprivileged 1 \ + --features nesting=0 \ + --start 1 \ + --password + +echo "Waiting for container to start..." +sleep 5 + +echo "Installing base packages..." +pct exec "$CT_ID" -- bash -c "apt update -qq && apt install -y -qq openssh-server sudo python3 python3-pip python3-venv python3-dev git build-essential libpq-dev curl > /dev/null 2>&1" + +echo "Configuring SSH..." 
+pct exec "$CT_ID" -- bash -c "mkdir -p /root/.ssh && chmod 700 /root/.ssh" +pct push "$CT_ID" "$SSH_KEY" /root/.ssh/authorized_keys +pct exec "$CT_ID" -- bash -c "chmod 600 /root/.ssh/authorized_keys" + +echo "Creating awareness user..." +pct exec "$CT_ID" -- bash -c "useradd --system --create-home --shell /bin/bash awareness" + +echo "Cloning repo and installing..." +# NOTE: HTTPS clone requires the repo to be public, or a deploy key / credential +# helper configured on the container. If the repo is private, set up a read-only +# deploy key on each app node before running this script. +pct exec "$CT_ID" -- bash -c "mkdir -p /opt/mcp-awareness && chown awareness:awareness /opt/mcp-awareness" +pct exec "$CT_ID" -- bash -c "sudo -u awareness git clone https://github.com/cmeans/mcp-awareness.git /opt/mcp-awareness" +pct exec "$CT_ID" -- bash -c "sudo -u awareness python3 -m venv /opt/mcp-awareness/venv" +pct exec "$CT_ID" -- bash -c "sudo -u awareness /opt/mcp-awareness/venv/bin/pip install -e /opt/mcp-awareness" + +echo "Creating CLI symlinks..." +pct exec "$CT_ID" -- bash -c "ln -sf /opt/mcp-awareness/venv/bin/mcp-awareness-token /usr/local/bin/" +pct exec "$CT_ID" -- bash -c "ln -sf /opt/mcp-awareness/venv/bin/mcp-awareness-user /usr/local/bin/" +pct exec "$CT_ID" -- bash -c "ln -sf /opt/mcp-awareness/venv/bin/mcp-awareness-secret /usr/local/bin/" +pct exec "$CT_ID" -- bash -c "ln -sf /opt/mcp-awareness/venv/bin/mcp-awareness-migrate /usr/local/bin/" + +echo "Installing systemd service..." 
+pct exec "$CT_ID" -- bash -c 'cat > /etc/systemd/system/mcp-awareness.service << SVC
+[Unit]
+Description=MCP Awareness Server
+After=network.target
+
+[Service]
+Type=simple
+User=awareness
+EnvironmentFile=/etc/awareness/env
+ExecStart=/opt/mcp-awareness/venv/bin/mcp-awareness
+Restart=on-failure
+RestartSec=5
+WorkingDirectory=/opt/mcp-awareness
+
+[Install]
+WantedBy=multi-user.target
+SVC'
+pct exec "$CT_ID" -- bash -c "systemctl daemon-reload && systemctl enable mcp-awareness"
+
+echo "Creating env directory (env file must be copied separately)..."
+pct exec "$CT_ID" -- bash -c "mkdir -p /etc/awareness && chmod 700 /etc/awareness"
+
+echo ""
+echo "CT ${CT_ID} (${HOSTNAME}) provisioned at ${IP}."
+echo ""
+echo "Next steps:"
+echo " 1. Copy env file: pct exec ${CT_ID} -- bash -c 'cat > /etc/awareness/env << EOF'"
+echo " (paste contents from an existing app node or KeePass)"
+echo " 2. Start service: pct exec ${CT_ID} -- systemctl start mcp-awareness"
+echo " 3. Verify health: curl -s http://${IP}:8420/health | python3 -m json.tool"
diff --git a/scripts/holodeck/create-ct200.sh b/scripts/holodeck/create-ct200.sh
new file mode 100755
index 00000000..65274a54
--- /dev/null
+++ b/scripts/holodeck/create-ct200.sh
@@ -0,0 +1,39 @@
+#!/usr/bin/env bash
+# mcp-awareness — ambient system awareness for AI agents
+# Copyright (C) 2026 Chris Means
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+# Create CT 200 — Postgres LXC for awareness.
+# No arguments — creates CT 200 with fixed parameters (IP .100, 2 cores, 2GB RAM).
+# Run on holodeck host: bash create-ct200.sh
+set -euo pipefail
+
+echo "Creating CT 200 (awareness-pg)..."
+pct create 200 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
+    --hostname awareness-pg \
+    --cores 2 \
+    --memory 2048 \
+    --swap 512 \
+    --rootfs local-lvm:20 \
+    --net0 name=eth0,bridge=vmbr0,ip=192.168.200.100/24,gw=192.168.200.1 \
+    --nameserver 192.168.200.10 \
+    --unprivileged 1 \
+    --features nesting=0 \
+    --start 0 \
+    --password
+
+echo "CT 200 created. Starting..."
+pct start 200
+echo "CT 200 running."
diff --git a/scripts/holodeck/deploy.sh b/scripts/holodeck/deploy.sh
new file mode 100755
index 00000000..4f904a6c
--- /dev/null
+++ b/scripts/holodeck/deploy.sh
@@ -0,0 +1,172 @@
+#!/usr/bin/env bash
+# mcp-awareness — ambient system awareness for AI agents
+# Copyright (C) 2026 Chris Means
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+# Zero-downtime deploy for mcp-awareness on holodeck.
+# Run from any host with SSH access to app nodes and HAProxy.
+# Usage: deploy.sh hot — rolling code update, zero-downtime
+#        deploy.sh maintenance — full stop, migrate, restart (scheduled)
+set -euo pipefail
+
+HAPROXY_HOST="192.168.200.103"
+HAPROXY_SOCK="/var/run/haproxy/admin.sock"
+APP_NODES=("192.168.200.110:app-a" "192.168.200.111:app-b")
+DRAIN_TIMEOUT=60
+HEALTH_TIMEOUT=30
+HEALTH_INTERVAL=2
+
+MODE="${1:?Usage: deploy.sh <hot|maintenance>}"
+
+# --- Helpers ---
+
+haproxy_cmd() {
+    ssh "root@${HAPROXY_HOST}" "echo '$1' | socat stdio ${HAPROXY_SOCK}"
+}
+
+node_ip() { echo "${1%%:*}"; }
+node_name() { echo "${1##*:}"; }
+
+drain_node() {
+    local name="$1"
+    echo " Draining ${name}..."
+    haproxy_cmd "set server awareness-backend/${name} state drain"
+
+    local waited=0
+    while (( waited < DRAIN_TIMEOUT )); do
+        local conns
+        conns=$(haproxy_cmd "show stat" | grep "awareness-backend,${name}," | cut -d, -f5)
+        if [[ "${conns:-0}" == "0" ]]; then
+            echo " ${name}: all connections drained"
+            return 0
+        fi
+        echo " ${name}: ${conns} active connections, waiting..."
+        sleep 5
+        waited=$((waited + 5))
+    done
+    echo " WARNING: ${name} drain timeout (${DRAIN_TIMEOUT}s), proceeding anyway"
+}
+
+enable_node() {
+    local name="$1"
+    haproxy_cmd "set server awareness-backend/${name} state ready"
+    echo " ${name}: re-enabled"
+}
+
+update_node() {
+    local ip="$1"
+    echo " Updating ${ip}..."
+    ssh "root@${ip}" 'cd /opt/mcp-awareness && sudo -u awareness git pull origin main && sudo -u awareness venv/bin/pip install -e . -q && systemctl restart mcp-awareness'
+}
+
+wait_healthy() {
+    local ip="$1"
+    local waited=0
+    while (( waited < HEALTH_TIMEOUT )); do
+        if curl -sf "http://${ip}:8420/health" > /dev/null 2>&1; then
+            echo " ${ip}: healthy"
+            return 0
+        fi
+        sleep "$HEALTH_INTERVAL"
+        waited=$((waited + HEALTH_INTERVAL))
+    done
+    echo " ERROR: ${ip} failed health check after ${HEALTH_TIMEOUT}s"
+    return 1
+}
+
+# --- Hot deploy (rolling, zero-downtime) ---
+
+hot_deploy() {
+    echo "=== Hot deploy (zero-downtime) ==="
+    for entry in "${APP_NODES[@]}"; do
+        local ip name
+        ip=$(node_ip "$entry")
+        name=$(node_name "$entry")
+
+        echo ""
+        echo "--- ${name} (${ip}) ---"
+        drain_node "$name"
+        update_node "$ip"
+
+        if wait_healthy "$ip"; then
+            enable_node "$name"
+        else
+            echo " ALERT: ${name} failed health check — leaving drained!"
+            echo " Manual intervention required."
+            # Continue to next node — don't leave the whole service down
+        fi
+    done
+
+    echo ""
+    echo "=== Hot deploy complete ==="
+}
+
+# --- Maintenance deploy (full stop, migrate, restart) ---
+
+maintenance_deploy() {
+    echo "=== Maintenance deploy ==="
+    echo ""
+
+    # Drain all nodes
+    echo "Step 1: Draining all nodes..."
+    for entry in "${APP_NODES[@]}"; do
+        drain_node "$(node_name "$entry")"
+    done
+
+    echo ""
+    echo "Step 2: Updating first node and running migration..."
+    local first_ip
+    first_ip=$(node_ip "${APP_NODES[0]}")
+    update_node "$first_ip"
+    ssh "root@${first_ip}" 'cd /opt/mcp-awareness && sudo -u awareness bash -c "set -a && source /etc/awareness/env && set +a && /opt/mcp-awareness/venv/bin/mcp-awareness-migrate upgrade head"'
+    echo " Migration complete on ${first_ip}"
+    wait_healthy "$first_ip" || echo " WARNING: ${first_ip} not healthy after migration"
+
+    echo ""
+    echo "Step 3: Updating remaining nodes..."
+    for entry in "${APP_NODES[@]:1}"; do
+        local ip
+        ip=$(node_ip "$entry")
+        update_node "$ip"
+        wait_healthy "$ip" || echo " WARNING: ${ip} not healthy yet"
+    done
+
+    echo ""
+    echo "Step 4: Re-enabling all nodes..."
+    for entry in "${APP_NODES[@]}"; do
+        enable_node "$(node_name "$entry")"
+    done
+
+    echo ""
+    echo "=== Maintenance deploy complete ==="
+}
+
+# --- Main ---
+
+case "$MODE" in
+    hot)
+        hot_deploy
+        ;;
+    maintenance)
+        echo "This will briefly take the service offline for migrations."
+        read -p "Continue? [y/N] " -r
+        [[ $REPLY =~ ^[Yy]$ ]] || exit 0
+        maintenance_deploy
+        ;;
+    *)
+        echo "Usage: deploy.sh <hot|maintenance>" >&2
+        exit 1
+        ;;
+esac