263 changes: 263 additions & 0 deletions AGENTS.md
@@ -0,0 +1,263 @@
# Agents: SSH Workflow with Per-User Unix Accounts (No Containers)

This document is the canonical spec for how agents collaborate on a shared host over **SSH** using **dedicated Unix user accounts** for isolation. It **removes the containerized option** and standardizes on per-user isolation only.

## Core Principles

- Always verify **git state** matches between local and remote before work begins.
- Do **all development/testing on the remote host** as the claimed Unix user.
- At the end, **pull the committed changes locally** and present a code-only PR.
- Use a **race-free user claim protocol** (lock + heartbeat) so agents never collide.

---

## Quick Start (TL;DR)

```bash
# 0) Ensure we start from the same git ref
./bin/agent git:compare # runs locally

# 1) SSH (with -t so the session gets a TTY), claim a user, then drop into a session
ssh -t "$REMOTE" "cd $TARGET_REPO && ./bin/agent user:claim --ref '$GIT_REF' --branch '$BRANCH' && ./bin/agent user:shell"

# 2) Inside the session, work normally (build/test/commit/push)
# ...

# 3) Finalize locally and render a code-only PR
./bin/agent finalize --ref "$GIT_REF" --remote "$REMOTE" --branch "$BRANCH"
```

---

## Host Prerequisites

- A pool of dedicated users: `agent-0`, `agent-1`, `agent-2`, and `agent-3` (expandable if needed).
- The orchestrator user can switch to any pool user with passwordless `sudo -iu <agent>`.
- Standard tools: `flock`, `pgrep`, `who`, `date`, `awk`, `sudo` (and optionally `loginctl`).

### Files & Directories

```
/etc/agents/users.txt # pool: one username per line
/var/lock/agents/ # lockfiles (root-owned)
/var/run/agents/ # leases (root-owned, tmpfs recommended)
```
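
The layout above can be created with a few commands. The following is a minimal sketch, not part of the spec: `setup_agent_layout` is a hypothetical helper, and its prefix argument exists only so the layout can be tried in a scratch directory first. With an empty prefix it targets the real paths and needs root. Ownership and tmpfs setup are left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (an assumption, not defined by the spec):
# create the directory layout under an optional prefix and write the pool file.
setup_agent_layout() {
  local prefix=$1; shift
  mkdir -p "$prefix/etc/agents" "$prefix/var/lock/agents" "$prefix/var/run/agents"
  # One username per line, as the pool file expects
  printf '%s\n' "$@" > "$prefix/etc/agents/users.txt"
}

# Example (as root, real paths):
#   setup_agent_layout "" agent-0 agent-1 agent-2 agent-3
```

Running it against a temporary prefix first is a cheap way to check the layout before touching `/etc` and `/var`.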

### Sudoers

```
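# /etc/sudoers.d/agents
# Members of the agents group may run commands as pool users only, without a password.
# This is what allows a plain `sudo -iu agent-N`; it does not grant root.
Runas_Alias AGENT_USERS = agent-0, agent-1, agent-2, agent-3
%agents ALL=(AGENT_USERS) NOPASSWD: ALL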
```

Add the orchestrator user to the `agents` group.

---

## Active Session Detection

A user is **in use** if any of the following are true:

- A **fresh lease** exists and its PID is alive.
- `who` shows the username as logged in.
- `pgrep -u "$user" -fa "sshd: $user@"` finds an active SSH session.
- (Optional) `loginctl list-sessions` shows an active session.

Helper script `bin/agent-user-active`:

```bash
#!/usr/bin/env bash
# Exit 0 if user $1 looks active, 1 otherwise.
set -euo pipefail
u=${1:?user}
lease="/var/run/agents/${u}.lease"

is_alive() { kill -0 "$1" 2>/dev/null; }

# 1) Fresh lease (less than 120 s old) whose PID is still alive
if [[ -f "$lease" ]]; then
  read -r pid ts < "$lease" || true
  now=$(date +%s)
  if [[ -n "${pid:-}" ]] && is_alive "$pid" && (( now - ${ts:-0} < 120 )); then exit 0; fi
fi
# 2) SSH session processes owned by the user
pgrep -u "$u" -fa "sshd: $u@" >/dev/null && exit 0
# 3) Login reported by who
who | awk '{print $1}' | grep -qx "$u" && exit 0
exit 1
```

---

## Race-Free User Claiming (Lock + Heartbeat)

We rely on **per-user lockfiles** guarded by `flock` and a **heartbeat lease** to avoid collisions and stale claims.

### Acquire (`bin/agent-user-acquire`)

```bash
#!/usr/bin/env bash
set -euo pipefail
POOL_FILE=${POOL_FILE:-/etc/agents/users.txt}
LEASE_DIR=${LEASE_DIR:-/var/run/agents}
LOCK_DIR=${LOCK_DIR:-/var/lock/agents}
HEARTBEAT=${HEARTBEAT:-30}

# Shuffle so agents spread evenly
mapfile -t pool < <(shuf "$POOL_FILE")

for u in "${pool[@]}"; do
  lock="$LOCK_DIR/$u.lock"
  exec {fd}<>"$lock" || continue
  if flock -n "$fd"; then
    # Double-check other activity
    if bin/agent-user-active "$u"; then
      flock -u "$fd"; continue
    fi
    lease="$LEASE_DIR/$u.lease"
    echo "$$ $(date +%s)" > "$lease"
    # Heartbeat while this script's process ($$) lives
    (
      while kill -0 $$ 2>/dev/null; do
        echo "$$ $(date +%s)" > "$lease"; sleep "$HEARTBEAT"
      done
    ) & disown
    # Output: username, lock fd number (valid only inside this process), and lease path
    echo "$u $fd $lease"
    exit 0
  fi
done

echo "No available users in pool" >&2
exit 2
```
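
A `flock` lock belongs to the open file description, so it is released as soon as the last file descriptor referring to it is closed. One way to keep a claim for a whole session is to open the lockfile in the long-lived process itself and run the session under it. The following is a minimal sketch under that assumption. `with_user_lock` and the exit code 75 are hypothetical and not part of the spec.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Defaults to the spec's lock directory; override for experiments.
LOCK_DIR=${LOCK_DIR:-/var/lock/agents}

# Hypothetical helper: run a command while holding the per-user lock.
with_user_lock() {
  local u=$1; shift
  local fd rc=0
  exec {fd}<>"$LOCK_DIR/$u.lock"   # open the lockfile in *this* process
  if ! flock -n "$fd"; then
    exec {fd}>&-
    return 75                      # assumed convention: user already claimed
  fi
  "$@" || rc=$?                    # the lock is held while the command runs
  exec {fd}>&-                     # closing the fd releases the lock
  return "$rc"
}

# Example: with_user_lock agent-0 ./bin/agent user:shell
```

Because the lock lives in the caller, a crash of the caller also drops the lock without any cleanup step.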

### Release & GC

`bin/agent-user-release`:

```bash
#!/usr/bin/env bash
set -euo pipefail
u=${1:?user}
rm -f "/var/run/agents/${u}.lease" || true
```

`bin/agent-user-gc`:

```bash
#!/usr/bin/env bash
set -euo pipefail
LEASE_DIR=${LEASE_DIR:-/var/run/agents}
MAX_AGE=${MAX_AGE:-600}
now=$(date +%s)
for f in "$LEASE_DIR"/*.lease; do
  [[ -e "$f" ]] || continue
  read -r pid ts < "$f" || continue
  if ! kill -0 "$pid" 2>/dev/null || (( now - ts > MAX_AGE )); then
    rm -f "$f"
  fi
done
```
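
The GC treats a lease as stale once its PID is gone, so the lease is only useful if it records a process that lives as long as the session. The following is a minimal sketch of a heartbeat tied to an explicit owner PID. `heartbeat` is a hypothetical helper, and removing the lease as soon as the owner exits is an added assumption, not something the spec requires.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Defaults to the spec's lease directory; override for experiments.
LEASE_DIR=${LEASE_DIR:-/var/run/agents}

# Hypothetical helper: refresh <user>.lease for as long as owner_pid is alive.
heartbeat() {
  local u=$1 owner_pid=$2 interval=${3:-30}
  local lease="$LEASE_DIR/$u.lease"
  while kill -0 "$owner_pid" 2>/dev/null; do
    # Same "<pid> <epoch-seconds>" format the helper scripts read
    echo "$owner_pid $(date +%s)" > "$lease"
    sleep "$interval"
  done
  rm -f "$lease"   # owner is gone: drop the lease immediately
}

# Example, started from the long-lived session process:
#   heartbeat "$claimed_user" $$ 30 &
```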

---

## Orchestration CLI (`bin/agent`)

A single entrypoint that agents and humans call for the full workflow.

```bash
#!/usr/bin/env bash
set -euo pipefail
cmd=${1:-help}; shift || true

case "$cmd" in
  git:compare)
    : "${BRANCH:?Set BRANCH}";
    LOCAL=$(git rev-parse "$BRANCH")
    REMOTE=$(git ls-remote --heads origin "$BRANCH" | awk '{print $1}')
    test "$LOCAL" = "$REMOTE" || { echo "Branch mismatch $LOCAL != $REMOTE" >&2; exit 1; }
    ;;

  user:claim)
    read USER FD LEASE < <(bin/agent-user-acquire)
    echo "$USER" > .agent.user
    # Tiny window: re-check then proceed
    if bin/agent-user-active "$USER"; then
      flock -u "$FD"; rm -f "$LEASE"; exec "$0" user:claim "$@"
      # Review comment: "Bug: Script Locking Issues and Incorrect PID Handling"
      # The bin/agent-user-acquire script immediately releases the flock upon exiting
      # because its file descriptor is closed, preventing the calling process from
      # holding the lock and creating a race condition. Additionally, the heartbeat
      # subprocess incorrectly checks its own PID instead of the parent's, leading to
      # indefinite execution and stale leases.

    fi
    ;;

  user:shell)
    USER=$(cat .agent.user)
    sudo -iu "$USER" bash -lc '
      set -euo pipefail
      if [ ! -d "$TARGET_REPO" ]; then git clone "$REPO_URL" "$TARGET_REPO"; fi
      cd "$TARGET_REPO"
      git fetch --all --tags
      git checkout -B "$BRANCH" "$GIT_REF"
      # Your bootstrap here (deps, tests, etc.)
      exec bash
    '
    ;;
  # Review comment: "Bug: Command Fails Due to FD Closure and Variable Expansion Issues"
  # The user:claim command tries to release a flock using a file descriptor number
  # from a subprocess, but this FD is already closed, potentially leaving locks
  # unreleased. It also ignores the --ref and --branch arguments. Separately, the
  # user:shell command fails because variables like $TARGET_REPO are not expanded or
  # passed into the sudo environment, making them undefined for git operations.


  run)
    USER=$(cat .agent.user)
    sudo -iu "$USER" bash -lc "$*" ;;

  commit)
    USER=$(cat .agent.user)
    sudo -iu "$USER" bash -lc "cd \"$TARGET_REPO\" && git add -A && git commit -m 'agent: update' && git push -u origin \"$BRANCH\"" ;;

  finalize)
    : "${BRANCH:?Set BRANCH}"; git fetch && git checkout "$BRANCH" && git pull --ff-only
    git --no-pager diff --name-status origin/main...HEAD
    ;;
  # Review comment: "Bug: Agent Script Fails to Propagate Variables"
  # The bin/agent script has inconsistent variable handling. Commands executed via
  # sudo -iu (e.g., user:shell, run, commit) don't propagate necessary variables like
  # $TARGET_REPO and $BRANCH to the target user's environment. Additionally, the
  # finalize command is documented to accept --ref, --remote, and --branch arguments
  # but doesn't parse them, instead relying on an unset $BRANCH environment variable.


  user:release)
    USER=$(cat .agent.user)
    bin/agent-user-release "$USER" || true
    rm -f .agent.user || true
    ;;

  *)
    echo "Usage: bin/agent [git:compare|user:claim|user:shell|run|commit|finalize|user:release]" ;;
esac
```
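
The Quick Start passes `--ref`, `--branch`, and `--remote`, while the subcommands read `GIT_REF`, `BRANCH`, and `REMOTE` from the environment. One possible way to connect the two is a small flag parser that each subcommand calls with its remaining arguments. This is a sketch only: `parse_flags` and the exit code 64 are hypothetical and not part of the spec.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: map --ref/--branch/--remote onto GIT_REF/BRANCH/REMOTE.
# Variables that are already set in the environment are kept unless a flag overrides them.
parse_flags() {
  while (( $# > 0 )); do
    case "$1" in
      --ref)    GIT_REF=${2:?--ref needs a value};    shift 2 ;;
      --branch) BRANCH=${2:?--branch needs a value};  shift 2 ;;
      --remote) REMOTE=${2:?--remote needs a value};  shift 2 ;;
      *) echo "Unknown option: $1" >&2; return 64 ;;
    esac
  done
}

# Example inside a subcommand:
#   finalize) parse_flags "$@"; : "${BRANCH:?Set BRANCH or pass --branch}"; ... ;;
```

Keeping the environment variables as a fallback means existing callers that export `BRANCH` keep working.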

---

## End-to-End SSH Flow

1. **Local**: `./bin/agent git:compare` to assert that the local and remote branch SHAs match.
2. **Remote** (via SSH): `./bin/agent user:claim` to atomically acquire an available user.
3. **Remote**: `./bin/agent user:shell` to enter a login shell as the claimed user and bootstrap the repo at the target ref.
4. **Remote**: develop, test, `./bin/agent commit`.
5. **Local**: `./bin/agent finalize` to pull and present a code-only diff/PR.
6. **Remote**: `./bin/agent user:release`. If the orchestrator dies, the claim is meant to be released automatically as well: the lock is tied to an open file descriptor, and the GC removes stale heartbeat leases.

---

## Security & Policy

- **Least privilege**: Claimed users have only the permissions they need in their home and the repo workspace.
- **Visibility**: Non-root users see only their own processes when `/proc` is mounted with `hidepid=2` (Linux otherwise shows all processes to every user); the orchestrator can audit via `sudo` if necessary.
- **Auditing**: Tag commits with `user.name = Agent <agent-X>` where `X` is 0–3.
- **Quotas**: Optional per-user disk quotas or per-user ZFS/LVM datasets.
- **Network policy**: Egress allowlist if running third-party agents.

---

## FAQ

**Why not containers?**
We standardized on Unix users to reduce complexity and avoid container runtime dependencies while keeping isolation via file permissions and process visibility.

**How do we avoid races?**
Atomic `flock` per user, immediate re-check, and a heartbeat lease + GC keep claims correct even under crashes.

**Can we scale beyond four users?**
Yes—append to `/etc/agents/users.txt` (e.g., `agent-4`, `agent-5`, …). The claim loop shuffles to spread load.

**How do we add language toolchains?**
Install per-user or system-wide pinned toolchains; or add Nix (`nix develop`) for reproducible dev shells without containers.

---

*End of spec.*