diff --git a/.agents/scripts/commands/pulse.md b/.agents/scripts/commands/pulse.md
index e18c24ef5..ffb0da185 100644
--- a/.agents/scripts/commands/pulse.md
+++ b/.agents/scripts/commands/pulse.md
@@ -41,10 +41,29 @@ MAX_WORKERS=$(cat ~/.aidevops/logs/pulse-max-workers 2>/dev/null || echo 4)
 # Count running workers (only .opencode binaries, not node launchers)
 WORKER_COUNT=$(ps axo command | grep '/full-loop' | grep '\.opencode' | grep -v grep | wc -l | tr -d ' ')
 AVAILABLE=$((MAX_WORKERS - WORKER_COUNT))
+
+# Priority-class allocations (t1423) — read from pre-fetched state
+# The "Priority-Class Worker Allocations" section in the pre-fetched state
+# shows PRODUCT_MIN and TOOLING_MAX. Read these values:
+PRODUCT_MIN=$(grep '^PRODUCT_MIN=' ~/.aidevops/logs/pulse-priority-allocations 2>/dev/null | cut -d= -f2); PRODUCT_MIN=${PRODUCT_MIN:-0}
+TOOLING_MAX=$(grep '^TOOLING_MAX=' ~/.aidevops/logs/pulse-priority-allocations 2>/dev/null | cut -d= -f2); TOOLING_MAX=${TOOLING_MAX:-$MAX_WORKERS}
 ```
 
 If `AVAILABLE <= 0`: you can still merge ready PRs, but don't dispatch new workers.
 
+### Priority-class enforcement (t1423)
+
+Worker slots are partitioned between **product** repos (`"priority": "product"` in repos.json) and **tooling** repos (`"priority": "tooling"`). Product repos get a guaranteed minimum share (default 60%) to prevent tooling hygiene from starving user-facing work.
+
+**Before dispatching each worker, apply this check:**
+
+1. Determine the target repo's priority class (from the pre-fetched state repo header or repos.json).
+2. Count running workers per class: scan the Active Workers section — match each worker's `--dir` path to a repo in repos.json to determine its class.
+3. **If dispatching a tooling worker:** check whether product-class workers are using fewer than `PRODUCT_MIN` slots. If `product_active < PRODUCT_MIN` AND product repos have pending work (open issues or failing PRs), the remaining product slots are **reserved** — skip the tooling dispatch and look for product work instead.
+4. **If dispatching a product worker:** always proceed — product has no ceiling (only a floor).
+5. **Exemptions:** Merges (priority 1) and CI-fix dispatches (priority 2) are exempt from class checks — they always proceed regardless of class.
+6. **Soft reservation:** When product repos have no pending work (no open issues, no failing-CI PRs, no orphaned PRs), their reserved slots become available for tooling. The reservation protects product work when it exists, not when it doesn't.
+
 ## Step 2: Use Pre-Fetched State
 
 **The wrapper has ALREADY fetched open PRs and issues for all pulse-enabled repos.** The data is in your prompt above (between `--- PRE-FETCHED STATE ---` markers). Do NOT re-fetch with `gh pr list` or `gh issue list` — that wastes time and was the root cause of the "only processes first repo" bug (the agent would spend all its context analyzing the first repo's fetch results and never reach the others).
@@ -610,7 +629,7 @@ batch-strategy-helper.sh validate --tasks "$TASKS_JSON"
 2. PRs with failing CI or review feedback → fix (uses a slot, but closer to done than new issues)
 3. Issues labelled `priority:high` or `bug`
 4. Active mission features (keeps multi-day projects moving — see Step 3.5)
-5. Product repos (`"priority": "product"` in repos.json) over tooling
+5. Product repos (`"priority": "product"` in repos.json) over tooling — **enforced by priority-class reservations (t1423)**. Product repos have `PRODUCT_MIN` reserved slots; tooling cannot consume them when product work is pending. See "Priority-class enforcement" in Step 1.
 6. Smaller/simpler tasks over large ones (faster throughput)
 7. `quality-debt` issues (unactioned review feedback from merged PRs)
 8. `simplification-debt` issues (human-approved simplification opportunities)
@@ -1027,7 +1046,7 @@ Output a brief summary of what you did (past tense), then exit.
 3. **NEVER close an issue without a comment.** The comment must explain why and link to the PR(s) or evidence. Silent closes are audit failures.
 4. **NEVER use `claude` CLI.** Always `opencode run`.
 5. **NEVER include private repo names** in public issue titles/bodies/comments.
-6. **NEVER exceed MAX_WORKERS.** Count before dispatching.
+6. **NEVER exceed MAX_WORKERS or violate priority-class reservations.** Count before dispatching. Check class allocations (Step 1) — tooling workers must not consume product-reserved slots when product work is pending.
 7. **Do your job completely, then exit.** Don't loop or re-analyze — one pass through all repos, act on everything, exit.
 8. **NEVER create "pulse summary" or "supervisor log" issues.** The pulse runs every 2 minutes — creating an issue per cycle produces hundreds of spam issues per day. Your output text IS the log (it's captured by the wrapper to `~/.aidevops/logs/pulse.log`). The audit trail lives in PR/issue comments on the items you acted on, not in separate summary issues.
 9. **NEVER create an issue if one already exists for the same task ID.** Before `gh issue create`, check `gh issue list --repo --search "tNNN" --state all` to see if an issue with that task ID prefix already exists. If it does (open or closed), use the existing one — don't create a duplicate. This applies to both issue-sync-helper and manual issue creation.
diff --git a/.agents/scripts/pulse-wrapper.sh b/.agents/scripts/pulse-wrapper.sh
index 9af5fd0c2..ed64a4824 100755
--- a/.agents/scripts/pulse-wrapper.sh
+++ b/.agents/scripts/pulse-wrapper.sh
@@ -63,6 +63,7 @@ RAM_RESERVE_MB="${RAM_RESERVE_MB:-8192}" # 8 GB reserved for OS
 MAX_WORKERS_CAP="${MAX_WORKERS_CAP:-8}" # Hard ceiling regardless of RAM
 QUALITY_SWEEP_INTERVAL="${QUALITY_SWEEP_INTERVAL:-86400}" # 24 hours between sweeps
 DAILY_PR_CAP="${DAILY_PR_CAP:-5}" # Max PRs created per repo per day (GH#3821)
+PRODUCT_RESERVATION_PCT="${PRODUCT_RESERVATION_PCT:-60}" # % of worker slots reserved for product repos (t1423)
 
 # Process guard limits (t1398)
 CHILD_RSS_LIMIT_KB="${CHILD_RSS_LIMIT_KB:-2097152}" # 2 GB default — kill child if RSS exceeds this
@@ -82,6 +83,7 @@ RAM_RESERVE_MB=$(_validate_int RAM_RESERVE_MB "$RAM_RESERVE_MB" 8192)
 MAX_WORKERS_CAP=$(_validate_int MAX_WORKERS_CAP "$MAX_WORKERS_CAP" 8)
 QUALITY_SWEEP_INTERVAL=$(_validate_int QUALITY_SWEEP_INTERVAL "$QUALITY_SWEEP_INTERVAL" 86400)
 DAILY_PR_CAP=$(_validate_int DAILY_PR_CAP "$DAILY_PR_CAP" 5 1)
+PRODUCT_RESERVATION_PCT=$(_validate_int PRODUCT_RESERVATION_PCT "$PRODUCT_RESERVATION_PCT" 60 0)
 CHILD_RSS_LIMIT_KB=$(_validate_int CHILD_RSS_LIMIT_KB "$CHILD_RSS_LIMIT_KB" 2097152 1)
 CHILD_RUNTIME_LIMIT=$(_validate_int CHILD_RUNTIME_LIMIT "$CHILD_RUNTIME_LIMIT" 1800 1)
 SHELLCHECK_RSS_LIMIT_KB=$(_validate_int SHELLCHECK_RSS_LIMIT_KB "$SHELLCHECK_RSS_LIMIT_KB" 1048576 1)
@@ -312,6 +314,9 @@ prefetch_state() {
 	# Append repo hygiene data for LLM triage (t1417)
 	prefetch_hygiene >>"$STATE_FILE"
 
+	# Append priority-class worker allocations (t1423)
+	_append_priority_allocations >>"$STATE_FILE"
+
 	# Export PULSE_SCOPE_REPOS — comma-separated list of repo slugs that
 	# workers are allowed to create PRs/branches on (t1405, GH#2928).
 	# Workers CAN file issues on any repo (cross-repo self-improvement),
@@ -716,6 +721,54 @@ prefetch_active_workers() {
 	return 0
 }
 
+#######################################
+# Append priority-class worker allocations to state file (t1423)
+#
+# Reads the allocation file written by calculate_priority_allocations()
+# and formats it as a section the pulse agent can act on.
+#
+# The pulse agent uses this to enforce soft reservations: product repos
+# get a guaranteed minimum share of worker slots, tooling gets the rest.
+# When one class has no pending work, the other can use freed slots.
+#
+# Output: allocation summary to stdout (appended to STATE_FILE by caller)
+#######################################
+_append_priority_allocations() {
+	local alloc_file="${HOME}/.aidevops/logs/pulse-priority-allocations"
+
+	echo ""
+	echo "# Priority-Class Worker Allocations (t1423)"
+	echo ""
+
+	if [[ ! -f "$alloc_file" ]]; then
+		echo "- Allocation data not available — using flat pool (no reservations)"
+		echo ""
+		return 0
+	fi
+
+	# Read allocation values
+	local max_workers product_repos tooling_repos product_min tooling_max reservation_pct
+	max_workers=$(grep '^MAX_WORKERS=' "$alloc_file" | cut -d= -f2) || max_workers=4
+	product_repos=$(grep '^PRODUCT_REPOS=' "$alloc_file" | cut -d= -f2) || product_repos=0
+	tooling_repos=$(grep '^TOOLING_REPOS=' "$alloc_file" | cut -d= -f2) || tooling_repos=0
+	product_min=$(grep '^PRODUCT_MIN=' "$alloc_file" | cut -d= -f2) || product_min=0
+	tooling_max=$(grep '^TOOLING_MAX=' "$alloc_file" | cut -d= -f2) || tooling_max=0
+	reservation_pct=$(grep '^PRODUCT_RESERVATION_PCT=' "$alloc_file" | cut -d= -f2) || reservation_pct=60
+
+	echo "Worker pool: **${max_workers}** total slots"
+	echo "Product repos (${product_repos}): **${product_min}** reserved slots (${reservation_pct}% minimum)"
+	echo "Tooling repos (${tooling_repos}): **${tooling_max}** slots (remainder)"
+	echo ""
+	echo "**Enforcement rules:**"
+	echo "- Before dispatching a tooling-repo worker, check: are product-repo workers using fewer than ${product_min} slots? If yes, the remaining product slots are reserved — do NOT fill them with tooling work."
+	echo "- If product repos have no pending work (no open issues, no failing PRs), their reserved slots become available for tooling."
+	echo "- If all ${max_workers} slots are needed for product work, tooling gets 0 (product reservation is a minimum, not a maximum)."
+	echo "- Merges (priority 1) and CI fixes (priority 2) are exempt — they always proceed regardless of class."
+	echo ""
+
+	return 0
+}
+
 #######################################
 # Pre-fetch repo hygiene data for LLM triage (t1417)
 #
@@ -2948,6 +3001,7 @@ main() {
 	cleanup_worktrees
 	cleanup_stashes
 	calculate_max_workers
+	calculate_priority_allocations
 	check_session_count >/dev/null
 
 	# Run housekeeping BEFORE the pulse — these are shell-level operations
@@ -3091,6 +3145,91 @@ calculate_max_workers() {
 	return 0
 }
 
+#######################################
+# Calculate priority-class worker allocations (t1423)
+#
+# Reads repos.json to count product vs tooling repos, then computes
+# per-class slot reservations based on PRODUCT_RESERVATION_PCT.
+#
+# Product repos get a guaranteed minimum share of worker slots.
+# Tooling repos get the remainder. When one class has no pending work,
+# the other class can use the freed slots (soft reservation).
+#
+# Output: writes allocation data to the pulse-priority-allocations file;
+# _append_priority_allocations() formats it for STATE_FILE.
+#
+# Depends on: calculate_max_workers() having run first (reads pulse-max-workers)
+#######################################
+calculate_priority_allocations() {
+	local repos_json="${REPOS_JSON}"
+	local max_workers_file="${HOME}/.aidevops/logs/pulse-max-workers"
+	local alloc_file="${HOME}/.aidevops/logs/pulse-priority-allocations"
+
+	if [[ ! -f "$repos_json" ]] || ! command -v jq &>/dev/null; then
+		echo "[pulse-wrapper] repos.json or jq not available — skipping priority allocations" >>"$LOGFILE"
+		return 0
+	fi
+
+	local max_workers
+	max_workers=$(cat "$max_workers_file" 2>/dev/null || echo 4)
+	[[ "$max_workers" =~ ^[0-9]+$ ]] || max_workers=4
+
+	# Count pulse-enabled repos by priority class (single jq pass)
+	local product_repos tooling_repos
+	read -r product_repos tooling_repos < <(jq -r '
+		.initialized_repos |
+		map(select(.pulse == true and (.local_only // false) == false and .slug != "")) |
+		[
+			(map(select(.priority == "product")) | length),
+			(map(select(.priority == "tooling")) | length)
+		] | @tsv
+	' "$repos_json" 2>/dev/null) || true
+	product_repos=${product_repos:-0}
+	tooling_repos=${tooling_repos:-0}
+	[[ "$product_repos" =~ ^[0-9]+$ ]] || product_repos=0
+	[[ "$tooling_repos" =~ ^[0-9]+$ ]] || tooling_repos=0
+
+	# Calculate reservations
+	# product_min = ceil(max_workers * PRODUCT_RESERVATION_PCT / 100)
+	# Using integer arithmetic: ceil(a/b) = (a + b - 1) / b
+	local product_min tooling_max
+	if [[ "$product_repos" -eq 0 ]]; then
+		# No product repos — all slots available for tooling
+		product_min=0
+		tooling_max="$max_workers"
+	elif [[ "$tooling_repos" -eq 0 ]]; then
+		# No tooling repos — all slots available for product
+		product_min="$max_workers"
+		tooling_max=0
+	else
+		product_min=$(((max_workers * PRODUCT_RESERVATION_PCT + 99) / 100))
+		# Ensure product_min doesn't exceed max_workers
+		if [[ "$product_min" -gt "$max_workers" ]]; then
+			product_min="$max_workers"
+		fi
+		# Ensure at least 1 slot for tooling when tooling repos exist
+		# but only when there are multiple slots to distribute (with 1 slot,
+		# product keeps it — the reservation is a minimum guarantee)
+		if [[ "$max_workers" -gt 1 && "$product_min" -ge "$max_workers" && "$tooling_repos" -gt 0 ]]; then
+			product_min=$((max_workers - 1))
+		fi
+		tooling_max=$((max_workers - product_min))
+	fi
+
+	# Write allocation file (key=value, readable by pulse.md)
+	{
+		echo "MAX_WORKERS=${max_workers}"
+		echo "PRODUCT_REPOS=${product_repos}"
+		echo "TOOLING_REPOS=${tooling_repos}"
+		echo "PRODUCT_MIN=${product_min}"
+		echo "TOOLING_MAX=${tooling_max}"
+		echo "PRODUCT_RESERVATION_PCT=${PRODUCT_RESERVATION_PCT}"
+	} >"$alloc_file"
+
+	echo "[pulse-wrapper] Priority allocations: product_min=${product_min}, tooling_max=${tooling_max} (${product_repos} product, ${tooling_repos} tooling repos, ${max_workers} total slots)" >>"$LOGFILE"
+	return 0
+}
+
 # Only run main when executed directly, not when sourced.
 # The pulse agent sources this file to access helper functions
 # (check_external_contributor_pr, check_permission_failure_pr)
diff --git a/todo/tasks/t1423-brief.md b/todo/tasks/t1423-brief.md
new file mode 100644
index 000000000..6f6590177
--- /dev/null
+++ b/todo/tasks/t1423-brief.md
@@ -0,0 +1,34 @@
+# t1423: Priority-class worker reservations for per-repo concurrency fairness
+
+## Session Origin
+
+Interactive session, 2026-03-09. User asked whether workers still share concurrency across all repos.json. Confirmed yes — global pool with no per-class partitioning. User chose option 2 (priority-class reservations) over per-repo min/max or status quo.
+
+## What
+
+Add priority-class worker slot reservations to the pulse supervisor. Product repos (`"priority": "product"` in repos.json) get a guaranteed minimum share of worker slots (default 60%). Tooling repos get the remainder. Soft reservation — when one class has no pending work, the other can use freed slots.
+
+## Why
+
+Without reservations, tooling hygiene work (quality-debt, simplification-debt, CI fixes) can consume all worker slots before product repos' new features get dispatched. The existing priority order in pulse.md (item 5: "product over tooling") is LLM guidance, not enforcement — a busy tooling repo with many failing-CI PRs (priority 1-2) consumes all slots before product repos' lower-priority issues get a chance.
+
+## How
+
+1. **pulse-wrapper.sh**: Add `PRODUCT_RESERVATION_PCT` config (default 60%), `calculate_priority_allocations()` function that reads repos.json, counts product vs tooling repos, computes `PRODUCT_MIN` and `TOOLING_MAX`, writes to `~/.aidevops/logs/pulse-priority-allocations`.
+2. **pulse-wrapper.sh**: Add `_append_priority_allocations()` to format allocation data for the STATE_FILE.
+3. **pulse.md**: Update Step 1 to read allocation file and enforce class reservations before dispatch. Update priority order item 5 to reference enforcement. Update Hard Rule 6.
+
+## Acceptance Criteria
+
+- [ ] `calculate_priority_allocations()` correctly computes allocations for: normal case, small pool, 1 worker, no tooling, no product repos
+- [ ] Allocation data appears in pulse state file
+- [ ] pulse.md Step 1 includes class enforcement guidance
+- [ ] ShellCheck clean (SC1091 only)
+- [ ] All existing pulse-wrapper tests still pass
+
+## Context
+
+- 8 pulse-enabled repos: 4 product (cloudron-netbird-app, turbostarter-plus, awardsapp, essentials.com), 4 tooling (aidevops, aidevops.sh, quickfile-mcp, aidevops-cloudron-app)
+- Current MAX_WORKERS is RAM-based: `(free_mb - 8GB) / 1GB`, capped at 8
+- DAILY_PR_CAP=5 per repo already prevents PR flood, but doesn't prevent worker slot starvation
+- Quality-debt cap (30%) and simplification-debt cap (10%) are global against MAX_WORKERS
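
Reviewer note: the first acceptance criterion lists five allocation cases (normal case, small pool, 1 worker, no tooling, no product repos). The arithmetic in `calculate_priority_allocations()` can be checked in isolation with a minimal sketch like the one below. `compute_alloc` is a hypothetical helper that is not part of this PR; it assumes the same ceiling formula and clamping rules as the patch and takes its inputs as arguments instead of reading `repos.json` or the `pulse-max-workers` file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (NOT in the PR): mirrors the reservation arithmetic of
# calculate_priority_allocations() so the edge cases can be checked by hand.
# Usage: compute_alloc MAX_WORKERS RESERVATION_PCT PRODUCT_REPOS TOOLING_REPOS
# Prints "PRODUCT_MIN TOOLING_MAX" on one line.
compute_alloc() {
	local max_workers="$1" pct="$2" product_repos="$3" tooling_repos="$4"
	local product_min tooling_max

	if [[ "$product_repos" -eq 0 ]]; then
		# No product repos: nothing to reserve
		product_min=0
	elif [[ "$tooling_repos" -eq 0 ]]; then
		# No tooling repos: product may use the whole pool
		product_min="$max_workers"
	else
		# Integer ceiling of max_workers * pct / 100
		product_min=$(((max_workers * pct + 99) / 100))
		if [[ "$product_min" -gt "$max_workers" ]]; then
			product_min="$max_workers"
		fi
		# Leave one slot for tooling when there is more than one slot to share
		if [[ "$max_workers" -gt 1 && "$product_min" -ge "$max_workers" ]]; then
			product_min=$((max_workers - 1))
		fi
	fi

	tooling_max=$((max_workers - product_min))
	echo "$product_min $tooling_max"
}

compute_alloc 8 60 4 4 # normal case       -> 5 3
compute_alloc 4 60 4 4 # small pool        -> 3 1
compute_alloc 2 60 4 4 # ceil hits the cap -> 1 1
compute_alloc 1 60 4 4 # single worker     -> 1 0
compute_alloc 8 60 4 0 # no tooling repos  -> 8 0
compute_alloc 8 60 0 4 # no product repos  -> 0 8
```

With two slots, the 60% ceiling rounds up to the whole pool, so the "at least 1 slot for tooling" clamp is what produces an even 1/1 split; with a single slot the clamp is skipped and product keeps it.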