
t1165.2: Container pool manager for supervisor#2184

Merged
marcusquinn merged 1 commit into main from feature/t1165.2
Feb 23, 2026

Conversation


@marcusquinn marcusquinn commented Feb 23, 2026

Summary

  • Adds container pool manager module (container-pool.sh) to the supervisor for managing Docker/OrbStack container pools
  • Implements spawn/destroy lifecycle, health checks, round-robin dispatch selection, and per-container rate limit tracking
  • Integrates with dispatch.sh via opt-in SUPERVISOR_USE_CONTAINER_POOL=true environment variable

What's included

New files

  • .agents/scripts/supervisor/container-pool.sh — Pool manager module (849 lines)
  • tests/test-container-pool.sh — 45 unit tests covering all pool operations

Modified files

  • .agents/scripts/supervisor/database.sh — Schema migration for container_pool and container_dispatch_log tables
  • .agents/scripts/supervisor/dispatch.sh — Pool-aware dispatch hook (selects container via round-robin before dispatch)
  • .agents/scripts/supervisor-helper.sh — Module source + pool command routing

Features

  • Spawn/destroy: Docker containers with OAuth token injection (gopass or env var), volume mounts for agents
  • Health checks: Docker state verification + rate limit cooldown expiry detection
  • Round-robin dispatch: Selects healthy container with oldest last_dispatch_at timestamp
  • Rate limit tracking: Per-container cooldown with configurable duration, auto-triggered on rate_limited outcome
  • Idle cleanup: Destroys containers idle beyond timeout, respects CONTAINER_POOL_MIN
  • CLI: supervisor-helper.sh pool {spawn|destroy|list|status|health|select|rate-limit|cleanup|help}
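The round-robin rule above ("healthy container with the oldest last_dispatch_at wins, skipping rate-limited ones") can be sketched without Docker or SQLite. This is an illustrative stand-in, not the module's actual API: the array substitutes for the container_pool table and select_container is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch of oldest-first round-robin selection. Each entry stands in
# for a container_pool row: "<container_id>|<status>|<last_dispatch_at>".
set -euo pipefail

pool=(
	"worker-a|healthy|2026-02-23T01:00:00Z"
	"worker-b|healthy|2026-02-22T23:55:00Z"
	"worker-c|rate_limited|2026-02-22T20:00:00Z"
)

select_container() {
	local best_id="" best_ts=""
	local entry id status ts
	for entry in "${pool[@]}"; do
		IFS='|' read -r id status ts <<<"$entry"
		[[ "$status" == "healthy" ]] || continue  # skip rate-limited/unhealthy
		# ISO-8601 timestamps sort lexicographically, so a string compare works
		if [[ -z "$best_ts" || "$ts" < "$best_ts" ]]; then
			best_id="$id" best_ts="$ts"
		fi
	done
	[[ -n "$best_id" ]] && printf '%s\n' "$best_id"
}

select_container  # worker-b: healthy, with the oldest dispatch timestamp
```

The real module expresses the same ordering in SQL (ORDER BY COALESCE(last_dispatch_at, …) ASC LIMIT 1), as the review excerpts below show.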

Quality

  • ShellCheck clean (zero violations)
  • 45/45 unit tests passing
  • All functions have explicit returns
  • local var="$1" pattern throughout
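The two conventions above (explicit returns and the local var="$1" pattern) can be illustrated with a minimal hypothetical helper, not taken from the module:

```shell
# Hypothetical helper showing both conventions: the positional argument
# is copied into a named local immediately, and every branch returns an
# explicit status so callers can rely on the exit code.
pool_is_known_status() {
	local status="$1"  # local var="$1" pattern

	case "$status" in
		healthy|unhealthy|rate_limited|stopped|failed)
			return 0
			;;
		*)
			return 1
			;;
	esac
}
```

Usage follows the normal shell idiom: `if pool_is_known_status "$s"; then …`.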

Ref #1764

Summary by CodeRabbit

Release Notes

  • New Features

    • Added container pool management functionality accessible via the pool command
    • Pool supports spawning and destroying containers with automatic lifecycle management
    • Built-in health checking and round-robin container selection for workload distribution
    • Rate limiting capabilities to control container usage
    • Pool introspection tools including status listing and statistics
  • Tests

    • Added comprehensive test coverage for pool management operations

Implements container pool management in the supervisor with:
- Spawn/destroy Docker containers with OAuth token injection
- Health checks (Docker state + rate limit cooldown expiry)
- Round-robin dispatch across healthy containers (oldest-first)
- Per-container rate limit tracking with configurable cooldown
- Idle container cleanup respecting pool minimum size
- CLI: supervisor-helper.sh pool {spawn|destroy|list|status|health|select|rate-limit|cleanup}
- DB schema: container_pool + container_dispatch_log tables
- Integration hook in dispatch.sh (opt-in via SUPERVISOR_USE_CONTAINER_POOL=true)
- 45/45 unit tests passing, ShellCheck clean
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the supervisor's capabilities by introducing a sophisticated container pooling system. This system automates the management of worker containers, ensuring efficient resource utilization, load balancing through round-robin dispatch, and resilience against rate limits or unhealthy containers. It provides a scalable foundation for executing tasks in parallel, improving overall system performance and reliability.

Highlights

  • New Container Pool Manager: A new module, container-pool.sh, has been added to the supervisor to manage a pool of Docker/OrbStack containers for parallel task dispatch.
  • Container Lifecycle and Management: The manager implements full container lifecycle operations including spawning, destroying, health checks, round-robin dispatch selection, and per-container rate limit tracking with configurable cooldowns.
  • Database Integration: New database tables, container_pool and container_dispatch_log, have been introduced to store container registry information, health/rate-limit states, and dispatch history.
  • Dispatch System Integration: The dispatch.sh script has been updated to optionally integrate with the container pool, allowing tasks to be routed to available containers based on a round-robin strategy via the SUPERVISOR_USE_CONTAINER_POOL environment variable.
  • Comprehensive Testing and CLI: The new module includes 45 unit tests covering all pool operations and exposes a pool subcommand through supervisor-helper.sh for command-line management.
Changelog
  • .agents/scripts/supervisor-helper.sh
    • Added pool subcommand routing to integrate the new container pool manager.
    • Included the new container-pool.sh module.
  • .agents/scripts/supervisor/container-pool.sh
    • Introduced a new module for managing Docker/OrbStack container pools.
    • Defined schema for container_pool and container_dispatch_log tables.
    • Implemented functions for container spawning, destruction, health checking, selection, and rate limit management.
    • Provided CLI commands for interacting with the container pool.
  • .agents/scripts/supervisor/database.sh
    • Added migration logic to create the container_pool and container_dispatch_log tables if they do not exist.
    • Included a call to _create_container_pool_schema during database initialization.
  • .agents/scripts/supervisor/dispatch.sh
    • Integrated logic to select a container from the pool using pool_select_container for task dispatch when enabled.
    • Added functionality to record task dispatches to specific containers using pool_record_dispatch.
  • tests/test-container-pool.sh
    • Added a new suite of 45 unit tests for the container-pool.sh module.
    • Tests cover schema creation, container spawning and destruction, health checks, round-robin selection, rate limit tracking, idle cleanup, pool statistics, and CLI command routing.


coderabbitai bot commented Feb 23, 2026

Walkthrough

This PR introduces a comprehensive container pool management system to the supervisor framework. It adds a new pool subcommand with SQLite-backed lifecycle management, health monitoring, round-robin container selection with rate limiting, dispatch logging, and idle cleanup.

Changes

  • CLI Entry & Module Routing — .agents/scripts/supervisor-helper.sh
    Adds pool subcommand dispatch to cmd_pool and sources the new container-pool.sh module; updates help/usage text.
  • Container Pool Implementation — .agents/scripts/supervisor/container-pool.sh
    Introduces a comprehensive pool manager (~850 lines): container spawn/destroy, per-container and pool-wide health checks, round-robin selection with rate-limit awareness, dispatch logging/completion, rate-limit management, idle cleanup, pool statistics, and CLI routing via cmd_pool.
  • Database Integration — .agents/scripts/supervisor/database.sh
    Adds migration logic and an initialization hook to create the container pool schema (tables: container_pool, container_dispatch_log) with indices during init_db and migrations.
  • Dispatch Integration — .agents/scripts/supervisor/dispatch.sh
    Introduces an optional pool-based dispatch path: when SUPERVISOR_USE_CONTAINER_POOL is enabled, selects containers via pool_select_container and records dispatch events.
  • Test Suite — tests/test-container-pool.sh
    Comprehensive test suite (~583 lines) covering schema creation, spawn/destroy, health checks, round-robin selection, rate limiting, completion recording, statistics, cleanup, CLI routing, and Docker invocation verification.

Sequence Diagram

sequenceDiagram
    participant CLI as supervisor-helper.sh pool
    participant PM as pool_*() functions
    participant DB as SQLite Database
    participant Docker as Docker CLI
    
    CLI->>PM: cmd_pool spawn [args]
    PM->>DB: Insert container record (status=starting)
    PM->>Docker: docker run (local or remote)
    Docker-->>PM: Container ID
    PM->>DB: Update with container ID & image
    
    rect rgba(100, 150, 200, 0.5)
    Note over PM,DB: Health Check Loop
    PM->>Docker: docker inspect
    PM->>DB: Update status & health_check_ts
    end
    
    rect rgba(150, 200, 100, 0.5)
    Note over PM,DB: Dispatch Flow (when enabled)
    CLI->>PM: pool_select_container()
    PM->>DB: Query healthy containers (skip rate-limited)
    PM-->>CLI: Container (round-robin)
    CLI->>PM: pool_record_dispatch(container_id)
    PM->>DB: Insert dispatch_log entry
    PM->>DB: Increment dispatch_count
    end
    
    rect rgba(200, 100, 100, 0.5)
    Note over PM,DB: Completion & Rate Limiting
    CLI->>PM: pool_record_completion(container_id, outcome)
    PM->>DB: Update completion_ts
    alt Rate Limited Outcome
        PM->>PM: pool_mark_rate_limited()
        PM->>DB: Set rate_limit_until cooldown
    end
    end
    
    CLI->>PM: pool_destroy_idle()
    PM->>DB: Query idle containers
    PM->>Docker: docker stop/rm
    PM->>DB: Update status=stopped

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🚀 A pool of containers, so ready and wise,
Health checks keeping watch with supervisory eyes,
Round-robin dispatch with rate limits in place,
SQLite tracks every container's grace,
From spawn to idle cleanup, orchestrated with care! 🌊

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check — ✅ Passed (the title clearly summarizes the main change: adding a container pool manager module to the supervisor infrastructure)
  • Docstring Coverage — ✅ Passed (80.77% coverage; the required threshold is 80.00%)


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 40 code smells

[INFO] Recent monitoring activity:
Mon Feb 23 01:18:29 UTC 2026: Code review monitoring started
Mon Feb 23 01:18:29 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 40

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 40
  • VULNERABILITIES: 0

Generated on: Mon Feb 23 01:18:32 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is an excellent pull request that adds a significant and well-tested container pool manager feature. The new module is well-structured, and the inclusion of a comprehensive test suite is commendable. My review includes a critical fix to prevent error suppression and a suggestion to optimize database queries for better performance, both aligned with established repository rules.

pool_container_id=$(pool_select_container 2>/dev/null) || pool_container_id=""
if [[ -n "$pool_container_id" ]]; then
log_info "Container pool: selected $pool_container_id for $task_id (t1165.2)"
pool_record_dispatch "$pool_container_id" "$task_id" 2>/dev/null || true


Severity: high

The use of 2>/dev/null here suppresses all error output from pool_record_dispatch. This is risky because the function performs database operations that could fail for various reasons (e.g., database locked, disk full, schema mismatch). Suppressing these errors can hide critical problems and make debugging difficult. The project's general rules advise against suppressing stderr in such cases, and specifically recommend against blanket suppression of command errors to ensure issues remain visible for debugging. While || true is used to prevent script exit, 2>/dev/null should be removed to allow potential errors to be logged.

Suggested change
pool_record_dispatch "$pool_container_id" "$task_id" 2>/dev/null || true
pool_record_dispatch "$pool_container_id" "$task_id" || true
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
  2. In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.

Comment on lines +759 to +765
local total healthy unhealthy rate_limited stopped failed
total=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool;" 2>/dev/null || echo "0")
healthy=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='healthy';" 2>/dev/null || echo "0")
unhealthy=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='unhealthy';" 2>/dev/null || echo "0")
rate_limited=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='rate_limited';" 2>/dev/null || echo "0")
stopped=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='stopped';" 2>/dev/null || echo "0")
failed=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='failed';" 2>/dev/null || echo "0")


Severity: medium

These six separate database queries to get status counts can be consolidated into a single, more efficient query using conditional aggregation. This reduces the overhead of multiple sqlite3 process invocations and database accesses, making the function faster and more atomic. This aligns with the rule to efficiently fetch multiple aggregate statistics from SQLite using a single query that returns a delimited string. Additionally, the 2>/dev/null in the suggested code should be removed to avoid blanket suppression of errors, ensuring that potential issues like database locks or syntax errors remain visible for debugging, as per repository guidelines. The || echo fallback is sufficient to provide default values without hiding errors.

Suggested change
local total healthy unhealthy rate_limited stopped failed
total=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool;" 2>/dev/null || echo "0")
healthy=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='healthy';" 2>/dev/null || echo "0")
unhealthy=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='unhealthy';" 2>/dev/null || echo "0")
rate_limited=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='rate_limited';" 2>/dev/null || echo "0")
stopped=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='stopped';" 2>/dev/null || echo "0")
failed=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM container_pool WHERE status='failed';" 2>/dev/null || echo "0")
local total healthy unhealthy rate_limited stopped failed
local stats_line
stats_line=$(db -separator ' ' "$SUPERVISOR_DB" "
SELECT
COALESCE(COUNT(*), 0),
COALESCE(SUM(CASE WHEN status = 'healthy' THEN 1 ELSE 0 END), 0),
COALESCE(SUM(CASE WHEN status = 'unhealthy' THEN 1 ELSE 0 END), 0),
COALESCE(SUM(CASE WHEN status = 'rate_limited' THEN 1 ELSE 0 END), 0),
COALESCE(SUM(CASE WHEN status = 'stopped' THEN 1 ELSE 0 END), 0),
COALESCE(SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END), 0)
FROM container_pool;
" || echo "0 0 0 0 0 0")
read -r total healthy unhealthy rate_limited stopped failed <<<"$stats_line"
References
  1. To efficiently fetch multiple aggregate statistics from SQLite in a shell script, use a single query with subselects that returns a delimited string, and parse it using IFS='|' read.
  2. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/container-pool.sh:
- Around line 182-197: The indirect expansion ${!token_ref} can fail under set
-u when token_ref contains characters invalid for an env var name (e.g. paths
like "foo/bar"); before falling back to the env-var lookup in the token
resolution block, validate token_ref is a legal shell identifier (e.g. matches
/^[a-zA-Z_][a-zA-Z0-9_]*$/) and only then perform the indirect expansion to set
token_value, otherwise skip that branch (and keep the existing warning path).
Update the code around token_ref/token_value/docker_args in the token injection
logic (the block that attempts gopass then env var) to perform this identifier
check and avoid indirect expansion errors.
- Around line 483-490: The SELECT that assigns selected in the container-pool
script must exclude containers with active/incomplete dispatches: update the SQL
inside the db "$SUPERVISOR_DB" call (the query that selects from container_pool)
to add an exclusion such as AND NOT EXISTS (SELECT 1 FROM dispatch WHERE
dispatch.container_id = container_pool.id AND dispatch.completed_at IS NULL)
(or, if your schema uses a status column, AND NOT EXISTS (SELECT 1 FROM dispatch
WHERE dispatch.container_id = container_pool.id AND dispatch.status NOT IN
('succeeded','failed','cancelled'))). This change should be made where selected
is set so the round-robin never picks a container that currently has an
unfinished dispatch.
- Around line 567-577: pool_mark_rate_limited currently interpolates the
cooldown directly into the SQL timestamp expression; validate and sanitize the
cooldown before interpolation by ensuring it's a non-negative integer (e.g.
check with a digit-only regex or coerce with printf '%d' and fallback to the
default CONTAINER_POOL_RATE_LIMIT_COOLDOWN if invalid), optionally clamp to a
reasonable max, then use that sanitized numeric variable in the SQL (leave
sql_escape for container_id unchanged). Update the pool_mark_rate_limited
function to perform this validation/fallback and reference the sanitized
variable (e.g. cooldown_seconds) in the strftime('+${cooldown_seconds} seconds')
fragment.

In @.agents/scripts/supervisor/database.sh:
- Around line 767-776: ensure_db and init_db call the helper
_create_container_pool_schema which is defined in container-pool.sh but
database.sh doesn’t source that file; update database.sh to ensure the function
is available before calling it by adding a defensive check in ensure_db/init_db:
test whether _create_container_pool_schema is defined (e.g., via declare -f or
type) and if not, source container-pool.sh (or abort with a clear error), or
alternatively source container-pool.sh near the top of database.sh to guarantee
availability; reference ensure_db, init_db, and _create_container_pool_schema
when making the change.

In @.agents/scripts/supervisor/dispatch.sh:
- Around line 3222-3234: The pool selection is never wired into routing—after
getting pool_container_id you must set and persist the task's dispatch target
and short-circuit to a pool-specific dispatch branch so the selected container
is actually used. Concretely: when pool_select_container returns a non-empty
pool_container_id, set dispatch_target="pool:$pool_container_id", persist that
value to whatever task metadata/store the dispatcher later reads (so downstream
code sees it), and route the task through the pool-specific dispatch path that
sends the job to that container (and keep calling pool_record_dispatch to update
metrics). Alternatively, only invoke pool_select_container when the subsequent
routing supports a "pool:" dispatch_target; ensure any routing logic (the parts
that read dispatch_target) recognizes the "pool:<id>" prefix and sends the task
to that container.

Comment on lines +182 to +197
# Inject OAuth token if provided
if [[ -n "$token_ref" ]]; then
local token_value=""
# Try gopass first, then env var
if command -v gopass &>/dev/null; then
token_value=$(gopass show "$token_ref" 2>/dev/null || echo "")
fi
if [[ -z "$token_value" ]]; then
# Try as env var name
token_value="${!token_ref:-}"
fi
if [[ -n "$token_value" ]]; then
docker_args+=("-e" "CLAUDE_CODE_OAUTH_TOKEN=$token_value")
else
log_warn "OAuth token ref '$token_ref' could not be resolved — container may lack auth"
fi

⚠️ Potential issue | 🟠 Major

Guard env-var fallback for non‑identifier token refs.

If gopass isn’t available and --token-ref is a path (e.g., foo/bar), ${!token_ref} triggers an “invalid indirect expansion” under set -u and aborts pool_spawn. Only attempt indirect expansion for valid env var names.

Suggested fix
-		if [[ -z "$token_value" ]]; then
-			# Try as env var name
-			token_value="${!token_ref:-}"
-		fi
+		if [[ -z "$token_value" ]]; then
+			# Try as env var name only when token_ref is a valid identifier
+			if [[ "$token_ref" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]]; then
+				token_value="${!token_ref:-}"
+			fi
+		fi
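The identifier guard suggested above can be exercised in isolation. The sketch below is standalone and hypothetical — resolve_token_env is not the module's actual function — but it demonstrates why the regex check prevents the indirect-expansion failure under set -u when the ref is a gopass-style path:

```shell
# Guarded env-var fallback: only attempt indirect expansion when the
# ref is a legal shell identifier, so a path-like ref such as
# "tokens/claude" can never trigger a bad-substitution error.
set -u

resolve_token_env() {
	local token_ref="$1"
	local token_value=""
	if [[ "$token_ref" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]]; then
		token_value="${!token_ref:-}"  # safe: token_ref is a valid identifier
	fi
	printf '%s\n' "$token_value"
}

export MY_OAUTH_TOKEN="tok-123"
resolve_token_env "MY_OAUTH_TOKEN"  # tok-123
resolve_token_env "tokens/claude"   # empty: path-style refs are skipped safely
```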

Comment on lines +483 to +490
local selected
selected=$(db "$SUPERVISOR_DB" "
SELECT id FROM container_pool
WHERE status = 'healthy'
AND (rate_limit_until IS NULL OR rate_limit_until <= strftime('%Y-%m-%dT%H:%M:%SZ','now'))
$host_clause
ORDER BY COALESCE(last_dispatch_at, '1970-01-01T00:00:00Z') ASC
LIMIT 1;

⚠️ Potential issue | 🟠 Major

Exclude containers with active dispatches from selection.

Round‑robin should not pick a container that already has an incomplete dispatch. Otherwise a busy container can receive multiple tasks.

Suggested fix
 	selected=$(db "$SUPERVISOR_DB" "
 		SELECT id FROM container_pool
 		WHERE status = 'healthy'
 		  AND (rate_limit_until IS NULL OR rate_limit_until <= strftime('%Y-%m-%dT%H:%M:%SZ','now'))
 		  $host_clause
+		  AND NOT EXISTS (
+		      SELECT 1 FROM container_dispatch_log cdl
+		      WHERE cdl.container_id = container_pool.id AND cdl.completed_at IS NULL
+		  )
 		ORDER BY COALESCE(last_dispatch_at, '1970-01-01T00:00:00Z') ASC
 		LIMIT 1;
 	" 2>/dev/null || echo "")

Comment on lines +567 to +577
pool_mark_rate_limited() {
local container_id="$1"
local cooldown="${2:-$CONTAINER_POOL_RATE_LIMIT_COOLDOWN}"

db "$SUPERVISOR_DB" "
UPDATE container_pool
SET status = 'rate_limited',
rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${cooldown} seconds'),
rate_limit_count = rate_limit_count + 1,
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$container_id")';

⚠️ Potential issue | 🟠 Major

Validate cooldown to prevent SQL injection / malformed timestamps.

cooldown is interpolated directly into SQL; a non‑numeric value can break the query or inject SQL. Validate as an integer and fall back to the default.

Suggested fix
 pool_mark_rate_limited() {
 	local container_id="$1"
 	local cooldown="${2:-$CONTAINER_POOL_RATE_LIMIT_COOLDOWN}"
+
+	if ! [[ "$cooldown" =~ ^[0-9]+$ ]]; then
+		log_warn "Invalid cooldown '$cooldown' — falling back to default"
+		cooldown="$CONTAINER_POOL_RATE_LIMIT_COOLDOWN"
+	fi
 
 	db "$SUPERVISOR_DB" "
 		UPDATE container_pool

Comment on lines +767 to +776

```shell
	# Migrate: create container_pool tables if missing (t1165.2)
	# Container pool manager — spawn/destroy containers, health checks,
	# round-robin dispatch, per-container rate limit tracking.
	local has_container_pool
	has_container_pool=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='container_pool';" 2>/dev/null || echo "0")
	if [[ "$has_container_pool" -eq 0 ]]; then
		log_info "Creating container_pool tables (t1165.2)..."
		_create_container_pool_schema
		log_success "Created container_pool and container_dispatch_log tables (t1165.2)"
	fi
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Find all scripts that source database.sh
echo "=== Scripts sourcing database.sh ==="
rg -n "source .*database\.sh|^\s*\. .*database\.sh" -g '*.sh' --no-heading

echo -e "\n=== Scripts sourcing container-pool.sh ==="
rg -n "source .*container-pool\.sh|^\s*\. .*container-pool\.sh" -g '*.sh' --no-heading

echo -e "\n=== Entry points (main scripts in supervisor/) ==="
fd -t f -e sh "^[^.]" .agents/scripts/supervisor/ | head -20

echo -e "\n=== Check supervisor/ directory structure ==="
ls -la .agents/scripts/supervisor/ | grep -E "\.sh$"
```

Repository: marcusquinn/aidevops

Length of output: 3856


🏁 Script executed:

```shell
#!/bin/bash
echo "=== Check backfill-blocked-comments.sh calls to ensure_db/init_db ==="
rg -n "ensure_db|init_db" .agents/scripts/supervisor/backfill-blocked-comments.sh

echo -e "\n=== Check test-dual-cli-e2e.sh calls to ensure_db/init_db (sample) ==="
rg -n "ensure_db|init_db" tests/test-dual-cli-e2e.sh | head -10

echo -e "\n=== Check test-dispatch-claude-cli.sh calls to ensure_db/init_db (sample) ==="
rg -n "ensure_db|init_db" tests/test-dispatch-claude-cli.sh | head -10

echo -e "\n=== Check if _create_container_pool_schema is conditionally defined ==="
rg -n "_create_container_pool_schema" .agents/scripts/supervisor/database.sh

echo -e "\n=== Check sourcing order in supervisor-helper.sh ==="
sed -n '210,245p' .agents/scripts/supervisor-helper.sh
```

Repository: marcusquinn/aidevops

Length of output: 2290


🏁 Script executed:

```shell
#!/bin/bash
echo "=== Check for module-level calls to ensure_db/init_db in database.sh ==="
# Look for calls outside of function definitions
rg -n "ensure_db|init_db" .agents/scripts/supervisor/database.sh | head -20

echo -e "\n=== Look at the actual function definitions and calls in database.sh ==="
rg -B2 -A5 "^ensure_db\(\)|^init_db\(\)" .agents/scripts/supervisor/database.sh | head -40

echo -e "\n=== Check if backfill-blocked-comments.sh actually needs container-pool ==="
cat .agents/scripts/supervisor/backfill-blocked-comments.sh
```

Repository: marcusquinn/aidevops

Length of output: 4504


🏁 Script executed:

```shell
#!/bin/bash
echo "=== Context around lines 941 and 973 (ensure_db calls) ==="
sed -n '935,945p' .agents/scripts/supervisor/database.sh
echo "---"
sed -n '968,978p' .agents/scripts/supervisor/database.sh

echo -e "\n=== Check what function/block contains these calls ==="
sed -n '900,950p' .agents/scripts/supervisor/database.sh | cat -n
```

Repository: marcusquinn/aidevops

Length of output: 2950


🏁 Script executed:

```shell
#!/bin/bash
echo "=== Check end of database.sh for module-level invocations ==="
tail -50 .agents/scripts/supervisor/database.sh

echo -e "\n=== Check if database.sh sources container-pool.sh internally ==="
head -50 .agents/scripts/supervisor/database.sh | grep -i "source.*container-pool"
```

Repository: marcusquinn/aidevops

Length of output: 1368


Potential scoping issue: `_create_container_pool_schema` is called by `ensure_db`/`init_db` but may not be defined at call time.

`ensure_db` (lines 767-776) and `init_db` (line 930) invoke `_create_container_pool_schema`, which is defined in container-pool.sh. However, database.sh does not source container-pool.sh internally. While supervisor-helper.sh correctly sources container-pool.sh after database.sh, direct consumers of database.sh (such as backfill-blocked-comments.sh or test utilities) may source only database.sh without container-pool.sh. This creates a fragile dependency: if any code path manually sources database.sh and later calls `ensure_db` or `init_db`, it will fail with an undefined-function error.

Recommendation: either (1) source container-pool.sh at the start of database.sh to guarantee the function is available, (2) move `_create_container_pool_schema` into database.sh, or (3) add a defensive check in `ensure_db`/`init_db` that validates the function exists before calling it.
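A minimal sketch of option (3), as it might look near the top of database.sh. The guard name `_require_container_pool_schema_fn` and the `SUPERVISOR_SCRIPT_DIR` variable are hypothetical, not names from the PR:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: guard before calling _create_container_pool_schema.
# Assumes container-pool.sh sits in the same directory as database.sh.

SUPERVISOR_SCRIPT_DIR="${SUPERVISOR_SCRIPT_DIR:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}"

_require_container_pool_schema_fn() {
	# declare -F exits 0 only when the function is already defined
	if ! declare -F _create_container_pool_schema >/dev/null; then
		if [[ -f "$SUPERVISOR_SCRIPT_DIR/container-pool.sh" ]]; then
			# shellcheck source=/dev/null
			source "$SUPERVISOR_SCRIPT_DIR/container-pool.sh"
		else
			echo "ERROR: _create_container_pool_schema unavailable; source container-pool.sh first" >&2
			return 1
		fi
	fi
	return 0
}

# ensure_db/init_db would then call:
#   _require_container_pool_schema_fn || return 1
#   _create_container_pool_schema
```

This keeps database.sh usable on its own while still failing loudly if the pool module is genuinely missing.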

🤖 Prompt for AI Agents

```text
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/database.sh around lines 767 - 776, ensure_db and
init_db call the helper _create_container_pool_schema which is defined in
container-pool.sh but database.sh doesn’t source that file; update database.sh
to ensure the function is available before calling it by adding a defensive
check in ensure_db/init_db: test whether _create_container_pool_schema is
defined (e.g., via declare -f or type) and if not, source container-pool.sh (or
abort with a clear error), or alternatively source container-pool.sh near the
top of database.sh to guarantee availability; reference ensure_db, init_db, and
_create_container_pool_schema when making the change.
```

Comment on lines +3222 to +3234

```shell
	# t1165.2: Container pool dispatch — if pool has healthy containers,
	# select one via round-robin and record the dispatch. The container ID
	# is stored in dispatch_target as "pool:<container_id>" for tracking.
	# Pool dispatch is opt-in: only used when SUPERVISOR_USE_CONTAINER_POOL=true
	# or when the task has dispatch_target=pool.
	local pool_container_id=""
	if [[ "${SUPERVISOR_USE_CONTAINER_POOL:-false}" == "true" ]]; then
		pool_container_id=$(pool_select_container 2>/dev/null) || pool_container_id=""
		if [[ -n "$pool_container_id" ]]; then
			log_info "Container pool: selected $pool_container_id for $task_id (t1165.2)"
			pool_record_dispatch "$pool_container_id" "$task_id" 2>/dev/null || true
		fi
	fi
```

⚠️ Potential issue | 🟠 Major

Pool selection is recorded but never used for routing.

`pool_select_container` sets `pool_container_id` and updates pool stats, yet the dispatch still follows the normal local/remote path and never uses the selected container. This makes pool metrics incorrect, and the comment about `dispatch_target=pool` isn't honored. Either wire the selected container into a pool-specific dispatch branch (and persist it on the task), or gate selection so it only runs when an actual pool dispatch path exists.
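A hedged sketch of the first option. `dispatch_to_container` and `dispatch_normal` are hypothetical placeholders for a pool-specific path and the existing local/remote path, and the `tasks` table/column names are assumptions, not confirmed schema from this PR:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: make the selected container actually drive routing.
# pool_select_container, pool_record_dispatch, db, and sql_escape are assumed
# to be the helpers from container-pool.sh / database.sh.

route_with_pool() {
	local task_id="$1"
	local pool_container_id=""

	if [[ "${SUPERVISOR_USE_CONTAINER_POOL:-false}" == "true" ]]; then
		pool_container_id=$(pool_select_container 2>/dev/null) || pool_container_id=""
	fi

	if [[ -n "$pool_container_id" ]]; then
		# Persist the target so downstream readers of dispatch_target see it
		db "$SUPERVISOR_DB" "UPDATE tasks
			SET dispatch_target = 'pool:$(sql_escape "$pool_container_id")'
			WHERE id = '$(sql_escape "$task_id")';"
		pool_record_dispatch "$pool_container_id" "$task_id" || true
		dispatch_to_container "$pool_container_id" "$task_id"
		return $?
	fi

	dispatch_normal "$task_id"  # existing local/remote path
}
```

With this shape, pool metrics only increment when the task genuinely goes to the selected container, and any code reading `dispatch_target` can recognize the `pool:<id>` prefix.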

🤖 Prompt for AI Agents

```text
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/dispatch.sh around lines 3222 - 3234, The pool
selection is never wired into routing—after getting pool_container_id you must
set and persist the task's dispatch target and short-circuit to a pool-specific
dispatch branch so the selected container is actually used. Concretely: when
pool_select_container returns a non-empty pool_container_id, set
dispatch_target="pool:$pool_container_id", persist that value to whatever task
metadata/store the dispatcher later reads (so downstream code sees it), and
route the task through the pool-specific dispatch path that sends the job to
that container (and keep calling pool_record_dispatch to update metrics).
Alternatively, only invoke pool_select_container when the subsequent routing
supports a "pool:" dispatch_target; ensure any routing logic (the parts that
read dispatch_target) recognizes the "pool:<id>" prefix and sends the task to
that container.
```
