diff --git a/CLAUDE.md b/CLAUDE.md index 0a4af5514f..8bf0f5af82 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,16 +6,16 @@ - **Python**: 3.14+ (PEP 649 native lazy annotations) - **License**: BUSL-1.1 (converts to Apache 2.0 on 2030-02-27) - **Layout**: `src/ai_company/` (src layout), `tests/` (unit/integration/e2e) -- **Design**: [DESIGN_SPEC.md](DESIGN_SPEC.md) (full high-level spec) +- **Design**: [DESIGN_SPEC.md](DESIGN_SPEC.md) (pointer to `docs/design/` pages) ## Design Spec (MANDATORY) -- **ALWAYS read `DESIGN_SPEC.md`** before implementing any feature or planning any issue +- **ALWAYS read the relevant `docs/design/` page** before implementing any feature or planning any issue. [DESIGN_SPEC.md](DESIGN_SPEC.md) is a pointer file linking to the 7 design pages. - The design spec is the **starting point** for architecture, data models, and behavior - If implementation deviates from the spec (better approach found, scope evolved, etc.), **alert the user and explain why** — user decides whether to proceed or update the spec - Do NOT silently diverge — every deviation needs explicit user approval -- When a spec section is referenced (e.g. "Section 10.2"), read that section verbatim before coding -- When approved deviations occur, update `DESIGN_SPEC.md` to reflect the new reality +- When a spec topic is referenced (e.g. "the Agents page" or "the Engine page's Crash Recovery section"), read the relevant `docs/design/` page before coding +- When approved deviations occur, update the relevant `docs/design/` page to reflect the new reality ## Planning (MANDATORY) @@ -45,11 +45,15 @@ uv run mkdocs serve # local docs preview (http://127.0.0. ## Documentation - **Docs source**: `docs/` (MkDocs markdown + mkdocstrings auto-generated API reference) +- **Design spec**: `docs/design/` (7 pages: index, agents, organization, communication, engine, memory, operations) +- **Architecture**: `docs/architecture/` (overview, tech-stack, decision log) +- **Roadmap**: `docs/roadmap/` (status, open questions, future vision) +- **Reference**: `docs/reference/` (research, standards) - **Landing page**: `site/` (Astro, Concept C hybrid design) - **Config**: `mkdocs.yml` at repo root - **API reference**: auto-generated from docstrings via mkdocstrings + Griffe (AST-based, no imports) - **CI**: `.github/workflows/pages.yml` — builds Astro landing + MkDocs docs, merges, deploys to GitHub Pages -- **Architecture decision**: `docs/decisions/ADR-003-documentation-architecture.md` +- **Architecture decisions**: `docs/architecture/decisions.md` (decision log) - **Dependencies**: `docs` group in `pyproject.toml` (`mkdocs-material`, `mkdocstrings[python]`, `griffe-pydantic`) ## Docker @@ -86,8 +90,8 @@ src/ai_company/ core/ # Shared domain models, base classes, and resilience config (RetryConfig, RateLimiterConfig) engine/ # Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, task lifecycle, recovery, shutdown, workspace isolation, coordination error classification, and prompt policy validation hr/ # HR engine: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping) - memory/ # Persistent agent memory (Mem0 initial, custom stack future — ADR-001), retrieval pipeline (ranking, injection, context formatting, non-inferable filtering), shared org memory (org/), consolidation/archival (consolidation/) - persistence/ # Operational data persistence — pluggable PersistenceBackend protocol, SQLite initial (§7.6) + memory/ # Persistent agent memory (Mem0 initial, custom stack future — see Decision Log), retrieval pipeline (ranking, injection, context formatting, non-inferable filtering), shared org memory (org/), consolidation/archival (consolidation/) + persistence/ # Operational data persistence — pluggable PersistenceBackend protocol, SQLite initial (see Memory & Persistence design page) observability/ # Structured logging, correlation tracking, log sinks providers/ # LLM provider abstraction (LiteLLM adapter) security/ # SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume) @@ -144,7 +148,7 @@ src/ai_company/ - **Timeout**: 30 seconds per test - **Parallelism**: `pytest-xdist` via `-n auto` — **ALWAYS** include `-n auto` when running pytest, never run tests sequentially - **Parametrize**: Prefer `@pytest.mark.parametrize` for testing similar cases -- **Vendor-agnostic everywhere**: NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, `large`/`medium`/`small` as aliases. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list (listing supported providers), (2) `.claude/` skill/agent files, (3) third-party import paths/module names (e.g. `litellm.types.llms.openai`). Tests must use `test-provider`, `test-small-001`, etc. +- **Vendor-agnostic everywhere**: NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, `large`/`medium`/`small` as aliases. Vendor names may only appear in: (1) Operations design page provider list (`docs/design/operations.md`), (2) `.claude/` skill/agent files, (3) third-party import paths/module names (e.g. `litellm.types.llms.openai`). Tests must use `test-provider`, `test-small-001`, etc. ## Git @@ -152,7 +156,7 @@ src/ai_company/ - **Enforced by**: commitizen (commit-msg hook) - **Branches**: `/` from main - **Pre-commit hooks**: trailing-whitespace, end-of-file-fixer, check-yaml, check-toml, check-json, check-merge-conflict, check-added-large-files, no-commit-to-branch (main), ruff check+format, gitleaks -- **GitHub issue queries**: use `gh issue list` via Bash (not MCP tools) — MCP `list_issues` returns `null` for milestone data +- **GitHub issue queries**: use `gh issue list` via Bash (not MCP tools) — MCP `list_issues` has unreliable field data - **PR issue references**: preserve existing `Closes #NNN` references — never remove unless explicitly asked ## Post-Implementation (MANDATORY) diff --git a/DESIGN_SPEC.md b/DESIGN_SPEC.md index 26204c5996..dbf14b6e9a 100644 --- a/DESIGN_SPEC.md +++ b/DESIGN_SPEC.md @@ -4,3524 +4,27 @@ --- -## Table of Contents - -1. [Vision & Philosophy](#1-vision--philosophy) — 1.4 MVP Definition, 1.5 Configuration Philosophy -2. [Core Concepts](#2-core-concepts) -3. [Agent System](#3-agent-system) -4. [Company Structure](#4-company-structure) -5. [Communication Architecture](#5-communication-architecture) — 5.6 Conflict Resolution, 5.7 Meeting Protocol -6. [Task & Workflow Engine](#6-task--workflow-engine) — 6.5 Execution Loop, 6.6 Crash Recovery, **6.7 Graceful Shutdown**, **6.8 Workspace Isolation**, **6.9 Task Decomposability & Coordination Topology** -7. [Memory & Persistence](#7-memory--persistence) — 7.4 Shared Org Memory (Research Directions), **7.5 Memory Backend Protocol**, **7.6 Operational Data Persistence** -8. [HR & Workforce Management](#8-hr--workforce-management) -9. [Model Provider Layer](#9-model-provider-layer) -10. [Cost & Budget Management](#10-cost--budget-management) -11. [Tool & Capability System](#11-tool--capability-system) — **11.1.3 MCP Integration**, **11.1.4 Action Type System**, 11.3 Progressive Trust -12. [Security & Approval System](#12-security--approval-system) — 12.4 Approval Timeout -13. [Human Interaction Layer](#13-human-interaction-layer) -14. [Templates & Builder](#14-templates--builder) -15. [Technical Architecture](#15-technical-architecture) — 15.5 Engineering Conventions -16. [Research & Prior Art](#16-research--prior-art) — **16.3 Agent Scaling Research**, 16.4 Build vs Fork Decision -17. [Open Questions & Risks](#17-open-questions--risks) -18. [Backlog & Future Vision](#18-backlog--future-vision) - ---- - -## 1. Vision & Philosophy - -### 1.1 Core Vision - -Build a **configurable AI company framework** where AI agents operate within a virtual organization. Each agent has a defined role, personality, skills, memory, and model backend. The company can be configured from a 2-person startup to a 50+ enterprise, handling software development, business operations, creative work, or any domain. - -### 1.2 Design Principles - -| Principle | Description | -|-----------|-------------| -| **Configuration over Code** | Company structures, roles, and workflows defined via config, not hardcoded | -| **Provider Agnostic** | Any LLM backend: cloud APIs, OpenRouter, Ollama, custom endpoints | -| **Composable** | Mix and match roles, teams, workflows. Build any type of company | -| **Observable** | Every agent action, communication, and decision is logged and visible | -| **Autonomy Spectrum** | From full human oversight to fully autonomous operation | -| **Cost Aware** | Built-in budget tracking, model routing optimization, spending controls | -| **Extensible** | Plugin architecture for new roles, tools, providers, and workflows | -| **Local First** | Runs locally with option to expose on network or host remotely later | - -### 1.3 What This Is NOT - -- Not a chatbot or conversational AI product -- Not locked to software development only (though that is a primary use case) -- Not a wrapper around a single model or provider -- Not a toy/demo - designed for real, production-quality output - -### 1.4 MVP Definition - -The MVP validates the core hypothesis: **a single agent can complete a real task end-to-end** within the framework's architecture. - -**MVP scope:** - -- Single agent executing tasks via the **ReAct** execution loop -- **Subprocess sandbox** for file system and git tools (Docker optional for code execution) -- **Fail-and-reassign** crash recovery -- **Cooperative graceful shutdown** with configurable timeout -- **Proxy metrics**: turns/tokens/cost per task -- System prompt builder with agent personality injection - -> **How to read this spec:** Sections describe the full vision. The full design is documented upfront to inform architecture decisions — protocol interfaces are designed even for features that are not yet implemented. - -> **Implementation snapshot (2026-03-10):** -> All major subsystems are implemented: config/core models, provider layer, single-agent engine, multi-agent orchestration (message bus, delegation, loop prevention, conflict resolution, meeting protocols), API surface (REST + WebSocket), Docker sandbox, MCP bridge, code runner, HR engine (hiring/firing/onboarding/offboarding/registry, performance tracking, promotion/demotion), memory layer (retrieval pipeline, shared org memory, consolidation/archival — backend selected per [ADR-001](docs/decisions/ADR-001-memory-layer.md)), persistence (SQLite backend, audit entry persistence), budget enforcement (BudgetEnforcer, cost tiers, quota/subscription tracking, CFO cost optimization), SecOps agent (rule engine, audit log, output scanner, output scan response policies, risk classifier, ToolInvoker integration), progressive trust (4 strategies behind TrustStrategy protocol), autonomy levels (presets, resolver, change strategy), and approval timeout policies (4 policies, park/resume service, risk tier classifier). -> - **Remaining:** Mem0 adapter backend, approval workflow gates. - -### 1.5 Configuration Philosophy - -The framework follows **progressive disclosure** — users only configure what they need: - -1. **Templates** handle 90% of users — pick a template, override 2–3 values, go -2. **Minimal config** for custom setups — everything has sensible defaults -3. **Full config** for power users — every knob exposed but none required - -**Minimal custom company** (all other settings use defaults): - -```yaml -company: - name: "Acme Corp" - template: "startup" - budget_monthly: 50.00 -``` - -All configuration systems in the framework are **pluggable** — strategies, backends, and policies are swappable via protocol interfaces without modifying existing code. Sensible defaults are chosen for each, documented in the relevant section alongside the full configuration reference. - ---- - -## 2. Core Concepts - -### 2.1 Glossary - -| Term | Definition | -|------|-----------| -| **Agent** | An AI entity with a role, personality, model backend, memory, and tool access. The primary entity in the framework. Within a company context, agents serve as the company's employees. | -| **Company** | A configured organization of agents with structure, hierarchy, and workflows | -| **Department** | A grouping of related roles (Engineering, Product, Design, Operations, etc.) | -| **Role** | A job definition with required skills, responsibilities, authority level, and tool access | -| **Skill** | A capability an agent possesses (coding, writing, analysis, design, etc.) | -| **Task** | A unit of work assigned to one or more agents | -| **Project** | A collection of related tasks with a goal, deadline, and assigned team | -| **Meeting** | A structured multi-agent interaction for decisions, reviews, or planning | -| **Artifact** | Any output produced by agents: code, documents, designs, reports, etc. | - -### 2.2 Entity Relationships - -```text -Company - ├── Departments[] - │ ├── Department Head (Agent) - │ └── Members (Agent[]) - ├── Projects[] - │ ├── Tasks[] - │ │ ├── Assigned Agent(s) - │ │ ├── Artifacts[] - │ │ └── Status / History - │ └── Team (Agent[]) - ├── Config - │ ├── Autonomy Level - │ ├── Budget - │ ├── Communication Settings - │ └── Tool Permissions - └── HR Registry - ├── Active Agents[] - ├── Available Roles[] - └── Hiring Queue -``` - ---- - -## 3. Agent System - -### 3.1 Agent Identity Card - -Every agent has a comprehensive identity. At the design level, agent data splits into two layers: - -- **Config (immutable)**: identity, personality, skills, model preferences, tool permissions, authority. Defined at hire time, changed only by explicit reconfiguration. Represented as frozen Pydantic models. -- **Runtime state (mutable-via-copy)**: current status, active task, conversation history, execution metrics. Evolves during agent operation. Represented as Pydantic models using `model_copy(update=...)` for state transitions — never mutated in place. - -> **Current state:** Both layers are implemented. Config layer: `AgentIdentity` (frozen, in `core/agent.py`). Runtime state layer: `TaskExecution`, `AgentContext`, `AgentContextSnapshot` (frozen + `model_copy`, in `engine/`). `AgentEngine` orchestrates execution via `run()`. All identifier/name fields use `NotBlankStr` (from `core.types`) for automatic whitespace rejection; optional identifier fields use `NotBlankStr | None`; tuple fields use `tuple[NotBlankStr, ...]` for per-element validation. - -**Personality dimensions** split into two tiers: - -- **Big Five (OCEAN-variant)** — floats (0.0–1.0) used for internal compatibility scoring only (not injected into prompts). `stress_response` replaces traditional neuroticism with inverted polarity (1.0 = very calm). Scored by `core/personality.py`. -- **Behavioral enums** — injected into system prompts as natural-language labels that LLMs respond to: - - `DecisionMakingStyle`: `analytical`, `intuitive`, `consultative`, `directive` - - `CollaborationPreference`: `independent`, `pair`, `team` - - `CommunicationVerbosity`: `terse`, `balanced`, `verbose` - - `ConflictApproach`: `avoid`, `accommodate`, `compete`, `compromise`, `collaborate` (Thomas-Kilmann model) - -```yaml -# --- Config layer — AgentIdentity (frozen) --- -agent: - id: "uuid" - name: "Sarah Chen" - role: "Senior Backend Developer" - department: "Engineering" - level: "Senior" # Junior, Mid, Senior, Lead, Principal, Director, VP, C-Suite - personality: - traits: - - analytical - - detail-oriented - - pragmatic - communication_style: "concise and technical" - risk_tolerance: "low" # low, medium, high - creativity: "medium" # low, medium, high - description: > - Sarah is a methodical backend developer who prioritizes clean architecture - and thorough testing. She pushes back on shortcuts and advocates for - proper error handling. Prefers Pythonic solutions. - # Big Five (OCEAN-variant) dimensions — internal scoring (0.0-1.0) - openness: 0.4 # curiosity, creativity - conscientiousness: 0.9 # thoroughness, reliability - extraversion: 0.3 # assertiveness, sociability - agreeableness: 0.5 # cooperation, empathy - stress_response: 0.75 # emotional stability (1.0 = very calm) - # Behavioral enums — injected into system prompts - decision_making: "analytical" # analytical, intuitive, consultative, directive - collaboration: "independent" # independent, pair, team - verbosity: "balanced" # terse, balanced, verbose - conflict_approach: "compromise" # avoid, accommodate, compete, compromise, collaborate - skills: - primary: - - python - - litestar - - postgresql - - system-design - secondary: - - docker - - redis - - testing - model: - provider: "example-provider" # example provider - model_id: "example-medium-001" # example model — actual models TBD per agent/role - temperature: 0.3 - max_tokens: 8192 - fallback_model: "openrouter/example-medium-001" # example fallback - memory: - type: "persistent" # persistent, project, session, none - retention_days: null # null = forever - tools: - access_level: "standard" # sandboxed | restricted | standard | elevated | custom - allowed: - - file_system - - git - - code_execution - - web_search - - terminal - denied: - - deployment - - database_admin - authority: - can_approve: ["junior_dev_tasks", "code_reviews"] - reports_to: "engineering_lead" - can_delegate_to: ["junior_developers"] - budget_limit: 5.00 # max USD per task - autonomy_level: null # optional: full, semi, supervised, locked (overrides department/company default, §12.2) - hiring_date: "2026-02-27" - status: "active" # active, on_leave, terminated (on config model today) - -# --- Runtime state — engine/ (frozen + model_copy) --- -# TaskExecution wraps Task with evolving execution state: -# status: TaskStatus # evolves via with_transition() -# transition_log: tuple[StatusTransition, ...] -# accumulated_cost: TokenUsage # running totals -# turn_count: int # LLM turns completed -# started_at / completed_at: AwareDatetime | None -# -# AgentContext wraps AgentIdentity + TaskExecution with: -# execution_id: str # uuid4, unique per run -# conversation: tuple[ChatMessage, ...] -# accumulated_cost: TokenUsage # running totals -# turn_count: int # LLM turns completed -# max_turns: int # hard limit (default 20) -# started_at: AwareDatetime -``` - -### 3.2 Seniority & Authority Levels - -| Level | Authority | Typical Model | Cost Tier | -|-------|----------|---------------|-----------| -| Intern/Junior | Execute assigned tasks only | small / local | $ | -| Mid | Execute + suggest improvements | medium / local | $$ | -| Senior | Execute + design + review others | medium / large | $$$ | -| Lead | All above + approve + delegate | large / medium | $$$ | -| Principal/Staff | All above + architectural decisions | large | $$$$ | -| Director | Strategic decisions + budget authority | large | $$$$ | -| VP | Department-wide authority | large | $$$$ | -| C-Suite (CEO/CTO/CFO) | Company-wide authority + final approvals | large | $$$$ | - -### 3.3 Role Catalog (Extensible) - -#### C-Suite / Executive - -- **CEO** - Overall strategy, final decision authority, cross-department coordination -- **CTO** - Technical vision, architecture decisions, technology choices -- **CFO** - Budget management, cost optimization, resource allocation -- **COO** - Operations, process optimization, workflow management -- **CPO** - Product strategy, roadmap, feature prioritization - -#### Product & Design - -- **Product Manager** - Requirements, user stories, prioritization, stakeholder communication -- **UX Designer** - User research, wireframes, user flows, usability -- **UI Designer** - Visual design, component design, design systems -- **UX Researcher** - User interviews, analytics, A/B test design -- **Technical Writer** - Documentation, API docs, user guides - -#### Engineering - -- **Software Architect** - System design, technology decisions, patterns -- **Frontend Developer** (Junior/Mid/Senior) - UI implementation, components, state management -- **Backend Developer** (Junior/Mid/Senior) - APIs, business logic, databases -- **Full-Stack Developer** (Junior/Mid/Senior) - End-to-end implementation -- **DevOps/SRE Engineer** - Infrastructure, CI/CD, monitoring, deployment -- **Database Engineer** - Schema design, query optimization, migrations -- **Security Engineer** - Security audits, vulnerability assessment, secure coding - -#### Quality Assurance - -- **QA Lead** - Test strategy, quality gates, release readiness -- **QA Engineer** - Test plans, manual testing, bug reporting -- **Automation Engineer** - Test frameworks, CI integration, E2E tests -- **Performance Engineer** - Load testing, profiling, optimization - -#### Data & Analytics - -- **Data Analyst** - Metrics, dashboards, business intelligence -- **Data Engineer** - Pipelines, ETL, data infrastructure -- **ML Engineer** - Model training, inference, MLOps - -#### Operations & Support - -- **Project Manager** - Timelines, dependencies, risk management, status tracking -- **Scrum Master** - Agile ceremonies, impediment removal, team health -- **HR Manager** - Hiring recommendations, team composition, performance tracking -- **Security Operations** - Request validation, safety checks, approval workflows - -#### Creative & Marketing - -- **Content Writer** - Blog posts, marketing copy, social media -- **Brand Strategist** - Messaging, positioning, competitive analysis -- **Growth Marketer** - Campaigns, analytics, conversion optimization - -### 3.4 Dynamic Roles - -Users can define custom roles via config: - -```yaml -custom_roles: - - name: "Blockchain Developer" - department: "Engineering" - skills: ["solidity", "web3", "smart-contracts"] - system_prompt_template: "blockchain_dev.md" - authority_level: "senior" - suggested_model: "large" -``` - ---- - -## 4. Company Structure - -### 4.1 Company Types (Templates) - -| Template | Size | Roles | Use Case | -|----------|------|-------|----------| -| **Solo Founder** | 1-2 | CEO + Full-Stack Dev | Quick prototypes, solo projects | -| **Startup** | 3-5 | CEO, CTO, 2 Devs, PM | Small projects, MVPs | -| **Dev Shop** | 5-10 | Lead, Sr Dev, Jr Devs, QA, DevOps | Software development focus | -| **Product Team** | 8-15 | PM, Designer, Devs, QA, Data Analyst | Product-focused development | -| **Agency** | 10-20 | Multiple PMs, Designers, Devs, Content | Client work, multiple projects | -| **Full Company** | 20-50+ | All departments, full hierarchy | Enterprise simulation | -| **Research Lab** | 5-10 | Lead Researcher, Analysts, Engineers | Research and analysis | -| **Custom** | Any | User-defined | Anything | - -### 4.2 Organizational Hierarchy - -```text - ┌─────────┐ - │ CEO │ - └────┬────┘ - ┌──────────────┼──────────────┐ - ┌────┴────┐ ┌────┴────┐ ┌─────┴────┐ - │ CTO │ │ CPO │ │ CFO │ - └────┬────┘ └────┬────┘ └────┬─────┘ - │ │ │ - ┌─────────┼────────┐ │ Budget Mgmt - │ │ │ │ -┌───┴───┐ ┌──┴──┐ ┌───┴──┐ ├── Product Managers -│ Eng │ │ QA │ │DevOps│ ├── UX/UI Designers -│ Lead │ │Lead │ │ Lead │ └── Tech Writers -└───┬───┘ └──┬──┘ └──┬───┘ - │ │ │ - Sr Devs QA Eng SRE - Jr Devs Auto Eng -``` - -### 4.3 Department Configuration - -```yaml -departments: - engineering: - head: "cto" - budget_percent: 60 - teams: - - name: "backend" - lead: "backend_lead" - members: ["sr_backend_1", "mid_backend_1", "jr_backend_1"] - - name: "frontend" - lead: "frontend_lead" - members: ["sr_frontend_1", "mid_frontend_1"] - product: - head: "cpo" - budget_percent: 20 - teams: - - name: "core" - lead: "pm_lead" - members: ["pm_1", "ux_designer_1", "ui_designer_1"] - operations: - head: "coo" - budget_percent: 10 - teams: - - name: "devops" - lead: "devops_lead" - members: ["sre_1"] - quality: - head: "qa_lead" - budget_percent: 10 - teams: - - name: "qa" - lead: "qa_lead" - members: ["qa_engineer_1", "automation_engineer_1"] -``` - -### 4.4 Dynamic Scaling - -The company can dynamically grow or shrink: - -- **Auto-scale**: HR agent detects workload increase, proposes new hires -- **Manual scale**: Human adds/removes agents via config or UI -- **Budget-driven**: CFO agent caps headcount based on budget constraints -- **Skill-gap**: HR analyzes team capabilities, identifies missing skills, proposes hires - ---- - -## 5. Communication Architecture - -### 5.1 Communication Patterns - -The system supports multiple communication patterns, configurable per company: - -#### Pattern 1: Event-Driven Message Bus (Recommended Default) - -```text -┌──────────┐ ┌─────────────────┐ ┌──────────┐ -│ Agent A │────▶│ Message Bus │◀────│ Agent B │ -└──────────┘ │ (Topics/Queues) │ └──────────┘ - └────────┬────────┘ - │ - ┌───────────┼───────────┐ - ▼ ▼ ▼ - #engineering #product #all-hands - #code-review #design #incidents -``` - -- Agents publish to topics, subscribe to relevant channels -- Async by default, enables parallelism -- Decoupled - agents don't need to know about each other -- Natural audit trail of all communications -- **Best for**: Most scenarios, scales well, production-ready pattern - -#### Pattern 2: Hierarchical Delegation - -```text -CEO ──▶ CTO ──▶ Eng Lead ──▶ Sr Dev ──▶ Jr Dev - │ - └──▶ QA Lead ──▶ QA Eng -``` - -- Tasks flow down the hierarchy, results flow up -- Each level can decompose/refine tasks before delegating -- Authority enforcement built into the flow -- **Best for**: Structured organizations, clear chains of command - -#### Pattern 3: Meeting-Based - -```text -┌─────────────────────────────────┐ -│ Sprint Planning │ -│ PM + CTO + Devs + QA + Design │ -│ Output: Sprint backlog │ -└─────────────────────────────────┘ - │ -┌────────┴────────┐ -│ Daily Standup │ -│ Devs + QA │ -│ Output: Status │ -└─────────────────┘ -``` - -- Structured multi-agent conversations at defined intervals -- Standup, sprint planning, retrospective, design review, code review -- **Best for**: Agile workflows, decision-making, alignment - -#### Pattern 4: Hybrid (Recommended for Full Company) - -Combines all three: -- **Message bus** for async daily work and notifications -- **Hierarchical delegation** for task assignment and approvals -- **Meetings** for cross-team decisions and planning ceremonies - -### 5.2 Communication Standards - -The framework should align with emerging industry standards: - -- **A2A Protocol** (Agent-to-Agent, Linux Foundation) - For inter-agent task delegation, capability discovery via Agent Cards, and structured task lifecycle management -- **MCP** (Model Context Protocol, Agentic AI Foundation / Linux Foundation) - For agent-to-tool integration, providing standardized tool discovery and invocation - -### 5.3 Message Format - -```json -{ - "id": "msg-uuid", - "timestamp": "2026-02-27T10:30:00Z", - "from": "sarah_chen", - "to": "engineering", - "type": "task_update", - "priority": "normal", - "channel": "#backend", - "content": "Completed API endpoint for user authentication. PR ready for review.", - "attachments": [ - {"type": "artifact", "ref": "pr-42"} - ], - "metadata": { - "task_id": "task-123", - "project_id": "proj-456", - "tokens_used": 1200, - "cost_usd": 0.018 - } -} -``` - -### 5.4 Communication Config - -```yaml -communication: - default_pattern: "hybrid" - message_bus: - backend: "internal" # internal, redis, rabbitmq, kafka - channels: - - "#all-hands" - - "#engineering" - - "#product" - - "#design" - - "#incidents" - - "#code-review" - - "#watercooler" - meetings: - enabled: true - types: - - name: "daily_standup" - frequency: "per_sprint_day" - participants: ["engineering", "qa"] - duration_tokens: 2000 - - name: "sprint_planning" - frequency: "bi_weekly" - participants: ["all"] - duration_tokens: 5000 - - name: "code_review" - trigger: "on_pr" - participants: ["author", "reviewers"] - hierarchy: - enforce_chain_of_command: true - allow_skip_level: false # can a junior message the CEO directly? -``` - -### 5.5 Loop Prevention - -Agent communication loops (A delegates to B who delegates back to A) are a critical risk. The framework enforces multiple safeguards: - -| Mechanism | Description | Default | -|-----------|-------------|---------| -| **Max delegation depth** | Hard limit on chain length (A→B→C→D stops at depth N) | 5 | -| **Message rate limit** | Max messages per agent pair within a time window | 10 per minute | -| **Identical request dedup** | Detects and rejects duplicate task delegations within a window | 60s window | -| **Circuit breaker** | If an agent pair exceeds error/bounce threshold, block further messages until manual reset or cooldown | 3 bounces → 5min cooldown | -| **Task ancestry tracking** | Every delegated task carries its full delegation chain; agents cannot delegate back to any ancestor in the chain | Always on | - -```yaml -loop_prevention: - max_delegation_depth: 5 - rate_limit: - max_per_pair_per_minute: 10 - burst_allowance: 3 - dedup_window_seconds: 60 - circuit_breaker: - bounce_threshold: 3 - cooldown_seconds: 300 - ancestry_tracking: true # always on, not configurable -``` - -When a loop is detected, the framework: -1. Blocks the looping message -2. Notifies the sending agent with the detected loop chain -3. Escalates to the sender's manager (or human if at top of hierarchy) -4. Logs the loop for analytics and process improvement - -> **Current state:** The communication foundation is implemented: `MessageBus` protocol with `InMemoryMessageBus` backend (asyncio queues, pull-model `receive()`), `MessageDispatcher` for concurrent handler routing via `asyncio.TaskGroup`, `AgentMessenger` per-agent facade (auto-fills sender/timestamp/ID, deterministic direct-channel naming `@{sorted_a}:{sorted_b}`), and `DeliveryEnvelope` for delivery tracking. Loop prevention (§5.5) is implemented: `DelegationGuard` orchestrates five mechanisms (ancestry, depth, dedup, rate limit, circuit breaker) with `LoopPreventionConfig`. Hierarchical delegation is implemented via `DelegationService` with `HierarchyResolver` and `AuthorityValidator`. Task model extended with `parent_task_id` and `delegation_chain` fields. Conflict resolution (§5.6) is implemented: `ConflictResolver` protocol with four strategies (Authority, Debate, HumanEscalation, Hybrid), `ConflictResolutionService` orchestrator, `DissentRecord` audit trail, and `HierarchyResolver.get_lowest_common_manager()` for cross-department conflict escalation. Meeting protocol (§5.7) is implemented with all 3 protocols (round-robin, position papers, structured phases) via `MeetingOrchestrator` in `communication/meeting/`. - -### 5.6 Conflict Resolution Protocol - -When two or more agents disagree on an approach (architecture, implementation, priority, etc.), the framework provides multiple configurable resolution strategies behind a `ConflictResolver` protocol. New strategies can be added without modifying existing ones. The strategy is configurable per company, per department, or per conflict type. - -> **Current state:** All four strategies implemented: `AuthorityResolver` (seniority + hierarchy proximity), `DebateResolver` (judge-based with `JudgeEvaluator` protocol), `HumanEscalationResolver` (stub pending approval queue #37), `HybridResolver` (automated review + escalation). `ConflictResolutionService` orchestrates strategy selection and audit trail (`DissentRecord`). Models: `Conflict`, `ConflictPosition`, `ConflictResolution` (frozen Pydantic). Config: `ConflictResolutionConfig`, `DebateConfig`, `HybridConfig`. `HierarchyResolver` extended with `get_lowest_common_manager()` and `get_delegation_depth()`. Event constants in `observability/events/conflict.py`. - -#### Strategy 1: Authority + Dissent Log (Default) - -The agent with higher authority level decides. Cross-department conflicts (incomparable authority) escalate to the lowest common manager in the hierarchy. The losing agent's reasoning is preserved as a **dissent record** — a structured log entry containing the conflict context, both positions, and the resolution. Dissent records feed into organizational learning and can be reviewed during retrospectives. - -```yaml -conflict_resolution: - strategy: "authority" # authority, debate, human, hybrid -``` - -- Deterministic, zero extra tokens, fast resolution -- Dissent records create institutional memory of alternative approaches - -#### Strategy 2: Structured Debate + Judge - -Both agents present arguments (1 round each). A judge — their shared manager, the CEO, or a configurable arbitrator agent — evaluates both positions and decides. The judge's reasoning and both arguments are logged as a dissent record. - -```yaml -conflict_resolution: - strategy: "debate" - debate: - judge: "shared_manager" # shared_manager, ceo, designated_agent -``` - -- Better decisions — forces agents to articulate reasoning -- Higher token cost, adds latency proportional to argument length - -#### Strategy 3: Human Escalation - -All genuine conflicts go to the human approval queue with both positions summarized. The agent(s) park the conflicting task and work on other tasks while waiting (see §12.4 Approval Timeout). - -```yaml -conflict_resolution: - strategy: "human" -``` - -- Safest — human always makes the call -- Bottleneck at scale, depends on human availability - -#### Strategy 4: Hybrid (Recommended for Production) - -Combines strategies with an intelligent review layer: - -1. Both agents present arguments (1 round) — preserving dissent -2. A **conflict review agent** evaluates the result: - - If the resolution is **clear** (one position is objectively better, or authority applies cleanly) → resolve automatically, log dissent record - - If the resolution is **ambiguous** (genuine trade-offs, no clear winner) → escalate to human queue with both positions + the review agent's analysis - -```yaml -conflict_resolution: - strategy: "hybrid" - hybrid: - review_agent: "conflict_reviewer" # dedicated agent or role - escalate_on_ambiguity: true -``` - -- Best balance: most conflicts resolve fast, humans only see genuinely hard calls -- Most complex to implement; review agent itself needs careful prompt design - -### 5.7 Meeting Protocol - -Meetings (§5.1 Pattern 3) follow configurable protocols that determine how agents interact during structured multi-agent conversations. Different meeting types naturally suit different protocols. All protocols implement a `MeetingProtocol` protocol, making the system extensible — new protocols can be registered and selected per meeting type. Cost bounds are enforced by `duration_tokens` in meeting config (§5.4). - -> **Current state:** All 3 meeting protocols are implemented in `communication/meeting/`: `RoundRobinProtocol`, `PositionPapersProtocol`, and `StructuredPhasesProtocol`. The `MeetingOrchestrator` runs meetings end-to-end with token budget enforcement via `TokenTracker`. Shared LLM response parsing for decisions and action items is in `_parsing.py`. All protocols implement the `MeetingProtocol` protocol interface. - -#### Protocol 1: Round-Robin Transcript - -The meeting leader calls each participant in turn. A shared transcript grows as each agent responds, seeing all prior contributions. The leader summarizes and extracts action items at the end. - -```yaml -meeting_protocol: "round_robin" -round_robin: - max_turns_per_agent: 2 - max_total_turns: 16 - leader_summarizes: true -``` - -- Simple, natural conversation feel, each agent sees full context -- Token cost grows quadratically; last speaker has more context (ordering bias) -- **Best for**: Daily standups, status updates, small groups (3-5 agents) - -#### Protocol 2: Async Position Papers + Synthesizer - -Each agent independently writes a short position paper (parallel execution, no shared context). A synthesizer agent reads all positions, identifies agreements and conflicts, and produces decisions + action items. - -```yaml -meeting_protocol: "position_papers" -position_papers: - max_tokens_per_position: 300 - synthesizer: "meeting_leader" # who synthesizes -``` - -- Cheapest — parallel calls, no quadratic growth, no ordering bias, no groupthink -- Loses back-and-forth dialogue; agents can't challenge each other's ideas -- **Best for**: Brainstorming, architecture proposals, large groups, cost-sensitive meetings - -#### Protocol 3: Structured Phases - -Meeting split into phases with targeted participation: - -1. **Agenda broadcast** — leader shares agenda and context to all participants -2. **Input gathering** — each agent submits input independently (parallel) -3. **Discussion round** — only triggered if conflicts are detected between inputs; relevant agents debate (1 round, capped tokens) -4. **Decision + action items** — leader synthesizes, creates tasks from action items - -```yaml -meeting_protocol: "structured_phases" -auto_create_tasks: true # action items become tasks (top-level, applies to any protocol) -structured_phases: - skip_discussion_if_no_conflicts: true - max_discussion_tokens: 1000 -``` - -- Cost-efficient — parallel input, discussion only when needed -- More complex orchestration; conflict detection between inputs needs design -- **Best for**: Sprint planning, design reviews, architecture decisions - ---- - -## 6. Task & Workflow Engine - -### 6.1 Task Lifecycle - -```text - ┌──────────┐ - │ CREATED │ - └─────┬─────┘ - │ assignment - ┌─────▼─────┐ ┌──────────┐ - ┌──────│ ASSIGNED │──────────▶│ FAILED │ - │ └─────┬─────┘◀───┐ └────┬─────┘ - │ │ starts │ reassign │ - │ ┌─────▼─────┐ │ ┌────▼─────┐ - │ │IN_PROGRESS │───┼─────▶│ (retry) │ - │ └─────┬─────┘ │ └──────────┘ - │ │ ◀── (rework) - │ │ agent done - │ ┌─────▼─────┐ - │ │ IN_REVIEW │ - │ └─────┬─────┘ - │ │ approved - │ ┌─────▼─────┐ - │ │ COMPLETED │ - │ └────────────┘ - │ - │ blocked cancelled (from ASSIGNED or IN_PROGRESS) - ┌─────▼─────┐ ┌────────────┐ - │ BLOCKED │ │ CANCELLED │ ◀── ASSIGNED / IN_PROGRESS - └─────┬─────┘ └────────────┘ - │ unblocked (terminal) - └──▶ ASSIGNED - - shutdown signal: - ┌─────────────┐ - │ INTERRUPTED │──── reassign on restart ──▶ ASSIGNED - └─────────────┘ -``` - -> **Non-terminal states:** BLOCKED, FAILED, and INTERRUPTED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6), INTERRUPTED returns to ASSIGNED on restart (see §6.7). COMPLETED and CANCELLED are terminal states with no outgoing transitions. -> -> **Transitions into FAILED:** Both `ASSIGNED → FAILED` (early setup failures) and `IN_PROGRESS → FAILED` (runtime crashes) are valid. `FAILED → ASSIGNED` enables reassignment when `retry_count < max_retries`. -> -> **Transitions into INTERRUPTED:** Both `ASSIGNED → INTERRUPTED` and `IN_PROGRESS → INTERRUPTED` are valid (graceful shutdown can occur at any active phase). `INTERRUPTED → ASSIGNED` enables reassignment on restart. - -> **Runtime wrapper:** During execution, `Task` is wrapped by `TaskExecution` (in `engine/task_execution.py`). `TaskExecution` is a frozen Pydantic model that tracks status transitions via `model_copy(update=...)`, accumulates `TokenUsage` cost, and records a `StatusTransition` audit trail. The original `Task` is preserved unchanged; `to_task_snapshot()` produces a `Task` copy with the current execution status for persistence. - -### 6.2 Task Definition - -```yaml -task: - id: "task-123" - title: "Implement user authentication API" - description: "Create REST endpoints for login, register, logout with JWT tokens" - type: "development" # development, design, research, review, meeting, admin - priority: "high" # critical, high, medium, low - project: "proj-456" - created_by: "product_manager_1" - assigned_to: "sarah_chen" - reviewers: ["engineering_lead", "security_engineer"] - dependencies: ["task-120", "task-121"] - artifacts_expected: - - type: "code" - path: "src/auth/" - - type: "tests" - path: "tests/auth/" - - type: "documentation" - path: "docs/api/auth.md" - acceptance_criteria: - - "JWT-based auth with refresh tokens" - - "Rate limiting on login endpoint" - - "Unit and integration tests with >80% coverage" - - "API documentation" - estimated_complexity: "medium" # simple, medium, complex, epic - task_structure: "parallel" # sequential, parallel, mixed (see §6.9) - coordination_topology: "auto" # auto, sas, centralized, decentralized, context_dependent (see §6.9) - budget_limit: 2.00 # max USD for this task - deadline: null - max_retries: 1 # max reassignment attempts after failure (0 = no retry) - status: "assigned" - parent_task_id: null # parent task ID when created via delegation - delegation_chain: [] # ordered agent IDs of delegators (root first) -``` - -### 6.3 Workflow Types - -#### Sequential Pipeline - -```text -Requirements ──▶ Design ──▶ Implementation ──▶ Review ──▶ Testing ──▶ Deploy -``` - -#### Parallel Execution - -```text - ┌──▶ Frontend Dev ──┐ -Task ───┤ ├──▶ Integration ──▶ QA - └──▶ Backend Dev ──┘ -``` - -> **Current state:** `ParallelExecutor` (in `engine/parallel.py`) implements concurrent agent execution with `asyncio.TaskGroup`, configurable concurrency limits, resource locking for exclusive file access, error isolation, and progress tracking. Models in `engine/parallel_models.py`: `AgentAssignment`, `ParallelExecutionGroup`, `AgentOutcome`, `ParallelExecutionResult`, `ParallelProgress`. - -#### Kanban Board - -```text -Backlog │ Ready │ In Progress │ Review │ Done - ○ │ ○ │ ● │ ○ │ ●●● - ○ │ ○ │ ● │ │ ●● - ○ │ │ │ │ ● -``` - -#### Agile Sprints - -```text -Sprint Backlog → Sprint Execution → Review → Retrospective → Next Sprint -``` - -### 6.4 Task Routing & Assignment - -Tasks can be assigned through multiple strategies: - -| Strategy | Description | -|----------|-------------| -| **Manual** | Human or manager explicitly assigns | -| **Role-based** | Auto-assign to agents with matching role/skills | -| **Load-balanced** | Distribute evenly across available agents | -| **Auction** | Agents "bid" on tasks based on confidence/capability | -| **Hierarchical** | Flow down through management chain | -| **Cost-optimized** | Assign to cheapest capable agent | - -> **Current state:** All six strategies are implemented behind the `TaskAssignmentStrategy` protocol. Manual, Role-based, Load-balanced, Cost-optimized, and Auction strategies are in the static `STRATEGY_MAP`. Hierarchical requires a `HierarchyResolver` at runtime via `build_strategy_map(hierarchy=...)`. Config-level `TaskAssignmentConfig` validates strategy names against the known set. Scoring-based strategies filter out agents at capacity via `AssignmentRequest.max_concurrent_tasks`. Error signaling contract: `ManualAssignmentStrategy` raises exceptions (`TaskAssignmentError`, `NoEligibleAgentError`); scoring-based strategies return `AssignmentResult(selected=None)`. `TaskAssignmentService` propagates both patterns. - -### 6.5 Agent Execution Loop - -The agent execution loop defines how an agent processes a task from start to finish. The framework provides multiple configurable loop architectures behind an `ExecutionLoop` protocol, making the system extensible. The default can vary by task complexity, and is configurable per agent or role. - -> **Current state:** ReAct (Loop 1) and Plan-and-Execute (Loop 2) are implemented. `ParallelExecutor` enables concurrent `AgentEngine.run()` calls with `TaskGroup` + Semaphore concurrency limits, resource locking, and error isolation (see §6.3). Hybrid loop and auto-selection are planned. - -#### ExecutionLoop Protocol - -All loop implementations satisfy the `ExecutionLoop` runtime-checkable protocol (defined in `engine/loop_protocol.py`): - -- **`get_loop_type() -> str`** — returns a unique identifier (e.g. `"react"`) -- **`execute(...) -> ExecutionResult`** — runs the loop to completion, accepting `AgentContext`, `CompletionProvider`, optional `ToolInvoker`, optional `BudgetChecker`, optional `ShutdownChecker`, and optional `CompletionConfig` - -Supporting models: - -- **`TerminationReason`** — enum: `COMPLETED`, `MAX_TURNS`, `BUDGET_EXHAUSTED`, `SHUTDOWN`, `ERROR` -- **`TurnRecord`** — frozen per-turn stats (tokens, cost, tool calls, finish reason) -- **`ExecutionResult`** — frozen outcome with final context, termination reason, turn records, and optional error message (required when reason is `ERROR`) -- **`BudgetChecker`** — callback type `Callable[[AgentContext], bool]` invoked before each LLM call -- **`ShutdownChecker`** — callback type `Callable[[], bool]` checked at turn boundaries to initiate cooperative shutdown - -#### Loop 1: ReAct (Default for Simple Tasks) - -A single interleaved loop: the agent reasons about the current state, selects an action (tool call or response), observes the result, and repeats until done or `max_turns` is reached. - -```text -┌──────────────────────────────────────────┐ -│ ReAct Loop │ -│ │ -│ ┌─────────┐ ┌──────┐ ┌──────────┐ │ -│ │ Think │──▶│ Act │──▶│ Observe │ │ -│ └─────────┘ └──────┘ └────┬─────┘ │ -│ ▲ │ │ -│ └─────────────────────────┘ │ -│ │ -│ Terminate when: task complete, max │ -│ turns, budget exhausted, or error │ -└──────────────────────────────────────────┘ -``` - -```yaml -execution_loop: "react" # react, plan_execute, hybrid, auto -``` - -- Simple, proven, flexible. Easy to implement. Works well for short tasks -- Token-heavy on long tasks (re-reads full context every turn). No long-term planning — greedy step-by-step -- **Best for**: Simple tasks, quick fixes, single-file changes - -#### Loop 2: Plan-and-Execute - -A two-phase approach: the agent first generates a step-by-step plan, then executes each step sequentially. On failure, the agent can replan. Different models can be used for planning vs execution (e.g., large for planning, small for execution steps). - -```text -┌──────────────────────────────────────────┐ -│ Plan-and-Execute │ -│ │ -│ ┌──────────┐ ┌───────────────────┐ │ -│ │ Plan │───▶│ Execute Steps │ │ -│ │ (1 call) │ │ (N calls) │ │ -│ └──────────┘ └────────┬──────────┘ │ -│ ▲ │ │ -│ └────── replan ──────┘ │ -│ (on step failure) │ -└──────────────────────────────────────────┘ -``` - -```yaml -execution_loop: "plan_execute" -plan_execute: - planner_model: null # null = use agent's model; override for cost optimization - executor_model: null - max_replans: 3 -``` - -- Token-efficient for long tasks. Auditable plan artifact. Supports model tiering -- Rigid — plan may be wrong, replanning is expensive. Over-plans simple tasks -- **Best for**: Complex multi-step tasks, epic-level work, tasks spanning multiple files - -#### Loop 3: Hybrid Plan + ReAct Steps (Recommended for Complex Tasks) - -The agent creates a high-level plan (3-7 steps). Each step is executed as a mini-ReAct loop with its own turn limit. After each step, the agent checkpoints — summarizing progress and optionally replanning remaining steps. Checkpoints are natural points for human inspection or task suspension. - -```text -┌──────────────────────────────────────────────┐ -│ Hybrid: Plan + ReAct Steps │ -│ │ -│ ┌──────────┐ │ -│ │ Plan │ │ -│ └────┬─────┘ │ -│ │ │ -│ ┌────▼────────────────────────────────────┐ │ -│ │ Step 1: mini-ReAct (think→act→observe) │ │ -│ └────┬────────────────────────────────────┘ │ -│ │ checkpoint: summarize progress │ -│ ┌────▼────────────────────────────────────┐ │ -│ │ Step 2: mini-ReAct │ │ -│ └────┬────────────────────────────────────┘ │ -│ │ checkpoint: replan if needed │ -│ ┌────▼────────────────────────────────────┐ │ -│ │ Step N: mini-ReAct │ │ -│ └─────────────────────────────────────────┘ │ -└──────────────────────────────────────────────┘ -``` - -```yaml -execution_loop: "hybrid" -hybrid: - max_plan_steps: 7 - max_turns_per_step: 5 - checkpoint_after_each_step: true - allow_replan: true -``` - -- Strategic planning + tactical flexibility. Natural checkpoints for suspension/inspection -- Most complex to implement. Plan granularity needs tuning per task type -- **Best for**: Complex tasks, multi-file refactoring, tasks requiring both planning and adaptivity - -> **Auto-selection (optional):** When `execution_loop: "auto"`, the framework selects the loop based on `estimated_complexity`: simple → ReAct, medium → Plan-and-Execute, complex/epic → Hybrid. Configurable via `auto_loop_rules` — a mapping of complexity thresholds to loop implementations (e.g., `{simple_max_tokens: 500, medium_max_tokens: 3000}` with corresponding loop assignments). - -#### AgentEngine Orchestrator - -`AgentEngine` (in `engine/agent_engine.py`) is the top-level entry point for running an agent on a task. It composes the execution loop with prompt construction, context management, tool invocation, and cost tracking into a single `run()` call. - -**`async run(identity, task, completion_config?, max_turns?, memory_messages?, timeout_seconds?) -> AgentRunResult`** - -Pipeline steps: - -1. **Validate inputs** — agent must be `ACTIVE`, task must be `ASSIGNED` or `IN_PROGRESS`. Raises `ExecutionStateError` on violation. -2. **Pre-flight budget enforcement** — if `BudgetEnforcer` is provided, check monthly hard stop and daily limit via `check_can_execute()`, then apply auto-downgrade via `resolve_model()`. Raises `BudgetExhaustedError` or `DailyLimitExceededError` on violation. -3. **Build system prompt** — calls `build_system_prompt()` with agent identity and task. Tool definitions are NOT included — they are supplied via the API's `tools` parameter (see D22 below). Follows the **non-inferable-only principle**: system prompts include only information the agent cannot discover by reading the codebase or environment (role constraints, custom conventions, organizational policies). Generic architecture overviews and file structure descriptions are excluded — [research](https://arxiv.org/abs/2602.11988) shows they reduce success rates while increasing costs 20%+. - -> **Decision ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D22):** Do NOT list available tools in the system prompt — the API's `tools` parameter already injects richer tool definitions including JSON schemas. The system prompt listing is strictly inferior (no schemas) and wastes 200-400+ tokens per call. Behavioral guidance ("when to use tool X vs Y") may be added later as non-redundant value. -4. **Create context** — `AgentContext.from_identity()` with the configured `max_turns`. -5. **Seed conversation** — injects system prompt, optional memory messages, and formatted task instruction as initial messages. -6. **Transition task** — `ASSIGNED` → `IN_PROGRESS` (pass-through if already `IN_PROGRESS`). -7. **Prepare tools and budget** — creates `ToolInvoker` from registry and `BudgetChecker` from `BudgetEnforcer` (task + monthly + daily limits with pre-computed baselines and alert deduplication) or from task budget limit alone when no enforcer is configured. -8. **Delegate to loop** — calls `ExecutionLoop.execute()` with context, provider, tool invoker, budget checker, and completion config. If `timeout_seconds` is set, wraps the call in `asyncio.wait_for`; on expiry the run returns with `TerminationReason.ERROR` but cost recording and post-execution processing still occur. -9. **Record costs** — records accumulated `TokenUsage` to `CostTracker` (if available). Cost recording failures are logged but do not affect the result. -10. **Apply post-execution transitions** — on `COMPLETED` termination: IN_PROGRESS → IN_REVIEW → COMPLETED (two-hop auto-complete; reviewers planned). On `SHUTDOWN` termination: current status → INTERRUPTED (see §6.7). On `ERROR` termination: recovery strategy is applied (default `FailAndReassignStrategy` transitions to FAILED; see §6.6). All other termination reasons (`MAX_TURNS`, `BUDGET_EXHAUSTED`) leave the task in its current state. Transition failures are logged but do not discard the successful execution result. -11. **Return result** — wraps `ExecutionResult` in `AgentRunResult` with engine-level metadata. - -Error handling: `MemoryError` and `RecursionError` propagate unconditionally. `BudgetExhaustedError` (including `DailyLimitExceededError`) returns `TerminationReason.BUDGET_EXHAUSTED` without recovery — budget exhaustion is a controlled stop, not a crash. All other exceptions are caught and wrapped in an `AgentRunResult` with `TerminationReason.ERROR`. - -Constructor accepts: `provider` (required), `execution_loop` (defaults to `ReactLoop`), `tool_registry`, `cost_tracker`, `recovery_strategy` (defaults to `FailAndReassignStrategy`), `shutdown_checker`, `budget_enforcer`. The `run()` method also accepts `memory_messages` — optional working memory to inject between the system prompt and task instruction. - -Logs structured events under the `execution.engine.*` namespace (13 constants in `events/execution.py`): creation, start, prompt built, completion, errors, budget stopped, invalid input, task transitions, cost recording outcomes, task metrics, and timeout. - -**`AgentRunResult`** — frozen Pydantic model wrapping `ExecutionResult` with engine metadata: - -- `execution_result` — outcome from the execution loop -- `system_prompt` — the `SystemPrompt` used for this run -- `duration_seconds` — wall-clock run time -- `agent_id`, `task_id` — identifiers -- Computed fields: `termination_reason`, `total_turns`, `total_cost_usd`, `is_success`, `completion_summary` - -### 6.6 Agent Crash Recovery - -When an agent execution fails unexpectedly (unhandled exception, OOM, process kill), the framework needs a recovery mechanism. Recovery strategies are implemented behind a `RecoveryStrategy` protocol, making the system pluggable — new strategies can be added without modifying existing ones. - -> **MVP: Fail-and-Reassign only (Strategy 1).** Checkpoint Recovery is planned. - -**`RecoveryStrategy` protocol:** - -| Method | Signature | Description | -|--------|-----------|-------------| -| `recover` | `async def recover(*, task_execution: TaskExecution, error_message: str, context: AgentContext) -> RecoveryResult` | Apply recovery to a failed task execution | -| `get_strategy_type` | `def get_strategy_type() -> str` | Return strategy type identifier (must not be empty) | - -**`RecoveryResult` model (frozen):** - -| Field | Type | Description | -|-------|------|-------------| -| `task_execution` | `TaskExecution` | Updated execution after recovery (typically `FAILED`) | -| `strategy_type` | `NotBlankStr` | Strategy identifier | -| `context_snapshot` | `AgentContextSnapshot` | Redacted snapshot (turn count, accumulated cost, message count, max turns — no message contents) | -| `error_message` | `NotBlankStr` | Error that triggered recovery | -| `can_reassign` | `bool` (computed) | `retry_count < task.max_retries` | - -#### Strategy 1: Fail-and-Reassign (Default / MVP) - -The engine catches the failure at its outermost boundary, logs a redacted `AgentContext` snapshot (turn count, accumulated cost — excluding message contents to avoid leaking sensitive prompts/tool outputs), transitions the task to `FAILED`, and makes it available for reassignment (manual or automatic via the task router). - -> **Non-terminal state:** `FAILED` is a `TaskStatus` variant alongside `CANCELLED`. `FAILED` differs from `CANCELLED` (which is terminal) in that failed tasks are eligible for automatic reassignment. Valid transitions: `IN_PROGRESS → FAILED`, `ASSIGNED → FAILED` (early setup failures), `FAILED → ASSIGNED` (reassignment). See the updated §6.1 lifecycle diagram. - -```yaml -crash_recovery: - strategy: "fail_reassign" # fail_reassign, checkpoint -``` - -- Simple, no persistence dependency -- All progress is lost on crash — acceptable for short single-agent tasks in the MVP - -On crash: -1. Catch exception at the `AgentEngine` boundary (outermost `try/except` in `AgentEngine.run()`) -2. Log at ERROR with redacted `AgentContextSnapshot` (turn count, accumulated cost, message count, max turns — message contents excluded) -3. Transition `TaskExecution` → `FAILED` with the exception as the failure reason -4. `RecoveryResult.can_reassign` reports whether `retry_count < max_retries` - -> **Current limitation:** The `can_reassign` flag is computed and returned in `RecoveryResult`, but automated reassignment is not yet implemented — the task router (§6.4) will consume this in a future release. The caller (task router) is responsible for incrementing `retry_count` when creating the next `TaskExecution`. - -#### Strategy 2: Checkpoint Recovery (Planned) - -The engine persists an `AgentContext` snapshot after each completed turn. On crash, the framework detects the failure (via heartbeat timeout or exception), loads the last checkpoint, and resumes execution from the exact turn where it left off. The immutable `model_copy(update=...)` pattern makes checkpointing trivial — each `AgentContext` is a complete, self-contained frozen state that serializes cleanly via `model_dump_json()`. - -```yaml -crash_recovery: - strategy: "checkpoint" - checkpoint: - persist_every_n_turns: 1 # checkpoint frequency - storage: "sqlite" # sqlite, filesystem - heartbeat_interval_seconds: 30 # detect unresponsive agents - max_resume_attempts: 2 # retry limit before falling back to fail_reassign -``` - -- Preserves progress — critical for long tasks (multi-step plans, epic-level work) -- Requires persistence layer and environment state reconciliation on resume -- Natural fit with the existing immutable state model - -> **Environment reconciliation:** When resuming from a checkpoint, the agent's tools and workspace may have changed (other agents modified files, external state drifted). The checkpoint strategy includes a reconciliation step: the resumed agent receives a summary of changes since the checkpoint timestamp and can adapt its plan accordingly. This is analogous to a developer returning to a branch after colleagues have pushed changes. - -### 6.7 Graceful Shutdown Protocol - -When the process receives SIGTERM/SIGINT (user Ctrl+C, Docker stop, systemd shutdown), the framework needs to stop cleanly without losing work or leaking costs. Shutdown strategies are implemented behind a `ShutdownStrategy` protocol, making the system pluggable — new strategies can be added without modifying existing ones. - -> **MVP: Cooperative with Timeout only (Strategy 1).** Other strategies are future options enabled by the protocol interface. - -#### Strategy 1: Cooperative with Timeout (Default / MVP) - -The engine sets a shutdown event, stops accepting new tasks, and gives in-flight agents a grace period to finish their current turn. Agents check the shutdown event at turn boundaries (between LLM calls, before tool invocations) and exit cooperatively. After the grace period, remaining agents are force-cancelled. **All tasks terminated by shutdown — whether they exited cooperatively or were force-cancelled — are marked `INTERRUPTED`** by the engine layer. - -```yaml -graceful_shutdown: - strategy: "cooperative_timeout" # cooperative_timeout, immediate, finish_tool, checkpoint - cooperative_timeout: - grace_seconds: 30 # time for agents to finish cooperatively - cleanup_seconds: 5 # time for final cleanup (persist cost records, close connections) -``` - -On shutdown signal: -1. Set `shutdown_event` (`asyncio.Event`) — agents check this at turn boundaries -2. Stop accepting new tasks (drain gate closes) -3. Wait up to `grace_seconds` for agents to exit cooperatively -4. Force-cancel remaining agents (`task.cancel()`) — tasks transition to `INTERRUPTED` -5. Cleanup phase (`cleanup_seconds`): persist cost records, close provider connections, flush logs - -> **Non-terminal status:** `INTERRUPTED` is a `TaskStatus` variant. Unlike `FAILED` (eligible for automatic reassignment) or `CANCELLED` (terminal), `INTERRUPTED` indicates the task was stopped due to process shutdown — regardless of whether the agent exited cooperatively or was force-cancelled — and is eligible for manual or automatic reassignment on restart. Valid transitions: `ASSIGNED → INTERRUPTED`, `IN_PROGRESS → INTERRUPTED`, `INTERRUPTED → ASSIGNED` (reassignment on restart). See the updated §6.1 lifecycle diagram. -> -> **Windows compatibility:** `loop.add_signal_handler()` is not supported on Windows. The implementation uses `signal.signal()` as a fallback. SIGINT (Ctrl+C) works cross-platform; SIGTERM on Windows requires `os.kill()`. -> -> **In-flight LLM calls:** Non-streaming API calls that are interrupted result in tokens billed but no response received (silent cost leak). The engine logs request start (with input token count) before each provider call, so interrupted calls have at minimum an input-cost audit record. Streaming calls are charged only for tokens sent before disconnect. - -#### Strategy 2: Immediate Cancel (Future Option) - -All agent tasks are cancelled immediately via `task.cancel()`. Fastest shutdown but highest data loss — partial tool side effects, billed-but-lost LLM responses. - -#### Strategy 3: Finish Current Tool (Future Option) - -Like cooperative timeout, but waits for the current tool invocation to complete even if it exceeds the grace period. Needs per-tool timeout as a backstop for long-running sandboxed execution. - -#### Strategy 4: Checkpoint and Stop (Planned) - -On shutdown signal, each agent persists its full `AgentContext` snapshot and transitions to `SUSPENDED`. On restart, the engine loads checkpoints and resumes execution. This naturally extends the `CheckpointStrategy` from §6.6 — the only difference is whether the checkpoint was written proactively (graceful shutdown) or loaded from the last turn (crash recovery). - -> **Planned non-terminal status:** `SUSPENDED` is a new `TaskStatus` variant for checkpoint-based shutdown, to be added alongside `INTERRUPTED`. - -### 6.8 Concurrent Workspace Isolation - -> **Current state:** The `WorkspaceIsolationStrategy` protocol, `PlannerWorktreeStrategy` (git worktree backend), `MergeOrchestrator` (sequential merge with configurable conflict escalation), and `WorkspaceIsolationService` (lifecycle orchestrator with rollback and best-effort teardown) are implemented in `engine/workspace/`. `_validate_git_ref` raises context-appropriate exception types (`WorkspaceMergeError` in merge, `WorkspaceCleanupError` in teardown) with matching log events. `_run_git` similarly accepts a `log_event` parameter for context-aware timeout logging. Runtime multi-agent coordination using these components is planned. - -When multiple agents work on the same codebase concurrently, they may need to edit overlapping files. The framework provides a pluggable `WorkspaceIsolationStrategy` protocol for managing concurrent file access. The default strategy combines intelligent task decomposition with git worktree isolation — the dominant industry pattern (used by OpenAI Codex, Cursor, Claude Code, VS Code background agents). - -#### Strategy 1: Planner + Git Worktrees (Default) - -The task planner decomposes work to minimize file overlap across agents. Each agent operates in its own git worktree (shared `.git` object database, independent working tree). On completion, branches are merged sequentially. - -```text -Planner decomposes task: -├─ Agent A: src/auth/ (worktree-A) -├─ Agent B: src/api/ (worktree-B) -└─ Agent C: tests/ (worktree-C) - -Each in isolated git worktree - │ -On completion: sequential merge -├─ Merge A → main -├─ Rebase B on main, merge -└─ Rebase C on main, merge - │ -Textual conflicts: git detects, escalate to human or review agent -Semantic conflicts: review agent evaluates merged result -``` - -```yaml -workspace_isolation: - strategy: "planner_worktrees" # planner_worktrees, sequential, file_locking - planner_worktrees: - max_concurrent_worktrees: 8 - merge_order: "completion" # completion (first done merges first), priority, manual - conflict_escalation: "human" # human, review_agent -``` - -- True filesystem isolation — agents cannot overwrite each other's work -- Maximum parallelism during execution; conflicts deferred to merge time -- Leverages mature git infrastructure for merge, diff, and history - -#### Strategy 2: Sequential Dependencies (Future Option) - -Tasks with overlapping file scopes are ordered sequentially via a dependency graph. Prevents conflicts by construction but limits parallelism. Requires upfront knowledge of which files a task will touch. - -#### Strategy 3: File-Level Locking (Future Option) - -Files are locked at task assignment time. Eliminates conflicts at the source but requires predicting file access — difficult for LLM agents that discover what to edit as they go. Risk of deadlock if multiple agents need overlapping file sets. - -#### State Coordination vs Workspace Isolation - -These are complementary systems handling different types of shared state: - -| State Type | Coordination | Mechanism | -|-----------|-------------|-----------| -| Framework state (tasks, assignments, budget) | Centralized single-writer (`TaskEngine`) | `model_copy(update=...)` via async queue | -| Code and files (agent work output) | Workspace isolation (`WorkspaceIsolationStrategy`) | Git worktrees / branches | -| Agent memory (personal) | Per-agent ownership | Each agent owns its memory exclusively | -| Org memory (shared knowledge) | Single-writer (`OrgMemoryBackend`) | `OrgMemoryBackend` protocol with role-based write access control | - -### 6.9 Task Decomposability & Coordination Topology - -> **Current state:** Task structure classification (`TaskStructureClassifier`), DAG-based decomposition (`DecompositionService`, `DependencyGraph`, `ManualDecompositionStrategy`), LLM-based decomposition (`LlmDecompositionStrategy` with tool calling and JSON content fallback), status rollup (`StatusRollup`), agent-task scoring (`AgentTaskScorer`), routing (`TaskRoutingService`), and auto topology selection (`TopologySelector`) are implemented in `engine/decomposition/` and `engine/routing/`. Workspace isolation (`PlannerWorktreeStrategy`, `MergeOrchestrator`, `WorkspaceIsolationService`) is implemented in `engine/workspace/`. Runtime multi-agent coordination is planned. - -Empirical research on agent scaling ([Kim et al., 2025](https://arxiv.org/abs/2512.08296) — 180 controlled experiments across 3 LLM families and 4 benchmarks) demonstrates that **task decomposability is the strongest predictor of multi-agent effectiveness** — stronger than team size, model capability, or coordination architecture. - -#### Task Structure Classification - -Each task carries a `task_structure` field (see §6.2 Task Definition) classifying its decomposability: - -| Structure | Description | MAS Effect | Example | -|-----------|-------------|------------|---------| -| `sequential` | Steps must execute in strict order; each depends on prior state | **Negative** (−39% to −70%) | Multi-step build processes, ordered migrations, chained API calls | -| `parallel` | Sub-problems can be investigated independently, then synthesized | **Positive** (+57% to +81%) | Financial analysis (revenue + cost + market), multi-file review, research across sources | -| `mixed` | Some sub-tasks are parallel, but a sequential backbone connects phases | **Variable** (depends on ratio) | Feature implementation (design ∥ research → implement → test) | - -Classification can be: -- **Explicit** — set in task config by the task creator or manager agent -- **Inferred** — derived from task properties (tool count, dependency graph, acceptance criteria structure) by the task router - -#### Per-Task Coordination Topology - -The communication pattern (§5.1) is configured at the company level, but **coordination topology can be selected per-task** based on task structure and properties. This allows the engine to use the most efficient coordination approach for each task rather than applying a single company-wide pattern. - -| Task Properties | Recommended Topology | Rationale | -|----------------|---------------------|-----------| -| `sequential` + few artifacts (≤4) | **Single-agent (SAS)** | Coordination overhead fragments reasoning capacity on sequential tasks | -| `parallel` + structured domain | **Centralized** | Orchestrator decomposes, sub-agents execute in parallel, orchestrator synthesizes. Lowest error amplification (4.4×) | -| `parallel` + exploratory/open-ended | **Decentralized** | Peer debate enables diverse exploration of high-entropy search spaces | -| `mixed` | **Context-dependent** | Sequential backbone handled by single agent; parallel sub-tasks delegated to sub-agents | - -#### Auto Topology Selector - -When topology is set to `"auto"`, the engine selects coordination topology based on measurable task properties: - -```yaml -coordination: - topology: "auto" # auto, sas, centralized, decentralized, context_dependent - auto_topology_rules: - # sequential tasks → always single-agent - sequential_override: "sas" - # parallel tasks → select based on domain structure - parallel_default: "centralized" - # mixed tasks → SAS backbone for sequential phases, delegates parallel sub-tasks - mixed_default: "context_dependent" # hybrid: not a single topology — engine selects per-phase -``` - -The auto-selector uses task structure, artifact count, and (when available from the memory subsystem) historical single-agent success rate as inputs. The exact selection logic is an implementation detail — the spec defines the interface and the empirically-grounded heuristics above. - -> **Reference:** These heuristics are derived from Kim et al. (2025), which achieved 87% accuracy predicting optimal architecture from task properties across held-out configurations. Our context differs (role-differentiated agents vs. identical agents), so thresholds should be validated empirically once multi-agent execution is implemented. - ---- - -## 7. Memory & Persistence - -### 7.1 Memory Architecture - -```text -┌─────────────────────────────────────────────┐ -│ Agent Memory System │ -├──────────┬──────────┬───────────┬───────────┤ -│ Working │ Episodic │ Semantic │Procedural │ -│ Memory │ Memory │ Memory │ Memory │ -│ │ │ │ │ -│ Current │ Past │ Knowledge │ Skills & │ -│ task │ events & │ & facts │ how-to │ -│ context │ decisions│ learned │ │ -├──────────┴──────────┴───────────┴───────────┤ -│ Storage Backend │ -│ SQLite / PostgreSQL / File-based │ -│ + Mem0 (initial) / Custom Stack (future) │ -│ See ADR-001 │ -└─────────────────────────────────────────────┘ -``` - -### 7.2 Memory Types - -| Type | Scope | Persistence | Example | -|------|-------|-------------|---------| -| **Working** | Current task | None (in-context) | "I'm implementing the auth endpoint" | -| **Episodic** | Past events | Configurable | "Last sprint we chose JWT over sessions" | -| **Semantic** | Knowledge | Long-term | "This project uses Litestar with aiosqlite" | -| **Procedural** | Skills/patterns | Long-term | "Code reviews require 2 approvals here" | -| **Social** | Relationships | Long-term | "The QA lead prefers detailed test plans" | - -### 7.3 Memory Levels (Configurable) - -```yaml -memory: - level: "persistent" # none, session, project, persistent (default: session) - backend: "mem0" # mem0, custom, cognee, graphiti (future) — see ADR-001 - storage: - data_dir: "/data/memory" # mounted Docker volume path - vector_store: "qdrant" # qdrant (embedded), qdrant-external, etc. - history_store: "sqlite" # sqlite, postgresql - options: - retention_days: null # null = forever - max_memories_per_agent: 10000 - consolidation_interval: "daily" # compress old memories - shared_knowledge_base: true # agents can access shared facts (see §7.4) -``` - -### 7.4 Shared Organizational Memory - -Beyond individual agent memory (§7.1–7.3), the framework needs **organizational memory** — company-wide knowledge that all agents can access: policies, conventions, architecture decision records (ADRs), coding standards, and operational procedures. This is not personal episodic memory ("what I did last Tuesday") but institutional knowledge ("we always use Litestar, not Flask"). - -Shared organizational memory is implemented behind an `OrgMemoryBackend` protocol, making the system highly modular and extensible. New backends can be added without modifying existing ones. - -#### Backend 1: Hybrid Prompt + Retrieval (Default / MVP) - -Critical rules (5-10 items, e.g., "no commits to main," "all PRs need 2 approvals") are injected into every agent's system prompt. Extended knowledge (ADRs, detailed procedures, style guides) is stored in a queryable store and retrieved on demand at task start. - -```yaml -org_memory: - backend: "hybrid_prompt_retrieval" # hybrid_prompt_retrieval, graph_rag, temporal_kg - core_policies: # always in system prompt - - "All code must have 80%+ test coverage" - - "Use Litestar, not Flask" - - "PRs require 2 approvals" - extended_store: - backend: "sqlite" # sqlite, postgresql - max_retrieved_per_query: 5 - write_access: - policies: ["human"] # only humans write core policies - adrs: ["human", "senior", "lead", "c_suite"] - procedures: ["human", "senior", "lead", "c_suite"] -``` - -- Simple to implement. Core rules always present. Extended knowledge scales -- Basic retrieval may miss relational connections between policies - -#### Research Directions - -The following backends illustrate why `OrgMemoryBackend` is a protocol — the architecture supports future upgrades without modifying existing code. These are **not planned implementations**; they are research directions that may inform future work if/when organizational memory needs outgrow the Hybrid Prompt + Retrieval approach. - -#### Backend 2: GraphRAG Knowledge Graph (Research) - -Organizational knowledge stored as entities + relationships in a knowledge graph. Agents query via graph traversal, enabling multi-hop reasoning: "Litestar is our standard" → linked to → "don't use Flask" → linked to → "exception: data team uses Django for admin." - -```yaml -org_memory: - backend: "graph_rag" - graph: - store: "sqlite" # graph stored in relational DB, or dedicated graph DB - entity_extraction: "auto" # auto-extract entities from ADRs and policies -``` - -- Significant accuracy improvement over vector-only retrieval (some benchmarks report 3–4x gains). Multi-hop reasoning captures policy relationships -- More complex infrastructure. Entity extraction can be noisy. Heavier setup - -#### Backend 3: Temporal Knowledge Graph (Research) - -Like GraphRAG but tracks how facts change over time. "We used Flask until March 2026, then switched to Litestar." Agents see current truth but can query history for context. - -```yaml -org_memory: - backend: "temporal_kg" - temporal: - track_changes: true - history_retention_days: null # null = forever -``` - -- Handles policy evolution naturally. Agents understand when and why things changed -- Most complex. Potentially overkill for small companies or local-first use - -> **Extensibility:** All backends implement the `OrgMemoryBackend` protocol (`query(OrgMemoryQuery) → tuple[OrgFact, ...]`, `write(OrgFactWriteRequest, *, author: OrgFactAuthor) → NotBlankStr`, `list_policies() → tuple[OrgFact, ...]`, plus `connect`/`disconnect`/`health_check`/`is_connected`/`backend_name` lifecycle). The MVP ships with Backend 1; Backends 2 and 3 are research directions that may be explored if the default approach proves insufficient. The selected memory layer backend Mem0 (ADR-001) provides optional graph memory via Neo4j/FalkorDB, which could reduce implementation effort for Backends 2-3. -> **Write access control:** Core policies are human-only. ADRs and procedures can be written by senior+ agents. All writes are versioned and auditable. This prevents agents from corrupting shared organizational knowledge while allowing senior agents to document decisions. - -### 7.5 Memory Backend Protocol - -Agent memory (§7.1–7.4) is implemented behind a pluggable `MemoryBackend` protocol (Mem0 initial, custom stack future — ADR-001). Application code depends only on the protocol; the storage engine is an implementation detail swappable via config. - -#### Enums - -| Enum | Values | Purpose | -|------|--------|---------| -| `MemoryCategory` | WORKING, EPISODIC, SEMANTIC, PROCEDURAL, SOCIAL | Memory type categories (§7.2) | -| `MemoryLevel` | PERSISTENT, PROJECT, SESSION, NONE | Persistence level per agent (§7.3) | -| `ConsolidationInterval` | HOURLY, DAILY, WEEKLY, NEVER | How often old memories are compressed | - -#### MemoryBackend Protocol - -```python -@runtime_checkable -class MemoryBackend(Protocol): - """Lifecycle + CRUD for agent memory storage.""" - - async def connect(self) -> None: ... - async def disconnect(self) -> None: ... - async def health_check(self) -> bool: ... - - @property - def is_connected(self) -> bool: ... - @property - def backend_name(self) -> NotBlankStr: ... - - async def store(self, agent_id: NotBlankStr, request: MemoryStoreRequest) -> NotBlankStr: ... - async def retrieve(self, agent_id: NotBlankStr, query: MemoryQuery) -> tuple[MemoryEntry, ...]: ... - async def get(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> MemoryEntry | None: ... - async def delete(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> bool: ... - async def count(self, agent_id: NotBlankStr, *, category: MemoryCategory | None = None) -> int: ... -``` - -#### MemoryCapabilities Protocol - -Backends that implement `MemoryCapabilities` expose what features they support, enabling runtime capability checks before attempting operations. - -```python -@runtime_checkable -class MemoryCapabilities(Protocol): - """Capability discovery for memory backends.""" - - @property - def supported_categories(self) -> frozenset[MemoryCategory]: ... - @property - def supports_graph(self) -> bool: ... - @property - def supports_temporal(self) -> bool: ... - @property - def supports_vector_search(self) -> bool: ... - @property - def supports_shared_access(self) -> bool: ... - @property - def max_memories_per_agent(self) -> int | None: ... -``` - -#### SharedKnowledgeStore Protocol - -Backends that support cross-agent shared knowledge implement this protocol alongside `MemoryBackend`. Not all backends need cross-agent queries — this keeps the base protocol clean. - -```python -@runtime_checkable -class SharedKnowledgeStore(Protocol): - """Cross-agent shared knowledge operations.""" - - async def publish(self, agent_id: NotBlankStr, request: MemoryStoreRequest) -> NotBlankStr: ... - async def search_shared(self, query: MemoryQuery, *, exclude_agent: NotBlankStr | None = None) -> tuple[MemoryEntry, ...]: ... - async def retract(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> bool: ... -``` - -#### Error Hierarchy - -All memory errors inherit from `MemoryError` so callers can catch the entire family with a single except clause. - -| Error | When Raised | -|-------|------------| -| `MemoryError` | Base exception for all memory operations | -| `MemoryConnectionError` | Backend connection cannot be established or is lost | -| `MemoryStoreError` | A store or delete operation fails | -| `MemoryRetrievalError` | A retrieve, search, or count operation fails | -| `MemoryNotFoundError` | A specific memory ID is not found | -| `MemoryConfigError` | Memory configuration is invalid | -| `MemoryCapabilityError` | An unsupported operation is attempted for a backend | - -#### Configuration - -```yaml -memory: - backend: "mem0" - level: "persistent" # none, session, project, persistent (default: session) - storage: - data_dir: "/data/memory" - vector_store: "qdrant" - history_store: "sqlite" - options: - retention_days: null # null = forever - max_memories_per_agent: 10000 - consolidation_interval: "daily" - shared_knowledge_base: true -``` - -Configuration is modeled by `CompanyMemoryConfig` (top-level), `MemoryStorageConfig` (storage paths/backends), and `MemoryOptionsConfig` (behaviour tuning). All are frozen Pydantic models. The `create_memory_backend(config)` factory returns an isolated `MemoryBackend` instance per company. - -#### Consolidation & Retention Configuration - -Memory consolidation, retention enforcement, and archival are configured via frozen Pydantic models in `memory/consolidation/config.py`: - -| Config | Purpose | -|--------|---------| -| `ConsolidationConfig` | Top-level: `max_memories_per_agent` limit, nested `retention` and `archival` sub-configs | -| `RetentionConfig` | Per-category `RetentionRule` tuples (category + retention_days), optional `default_retention_days` fallback | -| `ArchivalConfig` | Enables/disables archival of consolidated entries to `ArchivalStore` | - -Note: Retention is currently per-category, not per-agent. Per-agent retention overrides are a scope gap to be addressed in a future iteration. - -### 7.6 Operational Data Persistence - -Agent memory (§7.1–7.5) is handled by the `MemoryBackend` protocol (Mem0 initial, custom stack future — ADR-001). **Operational data** — tasks, cost records, messages, audit logs — is a separate concern managed by a pluggable `PersistenceBackend` protocol. Application code depends only on repository protocols; the storage engine is an implementation detail swappable via config. - -```text -┌──────────────────────────────────────────────────────────────────┐ -│ Application Code │ -│ engine/ budget/ communication/ security/ │ -│ │ │ │ │ │ -│ ▼ ▼ ▼ ▼ │ -│ ┌──────┐ ┌──────┐ ┌──────────┐ ┌──────────┐ │ -│ │ Task │ │ Cost │ │ Message │ │ Audit │ ← Repository │ -│ │ Repo │ │ Repo │ │ Repo │ │ Repo │ Protocols │ -│ └──┬───┘ └──┬───┘ └────┬─────┘ └────┬─────┘ │ -│ └────────┴──────────┴────────────┘ │ -│ │ │ -│ ┌───────────────────┴───────────────────────────────────────┐ │ -│ │ PersistenceBackend (protocol) │ │ -│ │ connect() · disconnect() · health_check() · migrate() │ │ -│ └───────────────────┬───────────────────────────────────────┘ │ -│ │ │ -│ ┌───────────────────┴───────────────────────────────────────┐ │ -│ │ SQLitePersistenceBackend (initial) │ │ -│ │ PostgresPersistenceBackend (future) │ │ -│ │ MariaDBPersistenceBackend (future) │ │ -│ └───────────────────────────────────────────────────────────┘ │ -└──────────────────────────────────────────────────────────────────┘ -``` - -#### Protocol Design - -```python -@runtime_checkable -class PersistenceBackend(Protocol): - """Lifecycle management for operational data storage.""" - - async def connect(self) -> None: ... - async def disconnect(self) -> None: ... - async def health_check(self) -> bool: ... - async def migrate(self) -> None: ... - - @property - def is_connected(self) -> bool: ... - @property - def backend_name(self) -> NotBlankStr: ... - - @property - def tasks(self) -> TaskRepository: ... - @property - def cost_records(self) -> CostRecordRepository: ... - @property - def messages(self) -> MessageRepository: ... - # ... plus lifecycle_events, task_metrics, collaboration_metrics, - # parked_contexts, audit_entries -``` - -Each entity type has its own repository protocol: - -```python -@runtime_checkable -class TaskRepository(Protocol): - """CRUD + query interface for Task persistence.""" - - async def save(self, task: Task) -> None: ... - async def get(self, task_id: str) -> Task | None: ... - async def list_tasks(self, *, status: TaskStatus | None = None, assigned_to: str | None = None, project: str | None = None) -> tuple[Task, ...]: ... - async def delete(self, task_id: str) -> bool: ... - -@runtime_checkable -class CostRecordRepository(Protocol): - """CRUD + aggregation interface for CostRecord persistence.""" - - async def save(self, record: CostRecord) -> None: ... - async def query(self, *, agent_id: str | None = None, task_id: str | None = None) -> tuple[CostRecord, ...]: ... - async def aggregate(self, *, agent_id: str | None = None) -> float: ... - -@runtime_checkable -class MessageRepository(Protocol): - """CRUD + query interface for Message persistence.""" - - async def save(self, message: Message) -> None: ... - async def get_history(self, channel: str, *, limit: int | None = None) -> tuple[Message, ...]: ... -``` - -#### Configuration - -```yaml -persistence: - backend: "sqlite" # sqlite, postgresql, mariadb (future) - sqlite: - path: "/data/synthorg.db" # database file path (mounted volume in Docker) - wal_mode: true # WAL for concurrent read performance - journal_size_limit: 67108864 # 64 MB WAL journal limit - # postgresql: # future - # url: "postgresql://user:pass@host:5432/synthorg" - # pool_size: 10 - # mariadb: # future - # url: "mariadb://user:pass@host:3306/synthorg" - # pool_size: 10 -``` - -#### Entities Persisted - -| Entity | Source Module | Repository | Key Queries | -|--------|-------------|------------|-------------| -| `Task` | `core/task.py` | `TaskRepository` | by status, by assignee, by project | -| `CostRecord` | `budget/cost_record.py` | `CostRecordRepository` | by agent, by task, aggregations | -| `Message` | `communication/message.py` | `MessageRepository` | by channel | -| `AuditEntry` | `security/models.py` | `AuditRepository` | by agent, by action type, by verdict, by risk level, time range | -| `ParkedContext` | `security/timeout/parked_context.py` | `ParkedContextRepository` | by execution_id, by agent_id, by task_id | -| Agent runtime state (planned) | `engine/` | `AgentStateRepository` (planned) | by agent_id, active agents | - -#### Migration Strategy - -- Migrations run programmatically at startup via `PersistenceBackend.migrate()` -- Initial migration creates all tables -- Versioned migrations implemented per-backend (e.g. `persistence/sqlite/migrations.py` for SQLite) -- SQLite uses `user_version` pragma for version tracking; PostgreSQL/MariaDB use a migrations table - -#### Key Principles - -- **App code never imports a concrete backend** — only repository protocols -- **Adding a new backend** requires implementing `PersistenceBackend` + all repository protocols — no changes to consumers -- **Same entity models everywhere** — repositories accept and return the existing frozen Pydantic models (Task, CostRecord, Message), no ORM models or data transfer objects -- **Async throughout** — all repository methods are async, matching the project's concurrency model - -#### Multi-Tenancy - -Each company gets its own database. The `PersistenceConfig` embedded in a company's `RootConfig` specifies the backend type and connection details (e.g. a unique SQLite file path or PostgreSQL database URL). The `create_backend(config)` factory returns an isolated `PersistenceBackend` instance per company — no shared state, no cross-company data leakage. - -```python -# One database per company — configured in each company's YAML -company_a_backend = create_backend(company_a_config.persistence) -company_b_backend = create_backend(company_b_config.persistence) -# Each backend has independent lifecycle: connect → migrate → use → disconnect -``` - -#### Future: Runtime Backend Switching - -Runtime backend switching (e.g. migrating a company from SQLite to PostgreSQL during operation) is a planned future capability. The protocol-based design already supports this — the engine would disconnect the current backend, connect a new one with different config, and migrate. Implementation details (data migration tooling, zero-downtime switchover, connection draining) are deferred to the PostgreSQL backend implementation. - -### 7.7 Memory Injection Strategies - -Agent memory reaches agents through pluggable injection strategies behind -the `MemoryInjectionStrategy` protocol. The strategy determines *how* -memories are surfaced to the agent during execution. - -#### Strategy 1: Context Injection (Default / MVP) - -Pre-retrieves relevant memories before execution, ranks by -relevance+recency, enforces token budget, formats as ChatMessage(s) -injected between system prompt and task instruction. Agent passively -receives memories. - -> **Non-inferable filter:** Retrieved memories should be filtered before injection to exclude content the agent can discover by reading the codebase or environment. Only inject memories containing non-inferable information: prior decisions, learned conventions, interpersonal context, historical outcomes. [Research](https://arxiv.org/abs/2602.11988) shows generic context increases cost 20%+ with minimal success improvement; LLM-generated context can actually reduce success rates. -> -> **Decision ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D23):** Pluggable `MemoryFilterStrategy` protocol. Initial: tag-based at write time. Define `non-inferable` tag convention with advisory validation at `MemoryBackend.store()` boundary (warns on missing tags, never blocks). System prompt instructs agents what qualifies: design rationale, team decisions, "why not X", cross-repo knowledge = non-inferable; code structure, API signatures, file contents = inferable. Uses existing `MemoryMetadata.tags` and `MemoryQuery.tags` — zero new models needed. Future strategies: LLM classification at retrieval, keyword/pattern heuristic. - -Pipeline: `MemoryBackend.retrieve()` -> rank by relevance+recency -> -filter by min_relevance -> apply `MemoryFilterStrategy` (D23, optional) -> -greedy token-budget packing -> format as ChatMessage (configured role: -SYSTEM or USER) with delimiters. - -Ranking algorithm: -1. `relevance = entry.relevance_score ?? config.default_relevance` -2. Personal entries: `relevance = min(relevance + personal_boost, 1.0)` -3. `recency = exp(-decay_rate * age_hours)` -4. `combined = relevance_weight * relevance + recency_weight * recency` -5. Filter: `combined >= min_relevance` -6. Sort descending by `combined_score` - -Shared memories (from `SharedKnowledgeStore`) are fetched in parallel, -merged with personal memories (no personal_boost for shared), and -ranked together. - -#### Strategy 2: Tool-Based Retrieval (Future) - -Agent has `recall_memory` / `search_memory` tools it calls on-demand -during execution. Agent actively decides when and what to remember. -More token-efficient (only retrieves when needed) but consumes -tool-call turns and requires agent discipline to invoke. - -#### Strategy 3: Self-Editing Memory (Future) - -Agent has structured memory blocks (core, archival, recall) it reads -AND writes during execution via dedicated tools. Core memory always -in context, archival/recall searched via tools. Most sophisticated -(Letta/MemGPT-inspired) but highest complexity and LLM overhead. - -#### Protocol - -All strategies implement `MemoryInjectionStrategy`: -- `prepare_messages(agent_id, query_text, token_budget) -> tuple[ChatMessage, ...]` -- `get_tool_definitions() -> tuple[ToolDefinition, ...]` -- `strategy_name -> str` - -Strategy selection via config: `memory.retrieval.strategy: context | tool_based | self_editing` - ---- - -## 8. HR & Workforce Management - -> **Implementation note:** Hiring pipeline (`HiringService`), offboarding pipeline -> (`OffboardingService`), onboarding checklists (`OnboardingService`), and agent registry -> (`AgentRegistryService`) are now implemented. Performance tracking subsystem -> (`hr/performance/`) complete with pluggable quality scoring, collaboration scoring, -> trend detection, and multi-window aggregation. Promotions/demotions (section 8.4) -> are implemented in `hr/promotion/` — ThresholdEvaluator (D13), SeniorityApprovalStrategy -> (D14), SeniorityModelMapping (D15), PromotionService orchestrator. - -### 8.1 Hiring Process - -The HR system manages the agent workforce dynamically: - -1. HR agent (or human) identifies skill gap or workload issue -2. HR generates **candidate cards** based on team needs: - - What skills are underrepresented? - - What seniority level is needed? - - What personality would complement the team? - - What model/provider fits the budget? -3. Candidate cards are presented for approval (to CEO or human) -4. Approved candidates are instantiated and onboarded -5. Onboarding includes: company context, project briefing, team introductions. - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D8):** -> -> - **D8.1 — Source:** Templates + LLM customization. Templates for common roles (reuses existing template system §14.1). LLM generates config for novel roles not covered by templates. Approval gate catches invalid/bad configs before instantiation. -> - **D8.2 — Persistence:** Operational store via `PersistenceBackend` (§7.6). YAML stays as bootstrap seed — operational store wins for runtime state. Enables rehiring, auditable history. -> - **D8.3 — Hot-plug:** Agents are hot-pluggable at runtime via a dedicated company/registry service (not `AgentEngine`, which remains the per-agent task runner). Thread-safe registry, wired into message bus + tools + budget. - -### 8.2 Firing / Offboarding - -1. Triggered by: budget cuts, poor performance metrics, project completion, human decision -2. Agent's memory is archived (not deleted) -3. Active tasks are reassigned -4. Team is notified - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D9, D10):** -> -> - **D9 — Task Reassignment:** Pluggable `TaskReassignmentStrategy` protocol. Initial: queue-return — tasks return to unassigned queue, existing `TaskRoutingService` (§6.4) re-routes with priority boost for reassigned tasks. Future strategies: same-department/lowest-load, manager-decides (LLM), HR agent decides. -> - **D10 — Memory Archival:** Pluggable `MemoryArchivalStrategy` protocol. Initial: full snapshot, read-only. Pipeline: retrieve all → archive to `ArchivalStore` → selectively promote semantic+procedural to `OrgMemoryBackend` (rule-based) → clean hot store → mark TERMINATED. Rehiring = restore archived memories into new `AgentIdentity`. Future strategies: selective discard, full-accessible. - -### 8.3 Performance Tracking - -```yaml -agent_metrics: - tasks_completed: 42 - tasks_failed: 2 - average_quality_score: 8.5 # from code reviews, peer feedback - average_cost_per_task: 0.45 - average_completion_time: "2h" - collaboration_score: 7.8 # peer ratings - last_review_date: "2026-02-20" -``` - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D2, D3, D11, D12):** -> -> - **D2 — Quality Scoring:** Pluggable `QualityScoringStrategy` protocol. Initial: layered combination — (1) FREE: objective CI signals (test pass/fail, lint, coverage delta), (2) ~$1/day: small-model LLM judge (different family than agent) evaluates output vs acceptance criteria, (3) on-demand: human override via API, highest weight. Start with Layer 1 only; add layers incrementally. Future strategies: CI-only, LLM-only, human-only. -> - **D3 — Collaboration Scoring:** Pluggable `CollaborationScoringStrategy` protocol. Initial: automated behavioral telemetry — `collaboration_score = weighted_average(delegation_success_rate, delegation_response_latency, conflict_resolution_constructiveness, meeting_contribution_rate, loop_prevention_score, handoff_completeness)`. Weights configurable per-role. Optional: periodic LLM sampling (1%) for calibration + human override via API. Future strategies: LLM evaluation, peer ratings, human-provided. -> - **D11 — Rolling Windows:** Pluggable `MetricsWindowStrategy` protocol. Initial: multiple simultaneous windows — 7d (acute regressions), 30d (sustained patterns), 90d (baseline/drift). Min 5 data points per window; below that, report "insufficient data." Future strategies: fixed single window, per-metric configurable. -> - **D12 — Trend Detection:** Pluggable `TrendDetectionStrategy` protocol. Initial: Theil-Sen regression slope per window + configurable thresholds classify as improving/stable/declining. Theil-Sen has 29.3% outlier breakdown (tolerates ~1 in 3 bad data points). Min 5 data points. Future strategies: period-over-period, OLS regression, threshold-only. - -### 8.4 Promotions & Demotions - -Agents can move between seniority levels based on performance: -- Promotion criteria: sustained high quality scores, task complexity handled, peer feedback -- Demotion criteria: repeated failures, quality drops, cost inefficiency -- Promotions can unlock higher tool access levels (see Progressive Trust) -- Model upgrades/downgrades may accompany level changes (configurable) - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D13, D14, D15):** -> -> - **D13 — Promotion Criteria:** Pluggable `PromotionCriteriaStrategy` protocol. Initial: configurable threshold gates. `ThresholdEvaluator` with `min_criteria_met: int` (N of M) + `required_criteria: list[str]`. Setting min=total gives AND; min=1 gives OR. Default: junior→mid = 2 of 3 criteria, mid→senior = all. Future strategies: pure AND, pure OR. -> - **D14 — Promotion Approval:** Pluggable `PromotionApprovalStrategy` protocol. Initial: senior+ requires human approval. Junior→mid auto-promotes (low cost impact: small→medium ~4x). Demotions: auto-apply for cost-saving (model downgrade), human approval for authority-reducing demotions. Future strategies: all-human, configurable-per-level. -> - **D15 — Model Mapping:** Pluggable `ModelMappingStrategy` protocol. Initial: default ON — `hr.promotions.model_follows_seniority: true`. Model changes at task boundaries only (never mid-execution, consistent with auto-downgrade §10.4). Per-agent `preferred_model` overrides seniority default. Smart routing (§9.4) still uses cheap models for simple tasks regardless of seniority. Future strategies: always-applied, opt-in-only. - ---- - -## 9. Model Provider Layer - -### 9.1 Provider Abstraction - -```text -┌─────────────────────────────────────────────┐ -│ Unified Model Interface │ -│ completion(messages, tools, config) → resp │ -├───────────┬───────────┬───────────┬─────────┤ -│ Cloud API │OpenRouter │ Ollama │ Custom │ -│ Adapter │ Adapter │ Adapter │ Adapter │ -├───────────┼───────────┼───────────┼─────────┤ -│ Direct │ 400+ LLMs│ Local LLMs│ Any API │ -│ API call │ via OR │ Self-host │ │ -└───────────┴───────────┴───────────┴─────────┘ -``` - -### 9.2 Provider Configuration - -> Note: Model IDs, pricing, and provider examples below are **illustrative**. Actual models, costs, and provider availability will be determined during implementation and should be loaded dynamically from provider APIs where possible. - -```yaml -providers: - example-provider: - api_key: "${PROVIDER_API_KEY}" - models: # example entries — real list loaded from provider - - id: "example-large-001" - alias: "large" - cost_per_1k_input: 0.015 # illustrative, verify at implementation time - cost_per_1k_output: 0.075 - max_context: 200000 - estimated_latency_ms: 1500 # optional, used by fastest strategy - - id: "example-medium-001" - alias: "medium" - cost_per_1k_input: 0.003 - cost_per_1k_output: 0.015 - max_context: 200000 - estimated_latency_ms: 500 - - id: "example-small-001" - alias: "small" - cost_per_1k_input: 0.0008 - cost_per_1k_output: 0.004 - max_context: 200000 - estimated_latency_ms: 200 - - openrouter: - api_key: "${OPENROUTER_API_KEY}" - base_url: "https://openrouter.ai/api/v1" - models: # example entries - - id: "vendor-a/model-medium" - alias: "or-medium" - - id: "vendor-b/model-pro" - alias: "or-pro" - - id: "vendor-c/model-reasoning" - alias: "or-reasoning" - - ollama: - base_url: "http://localhost:11434" - models: # example entries - - id: "llama3.3:70b" - alias: "local-llama" - cost_per_1k_input: 0.0 # free, local - cost_per_1k_output: 0.0 - - id: "qwen2.5-coder:32b" - alias: "local-coder" - cost_per_1k_input: 0.0 - cost_per_1k_output: 0.0 -``` - -> **Implementation note:** `ProviderConfig` now includes `subscription: SubscriptionConfig` and `degradation: DegradationConfig` fields for per-provider quota limits and subscription-aware degradation behavior. The default degradation strategy is `ALERT` (raise `QuotaExhaustedError`). `FALLBACK` (route to fallback providers) and `QUEUE` (delay and retry) strategies are defined in the model but **not yet implemented** — the engine currently always raises on quota exhaustion regardless of strategy. Regular quota polling / proactive alerting before quotas are hit is deferred to a follow-up issue. - -### 9.3 LiteLLM Integration - -Use **LiteLLM** as the provider abstraction layer: -- Unified API across 100+ providers -- Built-in cost tracking -- Automatic retries and fallbacks -- Load balancing across providers -- OpenAI-compatible interface (all providers normalized) - -### 9.4 Model Routing Strategy - -```yaml -routing: - strategy: "smart" # smart, cheapest, fastest, role_based, cost_aware, manual - # Strategy behaviors: - # manual — resolve an explicit model override; fails if not set - # role_based — match agent seniority level to routing rules, then catalog default - # cost_aware — match task-type rules, then pick cheapest model within budget - # cheapest — alias for cost_aware - # fastest — match task-type rules, then pick fastest model (by estimated_latency_ms) - # within budget; falls back to cheapest when no latency data is available - # smart — priority cascade: override > task-type > role > seniority > cheapest > fallback chain - rules: - - role_level: "C-Suite" - preferred_model: "large" - fallback: "medium" - - role_level: "Senior" - preferred_model: "medium" - fallback: "small" - - role_level: "Junior" - preferred_model: "small" - fallback: "local-small" - - task_type: "code_review" - preferred_model: "medium" - - task_type: "documentation" - preferred_model: "small" - - task_type: "architecture" - preferred_model: "large" - fallback_chain: - - "example-provider" - - "openrouter" - - "ollama" -``` - ---- - -## 10. Cost & Budget Management - -### 10.1 Budget Hierarchy - -```text -Company Budget ($100/month) - ├── Engineering Dept (50%) ── $50 - │ ├── Backend Team (40%) ── $20 - │ ├── Frontend Team (30%) ── $15 - │ └── DevOps Team (30%) ── $15 - ├── Quality/QA (10%) ── $10 - ├── Product Dept (15%) ── $15 - ├── Operations (10%) ── $10 - └── Reserve (15%) ── $15 -``` - -> Note: Percentages are illustrative defaults. All allocations are configurable per company. - -### 10.2 Cost Tracking - -Every API call is tracked (illustrative schema): - -```json -{ - "agent_id": "sarah_chen", - "task_id": "task-123", - "provider": "example-provider", - "model": "example-medium-001", - "input_tokens": 4500, - "output_tokens": 1200, - "cost_usd": 0.0315, - "timestamp": "2026-02-27T10:30:00Z" -} -``` - -> **Implementation note:** `CostRecord` stores `input_tokens` and `output_tokens`; `total_tokens` is not stored on `CostRecord` — it is a `@computed_field` property on `TokenUsage` (the model embedded in `CompletionResponse`). `_SpendingTotals` base class provides shared `total_cost_usd`, `total_input_tokens`, `total_output_tokens`, and `record_count` fields. `AgentSpending`, `DepartmentSpending`, and `PeriodSpending` extend it with their dimension-specific fields. - -### 10.3 CFO Agent Responsibilities - -> **Current state:** Budget tracking, per-task cost recording, and cost controls (§10.4) are enforced by `BudgetEnforcer` (a service the engine composes, not an agent). CFO cost optimization is implemented via `CostOptimizer`. - -The CFO agent (when enabled) acts as a cost management system: - -- Monitors real-time spending across all agents -- Alerts when departments approach budget limits -- Suggests model downgrades when budget is tight -- Reports daily/weekly spending summaries -- Recommends hiring/firing based on cost efficiency -- Blocks tasks that would exceed remaining budget -- Optimizes model routing for cost/quality balance - -> **Implementation note:** `CostOptimizer` service (`budget/optimizer.py`) -> implements anomaly detection (sigma + spike factor), per-agent efficiency -> analysis, model downgrade recommendations (via `ModelResolver`), routing -> optimization suggestions (cost + context-window comparison), and operation -> approval evaluation. `ReportGenerator` service (`budget/reports.py`) -> produces multi-dimensional spending reports with task/provider/model -> breakdowns and period-over-period comparison. - -### 10.4 Cost Controls - -> **Minimal config:** -> -> ```yaml -> budget: -> total_monthly: 100.00 -> ``` -> -> All other fields below have sensible defaults. - -```yaml -budget: - total_monthly: 100.00 - reset_day: 1 - alerts: - warn_at: 75 # percent - critical_at: 90 - hard_stop_at: 100 - per_task_limit: 5.00 - per_agent_daily_limit: 10.00 - auto_downgrade: - enabled: true - threshold: 85 # percent of budget used - boundary: "task_assignment" # task_assignment only — NEVER mid-execution - downgrade_map: # ordered pairs — aliases reference configured models - - ["large", "medium"] - - ["medium", "small"] - - ["small", "local-small"] -``` - -> **Auto-downgrade boundary:** Model downgrades apply only at **task assignment time**, never mid-execution. An agent halfway through an architecture review cannot be switched to a cheaper model — the task completes on its assigned model. The next task assignment respects the downgrade threshold. This prevents quality degradation from mid-thought model switches. - -> **Implementation note:** `BudgetEnforcer` composes `CostTracker` + -> `BudgetConfig` + optional `QuotaTracker` + optional `ModelResolver` to -> provide three enforcement layers: (1) pre-flight checks via -> `check_can_execute` (monthly hard stop + per-agent daily limit + provider -> quota enforcement when `QuotaTracker` is present), (2) in-flight budget -> checking via a sync `BudgetChecker` closure with pre-computed baselines -> (task + monthly + daily limits, alert deduplication), and (3) -> task-boundary auto-downgrade via `resolve_model`. Billing periods are -> scoped by `billing_period_start(reset_day)`. `DailyLimitExceededError` -> is a subclass of `BudgetExhaustedError` for granular error handling. - -### 10.5 LLM Call Analytics - -> **Current state:** Proxy metrics, call categorization + coordination metric data models, and error taxonomy classification pipeline are implemented. Runtime collection pipeline for coordination metrics and full analytics layer are planned. - -Every LLM provider call is tracked with comprehensive metadata for financial reporting, debugging, and orchestration overhead analysis. The analytics system builds incrementally. - -#### Per-Call Tracking + Proxy Overhead Metrics - -Every completion call produces a `CompletionResponse` with `TokenUsage` (token counts and cost). The engine layer creates a `CostRecord` (with agent/task context) and records it into `CostTracker` — the provider itself does not have agent/task context. The engine additionally logs **proxy overhead metrics** at task completion: - -- `turns_per_task` — number of LLM turns to complete the task (from `AgentRunResult.total_turns`) -- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost.total_tokens`) -- `cost_per_task` — total USD cost (from `AgentContext.accumulated_cost.cost_usd` via `AgentRunResult.total_cost_usd`) -- `duration_seconds` — wall-clock execution time in seconds (from `AgentRunResult.duration_seconds`) -- `prompt_tokens` — estimated system prompt tokens (from `SystemPrompt.estimated_tokens`) -- `prompt_token_ratio` — ratio of prompt tokens to total tokens (overhead indicator, `@computed_field`; warns when >0.3) - -These are natural overhead indicators — a task consuming 15 turns and 50k tokens for a one-line fix signals a problem. - -These metrics are captured in `TaskCompletionMetrics` (in `engine/metrics.py`), a frozen Pydantic model with a `from_run_result()` factory method. The engine logs these metrics at task completion via the `EXECUTION_ENGINE_TASK_METRICS` event. - -#### Call Categorization + Orchestration Ratio - -> **Current state:** Data models (`LLMCallCategory`, `CategoryBreakdown`, `OrchestrationRatio`, `CostRecord.call_category`) and query methods (`CostTracker.get_category_breakdown`, `get_orchestration_ratio`) are implemented. Runtime categorization logic (automatic tagging of calls during multi-agent execution) is planned. - -When multi-agent coordination exists, each `CostRecord` is tagged with a **call category**: - -| Category | Description | Examples | -|----------|-------------|---------| -| `productive` | Direct task work — tool calls, code generation, task output | Agent writing code, running tests | -| `coordination` | Inter-agent communication — delegation, reviews, meetings | Manager reviewing work, agent presenting in meeting | -| `system` | Framework overhead — system prompt injection, context loading | Initial prompt, memory retrieval injection | - -The **orchestration ratio** (`coordination / total`) is surfaced in metrics and alerts. If coordination tokens consistently exceed productive tokens, the company configuration needs tuning (fewer approval layers, simpler meeting protocols, etc.). - -#### Coordination Metrics Suite - -A comprehensive suite of coordination metrics derived from empirical agent scaling research ([Kim et al., 2025](https://arxiv.org/abs/2512.08296)). These metrics explain coordination dynamics and enable data-driven tuning of multi-agent configurations. - -| Metric | Symbol | Definition | What It Signals | -|--------|--------|------------|-----------------| -| **Coordination efficiency** | `Ec` | `success_rate / (turns / turns_sas)` — success normalized by relative turn count vs single-agent baseline | Overall coordination ROI. Low Ec = coordination costs exceed benefits | -| **Coordination overhead** | `O%` | `(turns_mas - turns_sas) / turns_sas × 100%` — relative turn increase | Communication cost. Optimal band: 200–300%. Above 400% = over-coordination | -| **Error amplification** | `Ae` | `error_rate_mas / error_rate_sas` — relative failure probability | Whether MAS corrects or propagates errors. Centralized ≈ 4.4×, Independent ≈ 17.2× | -| **Message density** | `c` | Inter-agent messages per reasoning turn | Communication intensity. Performance saturates at ≈ 0.39 messages/turn | -| **Redundancy rate** | `R` | Mean cosine similarity of agent output embeddings | Agent agreement. Optimal at ≈ 0.41 (balances fusion with independence) | - -> **Configurable collection:** All 5 metrics are opt-in via `coordination_metrics.enabled` in analytics config. `Ec` and `O%` are cheap (turn counting). `Ae` requires baseline comparison data. `c` and `R` require semantic analysis of agent outputs (embedding computation). Enable selectively based on data-gathering needs. - -```yaml -coordination_metrics: - enabled: false # opt-in — enable for data gathering - collect: - - efficiency # cheap — turn counting - - overhead # cheap — turn counting - - error_amplification # requires SAS baseline data - - message_density # requires message counting infrastructure - - redundancy # requires embedding computation on outputs - baseline_window: 50 # number of SAS runs to establish baseline for Ae - error_taxonomy: - enabled: false # opt-in — enable for targeted diagnosis - categories: - - logical_contradiction - - numerical_drift - - context_omission - - coordination_failure -``` - -#### Full Analytics Layer (Planned) - -Expanded per-call metadata for comprehensive financial and operational reporting: - -```yaml -call_analytics: - track: - - call_category # productive, coordination, system - - success # true/false - - retry_count # 0 = first attempt succeeded - - retry_reason # rate_limit, timeout, internal_error - - latency_ms # wall-clock time for the call (not estimated_latency_ms from config) - - finish_reason # stop, tool_use, max_tokens, error - - cache_hit # prompt caching hit/miss (provider-dependent) - aggregation: - - per_agent_daily # agent spending over time - - per_task # total cost per task - - per_department # department-level rollups - - per_provider # provider reliability and cost comparison - - orchestration_ratio # coordination vs productive tokens - alerts: - orchestration_ratio: - info: 0.30 # info if coordination > 30% of total - warn: 0.50 # warn if coordination > 50% of total - critical: 0.70 # critical if coordination > 70% of total - retry_rate_warn: 0.1 # warn if > 10% of calls need retries -``` - -> **Design principle:** Analytics metadata is append-only and never blocks execution. Failed analytics writes are logged and skipped — the agent's task is never delayed by telemetry. All analytics data flows through the existing `CostRecord` and structured logging infrastructure. - -#### Coordination Error Taxonomy - -> **Current state:** Error taxonomy classification pipeline is implemented in `engine/classification/`. Four heuristic-based detectors (logical contradiction, numerical drift, context omission, coordination failure) run post-execution when enabled via `error_taxonomy_config`. Integrated into `AgentEngine`. Classification results are log-only; programmatic access is planned. Full semantic analysis detectors are planned. - -When coordination metrics collection is enabled, the system can optionally classify coordination errors into structured categories. This enables targeted diagnosis — e.g., if coordination failures spike, the topology may be too complex; if context omissions spike, the orchestrator's synthesis is insufficient. - -| Error Category | Description | Detection Method | -|---------------|-------------|-----------------| -| **Logical contradiction** | Agent asserts both "X is true" and "X is false", or derives conclusions violating its stated premises | Semantic contradiction detection on agent outputs | -| **Numerical drift** | Accumulated computational errors from cascading rounding or unit conversion (>5% deviation) | Numerical comparison against ground truth or cross-agent verification | -| **Context omission** | Failure to reference previously established entities, relationships, or state required for current reasoning | Missing-reference detection across agent conversation history | -| **Coordination failure** | MAS-specific: message misinterpretation, task allocation conflicts, state synchronization errors between agents | Protocol-level error detection in orchestration layer | - -> **Configurable and opt-in:** Error taxonomy classification requires semantic analysis of agent outputs and is expensive. Enable via `coordination_metrics.error_taxonomy.enabled: true` only when actively gathering data for system tuning. The classification pipeline runs post-execution (never blocks agent work) and logs structured events to the observability layer. This configuration is part of the main `coordination_metrics` block defined in the Coordination Metrics Suite section above. - -> **Reference:** Error categories derived from [Kim et al., 2025](https://arxiv.org/abs/2512.08296) and the Multi-Agent System Failure Taxonomy (MAST) by Cemri et al. (2025). Architecture-specific patterns: centralized coordination reduces logical contradictions by 36.4% and context omissions by 66.8% via orchestrator synthesis; hybrid topology introduces 12.4% coordination failures due to protocol complexity. - ---- - -## 11. Tool & Capability System - -### 11.1 Tool Categories - -| Category | Tools | Typical Roles | -|----------|-------|---------------| -| **File System** | Read, write, edit, list, delete files | All developers, writers | -| **Code Execution** | Run code in sandboxed environments | Developers, QA | -| **Version Control** | Git operations, PR management | Developers, DevOps | -| **Web** | HTTP requests, web scraping, search | Researchers, analysts | -| **Database** | Query, migrate, admin | Backend devs, DBAs | -| **Terminal** | Shell commands (sandboxed) | DevOps, senior devs | -| **Design** | Image generation, mockup tools | Designers | -| **Communication** | Email, Slack, notifications | PMs, executives | -| **Analytics** | Metrics, dashboards, reporting | Data analysts, CFO | -| **Deployment** | CI/CD, container management | DevOps, SRE | -| **MCP Servers** | Any MCP-compatible tool | Configurable per agent | - -### 11.1.1 Tool Execution Model - -When the LLM requests multiple tool calls in a single turn, `ToolInvoker.invoke_all` executes them **concurrently** using `asyncio.TaskGroup`. An optional `max_concurrency` parameter (default unbounded) limits parallelism via `asyncio.Semaphore`. Recoverable errors are captured as `ToolResult(is_error=True)` without aborting sibling invocations; non-recoverable errors (`MemoryError`, `RecursionError`) are collected and re-raised after all tasks complete (bare exception for one, `ExceptionGroup` for multiple). - -`BaseTool.parameters_schema` deep-copies the caller-supplied schema at construction and wraps it in `MappingProxyType` for read-only enforcement; the property returns a deep copy on access to prevent mutation of internal state. `ToolInvoker` deep-copies arguments at the tool execution boundary before passing them to `tool.execute()`. `MappingProxyType` wrapping is also used in `ToolRegistry` for its internal collections. - -**Permission checking:** Each `BaseTool` carries a `category: ToolCategory` attribute used for access-level gating. `ToolInvoker` accepts an optional `ToolPermissionChecker` which enforces the agent's `ToolPermissions.access_level` (see §11.2). Permission checking occurs after tool lookup but before parameter validation: - -1. `get_permitted_definitions()` filters tool definitions sent to the LLM — the agent only sees tools it is permitted to use. -2. At invocation time, denied tools return `ToolResult(is_error=True)` with a descriptive denial reason (defense-in-depth against LLM hallucinating unpresented tools). - -The `ToolPermissionChecker` resolves permissions using a priority-based system: denied list (highest) → allowed list → access-level categories → deny (default). `AgentEngine._make_tool_invoker()` creates a permission-aware invoker from the agent's `ToolPermissions` at the start of each `run()` call. Note: the current implementation provides category-level gating only; the granular sub-constraints described in §11.2 (workspace scope, network mode) are planned for when sandboxing is implemented. - -> **Implementation note — Built-in git tools:** Six workspace-scoped git tools are implemented in `tools/git_tools.py` with a shared `_BaseGitTool` base class in `tools/_git_base.py`: `GitStatusTool`, `GitLogTool`, `GitDiffTool`, `GitBranchTool`, `GitCommitTool`, and `GitCloneTool`. The base class enforces workspace boundary security (path traversal prevention via `resolve()` + `relative_to()`) and provides a common `_run_git()` helper using `asyncio.create_subprocess_exec` (never `shell=True`). Security hardening includes: `GIT_TERMINAL_PROMPT=0` to prevent credential prompts, `GIT_CONFIG_NOSYSTEM=1`, `GIT_CONFIG_GLOBAL=os.devnull`, and `GIT_PROTOCOL_FROM_USER=0` to restrict config/protocol attack surfaces, rejection of flag-like argument values (starting with `-`) for refs, branch names, author filters, date strings, and other git arguments, URL scheme validation on clone (only `https://`, `ssh://`, `git://`, and SCP-like syntax — plain `http://` rejected for security) with `--` separator before positional URL argument, and clone URLs starting with `-` are rejected. All tools return `ToolExecutionResult` for errors rather than raising exceptions. When a `SandboxBackend` is injected, `_run_git()` delegates subprocess management to the sandbox via `_run_git_sandboxed()` — the sandbox handles environment filtering and workspace-scoped cwd enforcement, while `_validate_path` independently enforces workspace boundaries for git path arguments. Git hardening env vars are passed as `env_overrides` to the sandbox, and `SandboxResult` is converted to `ToolExecutionResult` via `_sandbox_result_to_execution_result`. Without a sandbox, the direct-subprocess path is used (backward compatible). Both paths explicitly close the subprocess transport on Windows (via `tools/_process_cleanup.py`) to prevent `ResourceWarning` on `ProactorEventLoop`. **Future:** Consider adding host/IP allowlisting for clone URLs to prevent SSRF against internal networks (loopback, link-local, private ranges). - -### 11.1.2 Tool Sandboxing - -Tool execution requires safety boundaries proportional to the risk of each tool category. The framework uses a **layered sandboxing strategy** with a pluggable `SandboxBackend` protocol — new backends can be added without modifying existing ones. The default configuration uses lighter isolation for low-risk tools and stronger isolation for high-risk tools. - -> **MVP: Subprocess sandbox for file/git tools. Docker optional for code execution.** K8s is future. -> -> **Decision ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D16):** Docker MVP only via `aiodocker` (async-native, Python 3.14 support). Pre-built image (Python 3.14 + Node.js LTS + basic utils, <500MB) + user-configurable via `docker.image` config. **Fail with clear error** if Docker unavailable — no unsafe subprocess fallback for code execution (file/git tools already use `SubprocessSandbox`). gVisor (`--runtime=runsc`) as free config-level hardening upgrade. WASM/Firecracker evaluation planned. `SandboxBackend` protocol makes adding backends trivial. - -#### Sandbox Backends - -| Backend | Isolation | Latency | Dependencies | Status | -|---------|-----------|---------|--------------|--------| -| `SubprocessSandbox` | Process-level: env filtering (allowlist + denylist), restricted PATH (configurable via `extra_safe_path_prefixes`), workspace-scoped cwd, timeout + process-group kill, library injection var blocking, explicit transport cleanup on Windows | ~ms | None | **Implemented** | -| `DockerSandbox` | Container-level: ephemeral container, mounted workspace, no network, resource limits (CPU/memory/time) | ~1-2s cold start | Docker | **Implemented** | -| `K8sSandbox` | Pod-level: per-agent containers, namespace isolation, resource quotas, network policies | ~2-5s | Kubernetes | Future | - -#### Default Layered Configuration - -```yaml -sandboxing: - default_backend: "subprocess" # subprocess, docker, k8s - overrides: # per-category backend overrides - file_system: "subprocess" # low risk — fast, no deps - git: "subprocess" # low risk — workspace-scoped - web: "docker" # medium risk — needs network isolation - code_execution: "docker" # high risk — strong isolation required - terminal: "docker" # high risk — arbitrary commands - database: "docker" # high risk — data mutation; see network note below - subprocess: - timeout_seconds: 30 - workspace_only: true # restrict filesystem access to project dir - restricted_path: true # strip dangerous binaries from PATH - docker: - image: "synthorg-sandbox:latest" # pre-built image with common runtimes - network: "none" # no network by default; per-category overrides below - network_overrides: # category-specific network policies - database: "bridge" # database tools need TCP access to DB host - web: "egress-only" # web tools need outbound HTTP; no inbound - allowed_hosts: [] # allowlist of host:port pairs (e.g. ["db:5432"]) - memory_limit: "512m" - cpu_limit: "1.0" - timeout_seconds: 120 - mount_mode: "ro" # read-only by default; workspace mounted separately - auto_remove: true # ephemeral — container removed after execution - k8s: # future — per-agent pod isolation - namespace: "synthorg-agents" - resource_requests: - cpu: "250m" - memory: "256Mi" - resource_limits: - cpu: "1" - memory: "1Gi" - network_policy: "deny-all" # default deny, allowlist per tool -``` - -> **User experience:** Docker is optional — only required when code execution, terminal, web, or database tools are enabled. File system and git tools work out of the box with subprocess isolation. This keeps the "local first" experience lightweight while providing strong isolation where it matters. - -> **Scaling path:** In a future Kubernetes deployment (§18.2 Phase 3-4), each agent can run in its own pod via `K8sSandbox`. At that point, the layered configuration becomes less relevant — all tools execute within the agent's isolated pod. The `SandboxBackend` protocol makes this transition seamless. - -### 11.1.3 MCP Integration - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D17, D18):** -> -> - **D17 — MCP SDK:** Official `mcp` Python SDK, pinned `==1.26.0`. Thin `MCPBridgeTool` adapter layer isolates the rest of the codebase from SDK API changes. Support **stdio** (local/dev) and **Streamable HTTP** (remote/production) transports. Skip deprecated SSE. v2 migration planned — pin range prevents accidental breaking upgrade. -> - **D18 — MCP Result Mapping:** Adapter in `MCPBridgeTool` keeps `ToolResult` as-is. Mapping: text blocks → concatenate to `content: str`; image/audio → `[image: {mimeType}]` placeholder + base64 in `metadata["attachments"]`; `structuredContent` → `metadata["structured_content"]`; `isError` → `is_error` (1:1). Future: extend `ToolResult` with optional `attachments` when multi-modal LLM tool results are needed. - -### 11.1.4 Action Type System - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D1):** -> -> Action types classify agent actions for use by autonomy presets (§12.2), SecOps validation (§12.3), tiered timeout policies (§12.4), and progressive trust (§11.3). Three sub-decisions: -> -> - **D1.1 — Registry:** `StrEnum` for ~25 built-in action types (type safety, autocomplete, typos caught at compile time) + `ActionTypeRegistry` for custom types via explicit registration. Unknown strings rejected at config load time. Critical for security — a typo in `human_approval` list silently means "skip approval." -> - **D1.2 — Granularity:** Two-level `category:action` hierarchy. Category shortcuts: `auto_approve: ["code"]` expands to all `code:*` actions. Fine-grained: `human_approval: ["code:create"]`. -> -> **Proposed taxonomy (~25 leaf types):** -> -> ```text -> code:read, code:write, code:create, code:delete, code:refactor -> test:write, test:run -> docs:write -> vcs:read, vcs:commit, vcs:push, vcs:branch -> deploy:staging, deploy:production -> comms:internal, comms:external -> budget:spend, budget:exceed -> org:hire, org:fire, org:promote -> db:query, db:mutate, db:admin -> arch:decide -> ``` -> -> - **D1.3 — Classification:** Static tool metadata. Each `BaseTool` declares its `action_type`. Default mapping from `ToolCategory` → action type. Non-tool actions (`org:hire`, `budget:spend`) triggered by engine-level operations. No LLM in the security classification path. - -### 11.2 Tool Access Levels - -```yaml -tool_access: - levels: - sandboxed: - description: "No external access. Isolated workspace." - file_system: "workspace_only" - code_execution: "containerized" - network: "none" - git: "local_only" - - restricted: - description: "Limited external access with approval." - file_system: "project_directory" - code_execution: "containerized" - network: "allowlist_only" - git: "read_and_branch" - requires_approval: ["deployment", "database_write"] - - standard: - description: "Normal development access." - file_system: "project_directory" - code_execution: "containerized" - network: "open" - git: "full" - terminal: "restricted_commands" - - elevated: - description: "Full access for senior/trusted agents." - file_system: "full" - code_execution: "host" - network: "open" - git: "full" - terminal: "full" - deployment: true - - custom: - description: "Per-agent custom configuration." -``` - -> **Implementation note:** The current `ToolPermissionChecker` implements **category-level gating only** — each access level maps to a set of permitted `ToolCategory` values (e.g., `STANDARD` permits `file_system`, `code_execution`, `version_control`, `web`, `terminal`, `analytics`). `SubprocessSandbox` provides workspace-scoped cwd enforcement and env filtering (see §11.1.2). The granular sub-constraints shown above (network mode, containerization) are planned for Docker/K8s sandbox backends. - -### 11.3 Progressive Trust - -Agents can earn higher tool access over time through configurable trust strategies. The trust system implements a `TrustStrategy` protocol, making it extensible. Multiple strategies are available, selectable via config. - -> **Current state:** All four strategies are implemented behind the `TrustStrategy` protocol: `DisabledTrustStrategy`, `WeightedTrustStrategy`, `PerCategoryTrustStrategy`, `MilestoneTrustStrategy`. Default is disabled (static access) — agents receive their configured access level at hire time. -> -> **Security invariant (all strategies):** The `standard_to_elevated` promotion **always** requires human approval. No agent can auto-gain production access regardless of trust strategy. - -#### Strategy: Disabled (Static Access) — Default - -Trust is disabled. Agents receive their configured access level at hire time and it never changes. Simplest option — useful when the human manages permissions manually. - -```yaml -trust: - strategy: "disabled" # disabled, weighted, per_category, milestone - initial_level: "standard" # fixed access level for all agents -``` - -#### Strategy: Weighted Score (Single Track) - -A single trust score computed from weighted factors: task difficulty completed, error rate, time active, and human feedback. One global trust level per agent, applied to all tool categories. - -```yaml -trust: - strategy: "weighted" - initial_level: "sandboxed" - weights: - task_difficulty: 0.3 # harder tasks completed = more trust - completion_rate: 0.25 - error_rate: 0.25 # inverse — fewer errors = more trust - human_feedback: 0.2 - promotion_thresholds: - sandboxed_to_restricted: 0.4 - restricted_to_standard: 0.6 - standard_to_elevated: - score: 0.8 - requires_human_approval: true # always human-gated -``` - -- Simple model, easy to understand. One number to track -- Too coarse — an agent trusted for file edits shouldn't auto-get deployment access - -#### Strategy: Per-Category Trust Tracks - -Separate trust tracks per tool category (filesystem, git, deployment, database, network). An agent can be "standard" for files but "sandboxed" for deployment. Promotion criteria differ per category. Human approval gate required for any production-touching category. - -```yaml -trust: - strategy: "per_category" - initial_levels: - file_system: "restricted" - git: "restricted" - code_execution: "sandboxed" - deployment: "sandboxed" - database: "sandboxed" - terminal: "sandboxed" - promotion_criteria: - file_system: - restricted_to_standard: - tasks_completed: 10 - quality_score_min: 7.0 - deployment: - sandboxed_to_restricted: - tasks_completed: 20 - quality_score_min: 8.5 - requires_human_approval: true # always human-gated for deployment -``` - -- Granular. Matches real security models (IAM roles). Prevents gaming via easy tasks -- More complex data model. Trust state is a matrix per agent, not a scalar - -#### Strategy: Milestone Gates (ATF-Inspired) - -Explicit capability milestones aligned with the Cloud Security Alliance Agentic Trust Framework. Automated promotion for low-risk levels. Human approval gates for elevated access. Trust is time-bound and subject to periodic re-verification — trust decays if the agent is idle for extended periods or error rate increases. - -```yaml -trust: - strategy: "milestone" - initial_level: "sandboxed" - milestones: - sandboxed_to_restricted: - tasks_completed: 5 - quality_score_min: 7.0 - auto_promote: true # no human needed - restricted_to_standard: - tasks_completed: 20 - quality_score_min: 8.0 - time_active_days: 7 - auto_promote: true - standard_to_elevated: - requires_human_approval: true # always human-gated - clean_history_days: 14 # no errors in last 14 days - re_verification: - enabled: true - interval_days: 90 # re-verify every 90 days - decay_on_idle_days: 30 # demote one level if idle 30+ days - decay_on_error_rate: 0.15 # demote if error rate exceeds 15% -``` - -- Industry-aligned. Re-verification prevents stale trust. Human gates where it matters -- Most complex. Trust decay may need tuning to avoid frustrating users - ---- - -## 12. Security & Approval System - -### 12.1 Approval Workflow - -```text - ┌──────────────┐ - │ Task/Action │ - └──────┬───────┘ - │ - ┌──────▼───────┐ - │ Security Ops │ - │ Agent │ - └──────┬───────┘ - ╱ ╲ - ┌─────▼─┐ ┌───▼────┐ - │APPROVE │ │ DENY │ - │(auto) │ │+ reason│ - └────┬───┘ └───┬────┘ - │ │ - Execute ┌───▼────────┐ - │ Human Queue │ - │ (Dashboard) │ - └───┬────────┘ - ╱ ╲ - ┌─────▼─┐ ┌───▼──────┐ - │Override│ │Alternative│ - │Approve │ │Suggested │ - └────────┘ └──────────┘ -``` - -### 12.2 Autonomy Levels - -> **Planned minimal config (not yet implemented — current schema uses a float):** -> -> ```yaml -> autonomy: -> level: "semi" -> ``` -> -> All presets below are built-in. Most users only set the level. - -```yaml -autonomy: - level: "semi" # full, semi, supervised, locked - presets: - full: - description: "Agents work independently. Human notified of results only." - auto_approve: ["all"] - human_approval: [] - - semi: - description: "Most work is autonomous. Major decisions need approval." - auto_approve: ["code", "test", "docs", "comms:internal"] - human_approval: ["deploy", "comms:external", "budget:exceed", "org:hire"] - security_agent: true - - supervised: - description: "Human approves major steps. Agents handle details." - auto_approve: ["code:write", "comms:internal"] - human_approval: ["arch", "code:create", "deploy", "vcs:push"] - security_agent: true - - locked: - description: "Human must approve every action." - auto_approve: [] - human_approval: ["all"] - security_agent: true # still runs for audit logging, but human is approval authority -``` - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D6, D7):** -> -> - **D6 — Autonomy Scope:** Three-level resolution chain: per-agent → per-department → company default. Optional `autonomy_level` on `AgentIdentity` and department config. Resolution: `agent.autonomy_level or department.autonomy_level or company.autonomy.level`. Seniority validation: Juniors/Interns cannot be set to `full`. -> - **D7 — Autonomy Changes at Runtime:** Pluggable `AutonomyChangeStrategy` protocol. Initial: **(a+c hybrid)** — human-only promotion via REST API (no agent including CEO can escalate privileges) **plus** automatic downgrade on: high error rate → one level down, budget exhausted → supervised, security incident → locked. Recovery from auto-downgrade: human-only. Precedent: no real-world security system automatically grants higher privileges. Future strategies: fully configurable conditions. - -### 12.3 Security Operations Agent - -A special meta-agent that reviews all actions before execution: - -- Evaluates safety of proposed actions -- Checks for data leaks, credential exposure, destructive operations -- Validates actions against company policies -- Maintains an audit log of all approvals/denials -- Escalates uncertain cases to human queue with explanation -- **Cannot be overridden by other agents** (only human can override) - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D4, D5):** -> -> - **D4 — LLM vs Rule-based:** Hybrid approach. Rule engine for known patterns (credentials, path traversal, destructive ops) — sub-ms, covers ~95% of cases. LLM fallback only for uncertain cases (~5%). Full autonomy mode: rules + audit logging only, no LLM path. Hard safety rules (credential exposure, data destruction) **never bypass** regardless of autonomy level. Precedent: AWS GuardDuty, LlamaFirewall, NeMo Guardrails all use hybrid. -> - **D5 — Integration Point:** Pluggable `SecurityInterceptionStrategy` protocol. Initial: before every tool invocation — slots into existing `ToolInvoker` between permission check and tool execution. Policy strictness (not interception point) configurable per autonomy level. Add post-tool-call scanning for sensitive data in outputs. Performance: sub-ms rule check is invisible against seconds of LLM inference. Future strategies: batch-level (before task step), assignment-only. - -#### Output Scan Response Policies - -After the output scanner detects sensitive data, a pluggable **`OutputScanResponsePolicy`** protocol decides how to handle the findings. Four built-in policies ship behind the protocol: - -| Policy | Behavior | Default for | -|--------|----------|-------------| -| **Redact** (default) | Return scanner's redacted content as-is | `SEMI`, `SUPERVISED` autonomy | -| **Withhold** | Clear redacted content — fail-closed, no partial data returned | `LOCKED` autonomy | -| **Log-only** | Discard findings (logs at WARNING), pass original output through | `FULL` autonomy | -| **Autonomy-tiered** | Delegate to a sub-policy based on effective autonomy level | Composite policy | - -Policy selection is declarative via `SecurityConfig.output_scan_policy_type` (`OutputScanPolicyType` enum). A factory function (`build_output_scan_policy`) resolves the enum to a concrete policy instance. Runtime constructor injection on `SecOpsService` is also supported for full flexibility. The policy is applied *after* audit recording, preserving audit fidelity regardless of policy outcome. - -### 12.4 Approval Timeout Policy - -When an action requires human approval (per autonomy level in §12.2), the agent must wait. The framework provides configurable timeout policies that determine what happens when a human doesn't respond. All policies implement a `TimeoutPolicy` protocol. The policy is configurable per autonomy level and per action risk tier. - -> **Current state:** All four timeout policies are implemented: `WaitForeverPolicy`, `AutoDenyPolicy`, `TieredPolicy`, `EscalationChainPolicy`. Park/resume service, risk tier classifier, and timeout checker are complete. - -During any wait — regardless of policy — the agent **parks** the blocked task (saving its full serialized `AgentContext` state: conversation, progress, accumulated cost, turn count — i.e., the complete persisted context, distinct from the compact `AgentContextSnapshot` used for telemetry) and picks up other available tasks from its queue. When approval eventually arrives, the agent **resumes** the original context exactly where it left off. This mirrors real company behavior: a junior developer starts another task while waiting for a code review, then returns to the original work when feedback arrives. - -#### Policy 1: Wait Forever (Default for Critical Actions) - -The action stays in the human queue indefinitely. No timeout, no auto-resolution. The agent is aware the task is parked awaiting approval and works on other tasks in the meantime. - -```yaml -approval_timeout: - policy: "wait" # wait, deny, tiered, escalation -``` - -- Safest — no risk of unauthorized actions. Mirrors "awaiting review" in real workflows -- Can stall tasks indefinitely if human is unavailable. Queue can grow unbounded - -#### Policy 2: Deny on Timeout - -All unapproved actions auto-deny after a configurable timeout. The agent receives a denial reason ("approval timeout — human did not respond within window") and can retry with a different approach or escalate explicitly. - -```yaml -approval_timeout: - policy: "deny" - timeout_minutes: 240 # 4 hours -``` - -- Industry consensus default ("fail closed"). Agent learns to prefer auto-approvable paths -- May stall legitimate work if human is consistently slow - -#### Policy 3: Tiered Timeout - -Different timeout behavior based on action risk level. Low-risk actions auto-approve after a short wait. Medium-risk actions auto-deny. High-risk/security-critical actions wait forever. - -```yaml -approval_timeout: - policy: "tiered" - tiers: - low_risk: - timeout_minutes: 60 - on_timeout: "approve" # auto-approve low-risk after 1 hour - actions: ["code:write", "comms:internal", "test"] - medium_risk: - timeout_minutes: 240 - on_timeout: "deny" # auto-deny medium-risk after 4 hours - actions: ["code:create", "vcs:push", "arch:decide"] - high_risk: - timeout_minutes: null # wait forever - on_timeout: "wait" - actions: ["deploy", "db:admin", "comms:external", "org:hire"] -``` - -- Pragmatic — low-risk stuff doesn't stall, critical stuff stays safe -- Auto-approve on timeout carries risk. Tuning tier boundaries requires experience - -#### Policy 4: Escalation Chain - -On timeout, the approval request escalates to the next human in a configured chain (e.g., primary reviewer → manager → VP → board). If the entire chain times out, the action is denied. - -```yaml -approval_timeout: - policy: "escalation" - chain: - - role: "direct_manager" - timeout_minutes: 120 - - role: "department_head" - timeout_minutes: 240 - - role: "ceo_or_board" - timeout_minutes: 480 - on_chain_exhausted: "deny" # deny if entire chain times out -``` - -- Mirrors real orgs — if your boss is out, their boss covers. Multiple chances for approval -- Requires configuring an escalation chain. More humans involved. Complex to implement - -> **Task Suspension and Resumption:** The park/resume mechanism relies on `AgentContext` snapshots (frozen Pydantic models). When a task is parked, the full context is persisted. When approval arrives, the framework loads the snapshot, restores the agent's conversation and state, and resumes execution from the exact point of suspension. This works naturally with the `model_copy(update=...)` immutability pattern — the snapshot is a complete, self-contained state. - -> **Decisions ([ADR-002](docs/decisions/ADR-002-design-decisions-batch-1.md) D19, D20, D21):** -> -> - **D19 — Risk Tier Classification:** Pluggable `RiskTierClassifier` protocol. Initial: configurable YAML mapping — `RiskTierMapping` config model with `dict[str, ApprovalRiskLevel]`. Sensible defaults matching examples above (e.g. `code:write` → low, `deploy:production` → critical). Unknown action types default to HIGH (fail-safe). Hot-reloadable. Leaves door open for future SecOps override. Future strategies: SecOps-assigned, fixed-per-type. -> - **D20 — Context Serialization:** Pydantic JSON via persistence backend. `ParkedContext` model with metadata columns (`execution_id`, `agent_id`, `task_id`, `parked_at`) + `context_json` blob. `ParkedContextRepository` protocol via existing `PersistenceBackend` (§7.6). Conversation stored **verbatim** — summarization is a context window management concern at resume time, not a persistence concern. -> - **D21 — Resume Injection:** Tool result injection. Approval requests modeled as tool calls (`request_human_approval`). Approval decision returned as `ToolResult` — semantically correct (approval IS the tool's return value). LLM conversation protocol requires a tool result after a tool call. Fallback: system message injection for engine-initiated parking (exception path). - ---- - -## 13. Human Interaction Layer - -### 13.1 Architecture: API-First - -The REST/WebSocket API is the **primary interface** for all consumers. The Web UI and any future CLI tool are thin clients that call the API — they contain no business logic. - -```text -┌─────────────────────────────────────────────┐ -│ SynthOrg Engine │ -│ (Core Logic, Agent Orchestration, Tasks) │ -└──────────────────┬──────────────────────────┘ - │ - ┌────────▼────────┐ - │ REST/WS API │ ← primary interface - │ (Litestar) │ - └───┬─────────┬───┘ - │ │ - ┌───────▼──┐ ┌───▼────────┐ - │ Web UI │ │ CLI Tool │ - │ (Future) │ │ (Future) │ - └──────────┘ └────────────┘ -``` - -> **CLI Tool (Future):** If needed, a thin CLI utility wrapping the REST API with terminal formatting (Typer + Rich or similar). Not a priority — the API is fully self-sufficient. To be determined whether a dedicated CLI is warranted or whether `curl`/`httpie` and the interactive Scalar docs at `/docs/api` suffice. - -### 13.2 API Surface - -```text -/api/v1/ - ├── /health # Health check, readiness - ├── /auth # Authentication: setup, login, password change, me - ├── /company # CRUD company config - ├── /agents # List, hire, fire, modify agents - ├── /departments # Department management - ├── /projects # Project CRUD - ├── /tasks # Task management - ├── /messages # Communication log - ├── /meetings # Schedule, view meeting outputs - ├── /artifacts # Browse produced artifacts (code, docs, etc.) - ├── /budget # Spending, limits, projections - ├── /approvals # Pending human approvals queue - ├── /analytics # Performance metrics, dashboards - ├── /providers # Model provider status, config - └── /ws # WebSocket for real-time updates -``` - -### 13.3 Web UI Features - -- **Dashboard**: Real-time company overview, active tasks, spending -- **Org Chart**: Visual hierarchy, click to inspect any agent -- **Task Board**: Kanban/list view of all tasks across projects -- **Message Feed**: Real-time feed of agent communications -- **Approval Queue**: Pending approvals with context and recommendations -- **Agent Profiles**: Detailed view of each agent's identity, history, metrics -- **Budget Panel**: Spending charts, projections, alerts -- **Meeting Logs**: Transcripts and outcomes of all agent meetings -- **Artifact Browser**: Browse and inspect all produced work -- **Settings**: Company config, autonomy levels, provider settings - -### 13.4 Human Roles - -The human can interact as: - -| Role | Access | Description | -|------|--------|-------------| -| **Board Member** | Observe + major approvals only | Minimal involvement, strategic oversight | -| **CEO** | Full authority, replaces CEO agent | Human IS the CEO, agents are the team | -| **Manager** | Department-level authority | Manages one team/department directly | -| **Observer** | Read-only | Watch the company operate, no intervention | -| **Pair Programmer** | Direct collaboration with one agent | Work alongside a specific agent in real-time | - ---- - -## 14. Templates & Builder - -### 14.1 Template System - -Templates are YAML/JSON files defining a complete company setup: - -```yaml -# templates/startup.yaml (simplified — real templates also declare -# variables, departments, min_agents/max_agents, and tags) -template: - name: "Tech Startup" - description: "Small team for building MVPs and prototypes" - version: "1.0" - - company: - type: "startup" - budget_monthly: "{{ budget | default(50.00) }}" - autonomy: 0.5 - - agents: - - role: "CEO" - name: "{{ ceo_name | auto }}" - model: "large" - personality_preset: "visionary_leader" - - - role: "Full-Stack Developer" - merge_id: "fullstack-senior" - name: "{{ dev1_name | auto }}" - level: "senior" - model: "medium" - personality_preset: "pragmatic_builder" - - - role: "Full-Stack Developer" - merge_id: "fullstack-mid" - name: "{{ dev2_name | auto }}" - level: "mid" - model: "small" - personality_preset: "eager_learner" - - - role: "Product Manager" - name: "{{ pm_name | auto }}" - model: "medium" - personality_preset: "strategic_planner" - - workflow: "agile_kanban" - communication: "hybrid" - - workflow_handoffs: - - from_department: "engineering" - to_department: "qa" - trigger: "pr_ready" - - escalation_paths: - - from_department: "engineering" - to_department: "security" - condition: "vulnerability_found" -``` - -**Template Inheritance** — Templates can extend other templates using `extends`: - -```yaml -template: - name: "Extended Startup" - extends: "startup" # inherits all agents, departments, config - agents: - - role: "QA Engineer" # appended to parent agents - level: "mid" - - role: "Full-Stack Developer" - merge_id: "fullstack-mid" - department: "engineering" - _remove: true # removes matching parent agent by key -``` - -Inheritance resolves parent→child chains up to 10 levels deep. Merge semantics: -- **Scalars** (`company_name`, `company_type`): child wins if present. -- **`config`** dict: deep-merged (child keys override parent). -- **`agents`** list: merged by `(role, department, merge_id)` key. When `merge_id` is omitted, it defaults to an empty string, making the key `(role, department, "")`. Child can override, append, or remove (`_remove: true`) parent agents. -- **`departments`** list: merged by name (case-insensitive). Child dept replaces parent entirely. -- **`workflow_handoffs`**, **`escalation_paths`**: child replaces entirely if present. - -Circular inheritance is detected via chain tracking and raises `TemplateInheritanceError`. - -### 14.2 Company Builder (Future) - -> **Deferred.** The template system (§14.1) already supports creating companies from YAML configs. An interactive wizard is a nice-to-have after the REST API exists — it could be a thin CLI utility or a web form that POSTs to `/api/v1/company`. To be determined. - -### 14.3 Community Marketplace (Future) - -- Share company templates -- Share custom role definitions -- Share workflow configurations -- Rating and review system -- Import/export in standard format - ---- - -## 15. Technical Architecture - -### 15.1 High-Level Architecture - -```text -┌──────────────────────────────────────────────────────────────┐ -│ SynthOrg Engine │ -│ │ -│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ -│ │ Company Mgr │ │ Agent Engine │ │ Task/Workflow Eng. │ │ -│ │ (Config, │ │ (Lifecycle, │ │ (Queue, Routing, │ │ -│ │ Templates, │ │ Personality, │ │ Dependencies, │ │ -│ │ Hierarchy) │ │ Execution) │ │ Scheduling) │ │ -│ └──────────────┘ └──────────────┘ └────────────────────┘ │ -│ │ -│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ -│ │ Comms Layer │ │ Memory Layer │ │ Tool/Capability │ │ -│ │ (Message Bus,│ │ (Pluggable, │ │ System (MCP, │ │ -│ │ Meetings, │ │ Retrieval, │ │ Sandboxing, │ │ -│ │ A2A) │ │ Archive) │ │ Permissions) │ │ -│ └──────────────┘ └──────────────┘ └────────────────────┘ │ -│ │ -│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ -│ │ Provider Lyr │ │ Budget/Cost │ │ Security/Approval │ │ -│ │ (Unified, │ │ Engine │ │ System │ │ -│ │ Routing, │ │ (Tracking, │ │ (SecOps Agent, │ │ -│ │ Fallbacks) │ │ Limits, │ │ Audit Log, │ │ -│ │ │ │ CFO Agent) │ │ Human Queue) │ │ -│ └──────────────┘ └──────────────┘ └────────────────────┘ │ -│ │ -│ ┌────────────────────────────────────────────────────────┐ │ -│ │ API Layer (Async Framework + WebSocket) │ │ -│ └────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌──────────────────────┐ ┌─────────────────────────────┐ │ -│ │ Web UI (Local) │ │ CLI Tool │ │ -│ │ Web Dashboard │ │ synthorg │ │ -│ └──────────────────────┘ └─────────────────────────────┘ │ -└──────────────────────────────────────────────────────────────┘ -``` - -### 15.2 Technology Stack - -| Component | Technology | Rationale | -|-----------|-----------|-----------| -| **Language** | Python 3.14+ | Best AI/ML ecosystem, all major frameworks use it, LiteLLM/MCP and memory layer candidates all Python-native. PEP 649 native lazy annotations, PEP 758 except syntax. | -| **API Framework** | Litestar | Async-native, built-in channels (pub/sub WebSocket), auto OpenAPI 3.1 docs, class-based controllers, native route guards, built-in rate limiting / CSRF / compression middleware, explicit DI, Pydantic v2 support via plugin. Chosen over FastAPI — see §15.4 | -| **LLM Abstraction** | LiteLLM | 100+ providers, unified API, built-in cost tracking, retries/fallbacks | -| **Agent Memory** | Mem0 (Qdrant + SQLite) → custom (Neo4j + Qdrant) | Mem0 in-process as initial backend behind pluggable `MemoryBackend` protocol ([ADR-001](docs/decisions/ADR-001-memory-layer.md)). Qdrant embedded + SQLite for persistence. Custom stack (Neo4j + Qdrant external) as future upgrade. Config-driven backend selection | -| **Message Bus** | Internal (async queues) → Redis | Start with Python asyncio queues, upgrade to Redis for multi-process/distributed | -| **Task Queue** | Internal → Celery/Redis | Start simple, scale with Celery when needed | -| **Database** | SQLite (aiosqlite) → PostgreSQL / MariaDB | Pluggable `PersistenceBackend` protocol (§7.6). SQLite ships first via aiosqlite async driver. PostgreSQL, MariaDB as future backends — swap via config, no app code changes | -| **Web UI** | Vue 3 + Vite | Modern, fast, good ecosystem. Simpler than React for dashboards | -| **Real-time** | WebSocket (Litestar channels plugin) | Built-in pub/sub broadcasting, per-channel history, backpressure management. Real-time agent activity, task updates, chat feed | -| **Containerization** | Docker + Docker Compose | Production container packaging: Chainguard Python distroless runtime (non-root UID 65532, CIS Docker Benchmark v1.6.0 hardened, minimal attack surface, continuously scanned in CI), `nginxinc/nginx-unprivileged` web tier, GHCR registry, cosign image signing, Trivy + Grype vulnerability scanning, SBOM + SLSA provenance. Also used for isolated code execution sandboxing | -| **Docker API** | aiodocker | Async-native Docker API client for `DockerSandbox` backend | -| **Tool Integration** | MCP SDK (`mcp`) | Industry standard for LLM-to-tool integration | -| **Agent Comms** | A2A Protocol compatible | Future-proof inter-agent communication | -| **Authentication** | PyJWT + argon2-cffi | JWT (HMAC HS256/384/512) for session tokens, Argon2id for password hashing, HMAC-SHA256 for API key storage (keyed with server secret) | -| **Config Format** | YAML + Pydantic validation | Human-readable config with strict validation | -| **CLI** | TBD (future, if needed) | Thin wrapper around the REST API for terminal use. May not be needed — interactive Scalar docs at `/docs/api` and `curl`/`httpie` may suffice | - -### 15.3 Project Structure - -Files marked with `(planned)` do not exist yet — only stub `__init__.py` files are present. All other files listed below exist in the codebase. - -```text -synthorg/ -├── src/ -│ └── ai_company/ -│ ├── __init__.py -│ ├── constants.py # Top-level constants -│ ├── py.typed # PEP 561 type marker -│ ├── config/ # Configuration loading & validation -│ │ ├── schema.py # Pydantic models for all config -│ │ ├── loader.py # YAML/JSON config loader -│ │ ├── defaults.py # Default configurations -│ │ ├── errors.py # Config error classes -│ │ └── utils.py # Config utilities -│ ├── core/ # Core domain models -│ │ ├── agent.py # AgentIdentity (frozen) -│ │ ├── types.py # Shared validated types (NotBlankStr, etc.) -│ │ ├── company.py # Company structure -│ │ ├── approval.py # ApprovalItem domain model (approval queue) -│ │ ├── enums.py # Core enumerations -│ │ ├── task.py # Task model & state machine -│ │ ├── task_transitions.py # Task state transitions -│ │ ├── project.py # Project management -│ │ ├── artifact.py # Produced work items -│ │ ├── role.py # Role model -│ │ ├── role_catalog.py # Role catalog -│ │ ├── personality.py # Personality compatibility scoring -│ │ └── resilience_config.py # RetryConfig, RateLimiterConfig (shared by config.schema + providers.resilience) -│ ├── engine/ # Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, task lifecycle, recovery, shutdown, workspace isolation, coordination error classification, and prompt policy validation -│ │ ├── errors.py # Engine error hierarchy -│ │ ├── prompt.py # System prompt builder -│ │ ├── prompt_template.py # System prompt Jinja2 templates -│ │ ├── task_execution.py # TaskExecution + StatusTransition -│ │ ├── context.py # AgentContext + AgentContextSnapshot -│ │ ├── loop_protocol.py # ExecutionLoop protocol + result models -│ │ ├── metrics.py # TaskCompletionMetrics proxy overhead model -│ │ ├── policy_validation.py # Org policy quality heuristics (non-inferable principle) -│ │ ├── react_loop.py # ReAct loop implementation -│ │ ├── plan_models.py # Plan step, plan, and plan-execute config models -│ │ ├── plan_execute_loop.py # Plan-and-Execute loop implementation -│ │ ├── plan_parsing.py # Plan extraction from LLM responses (JSON + text fallback) -│ │ ├── loop_helpers.py # Shared stateless helpers for all loop implementations -│ │ ├── recovery.py # Crash recovery strategies (RecoveryStrategy protocol) -│ │ ├── cost_recording.py # Per-turn cost recording helpers -│ │ ├── run_result.py # AgentRunResult outcome model -│ │ ├── _validation.py # Input validation helpers for AgentEngine -│ │ ├── agent_engine.py # Agent execution engine -│ │ ├── parallel.py # Parallel agent executor (TaskGroup + Semaphore) -│ │ ├── parallel_models.py # AgentAssignment, ParallelExecutionGroup, AgentOutcome, ParallelExecutionResult, ParallelProgress -│ │ ├── resource_lock.py # ResourceLock protocol + InMemoryResourceLock -│ │ ├── shutdown.py # Graceful shutdown strategy & manager -│ │ ├── classification/ # Coordination error taxonomy classification (§10.5) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # ErrorSeverity, ErrorFinding, ClassificationResult -│ │ │ ├── detectors.py # Per-category detection heuristics -│ │ │ └── pipeline.py # classify_execution_errors orchestrator -│ │ ├── assignment/ # Task assignment subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # AssignmentRequest, AssignmentResult, AssignmentCandidate, AgentWorkload -│ │ │ ├── protocol.py # TaskAssignmentStrategy protocol -│ │ │ ├── service.py # TaskAssignmentService (orchestrates strategy + validation) -│ │ │ ├── registry.py # STRATEGY_MAP + build_strategy_map factory -│ │ │ └── strategies.py # All 6 strategy implementations -│ │ ├── decomposition/ # Task decomposition subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── classifier.py # TaskStructureClassifier (sequential/parallel/mixed) -│ │ │ ├── dag.py # DependencyGraph (validation, topo sort, parallel groups) -│ │ │ ├── llm.py # LlmDecompositionStrategy (LLM-based decomposition with tool calling) -│ │ │ ├── llm_prompt.py # Prompt building and response parsing for LLM decomposition -│ │ │ ├── manual.py # ManualDecompositionStrategy -│ │ │ ├── models.py # SubtaskDefinition, DecompositionPlan, DecompositionResult, SubtaskStatusRollup, DecompositionContext -│ │ │ ├── protocol.py # DecompositionStrategy protocol -│ │ │ ├── rollup.py # StatusRollup (compute subtask status aggregation) -│ │ │ └── service.py # DecompositionService (orchestrates strategy + classifier + DAG) -│ │ ├── workspace/ # Workspace isolation subsystem (§6.8) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── config.py # PlannerWorktreesConfig, WorkspaceIsolationConfig -│ │ │ ├── git_worktree.py # PlannerWorktreeStrategy (git worktree backend) -│ │ │ ├── merge.py # MergeOrchestrator (sequential merge with conflict escalation) -│ │ │ ├── models.py # Workspace, WorkspaceRequest, MergeResult, MergeConflict, WorkspaceGroupResult -│ │ │ ├── protocol.py # WorkspaceIsolationStrategy protocol -│ │ │ └── service.py # WorkspaceIsolationService (lifecycle orchestrator) -│ │ ├── routing/ # Task routing subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # RoutingCandidate, RoutingDecision, RoutingResult, AutoTopologyConfig -│ │ │ ├── scorer.py # AgentTaskScorer (skill/role/seniority matching) -│ │ │ ├── service.py # TaskRoutingService (routes subtasks to agents) -│ │ │ └── topology_selector.py # TopologySelector (auto coordination topology) -│ ├── hr/ # HR engine: hiring, firing, onboarding, offboarding, agent registry, performance tracking -│ │ ├── __init__.py # Package exports -│ │ ├── enums.py # HR enumerations (HiringRequestStatus, FiringReason, OnboardingStep, LifecycleEventType, TrendDirection, PromotionDirection) -│ │ ├── errors.py # HR error hierarchy -│ │ ├── models.py # CandidateCard, HiringRequest, FiringRequest, OnboardingChecklist, OffboardingRecord, AgentLifecycleEvent -│ │ ├── registry.py # AgentRegistryService (agent lifecycle registry) -│ │ ├── hiring_service.py # HiringService (request → generate candidate → approval → instantiate) -│ │ ├── onboarding_service.py # OnboardingService (checklist management) -│ │ ├── offboarding_service.py # OffboardingService (reassign → archive → notify → terminate) -│ │ ├── archival_protocol.py # MemoryArchivalStrategy protocol -│ │ ├── full_snapshot_strategy.py # FullSnapshotArchivalStrategy -│ │ ├── reassignment_protocol.py # TaskReassignmentStrategy protocol -│ │ ├── queue_return_strategy.py # QueueReturnReassignmentStrategy -│ │ ├── persistence_protocol.py # HR-specific repository protocols -│ │ └── performance/ # Performance tracking subsystem -│ │ ├── __init__.py # Package exports -│ │ ├── models.py # TaskMetricRecord, CollaborationMetricRecord, WindowMetrics, TrendResult, etc. -│ │ ├── config.py # PerformanceConfig -│ │ ├── tracker.py # PerformanceTracker service -│ │ ├── quality_protocol.py # QualityScorer protocol -│ │ ├── ci_quality_strategy.py # CiQualityScorer (CI-based quality scoring) -│ │ ├── collaboration_protocol.py # CollaborationScorer protocol -│ │ ├── behavioral_collaboration_strategy.py # BehavioralCollaborationScorer -│ │ ├── trend_protocol.py # TrendDetector protocol -│ │ ├── theil_sen_strategy.py # TheilSenTrendDetector (robust trend detection) -│ │ ├── window_protocol.py # WindowAggregator protocol -│ │ └── multi_window_strategy.py # MultiWindowAggregator (multi-window rolling metrics) -│ │ └── promotion/ # Promotion/demotion subsystem (D14) -│ │ ├── config.py # PromotionConfig, PromotionCriteriaConfig, PromotionApprovalConfig, ModelMappingConfig -│ │ ├── models.py # CriterionResult, PromotionEvaluation, PromotionApprovalDecision, PromotionRecord, PromotionRequest -│ │ ├── criteria_protocol.py # PromotionCriteriaStrategy protocol -│ │ ├── approval_protocol.py # PromotionApprovalStrategy protocol -│ │ ├── model_mapping_protocol.py # ModelMappingStrategy protocol -│ │ ├── threshold_evaluator.py # ThresholdEvaluator (criteria evaluation) -│ │ ├── seniority_approval_strategy.py # SeniorityApprovalStrategy (approval decisions) -│ │ ├── seniority_model_mapping.py # SeniorityModelMapping (model resolution) -│ │ └── service.py # PromotionService orchestrator (evaluate, request, apply) -│ ├── communication/ # Inter-agent communication -│ │ ├── bus_memory.py # InMemoryMessageBus implementation -│ │ ├── bus_protocol.py # MessageBus protocol interface -│ │ ├── channel.py # Channel model -│ │ ├── config.py # Communication config -│ │ ├── conflict_resolution/ # Conflict resolution subsystem (§5.6) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _helpers.py # Shared utility (find_losers, pick_highest_seniority) -│ │ │ ├── authority_strategy.py # AuthorityResolver (Strategy 1) -│ │ │ ├── config.py # ConflictResolutionConfig, DebateConfig, HybridConfig -│ │ │ ├── debate_strategy.py # DebateResolver (Strategy 2) -│ │ │ ├── human_strategy.py # HumanEscalationResolver (Strategy 3) -│ │ │ ├── hybrid_strategy.py # HybridResolver (Strategy 4) -│ │ │ ├── models.py # Conflict, ConflictPosition, ConflictResolution, DissentRecord -│ │ │ ├── protocol.py # ConflictResolver, JudgeEvaluator protocols -│ │ │ └── service.py # ConflictResolutionService (orchestrator) -│ │ ├── delegation/ # Hierarchical delegation subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── authority.py # AuthorityValidator + AuthorityCheckResult -│ │ │ ├── hierarchy.py # HierarchyResolver (org hierarchy from Company) -│ │ │ ├── models.py # DelegationRequest, DelegationResult, DelegationRecord -│ │ │ └── service.py # DelegationService (orchestrates delegation flow) -│ │ ├── dispatcher.py # MessageDispatcher + DispatchResult -│ │ ├── enums.py # Communication enums -│ │ ├── errors.py # Communication + delegation error hierarchy -│ │ ├── handler.py # MessageHandler protocol, FunctionHandler, HandlerRegistration -│ │ ├── loop_prevention/ # Delegation loop prevention mechanisms -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _pair_key.py # Canonical agent-pair key utility -│ │ │ ├── ancestry.py # Ancestry cycle detection (pure function) -│ │ │ ├── circuit_breaker.py # DelegationCircuitBreaker, CircuitBreakerState -│ │ │ ├── dedup.py # DelegationDeduplicator (time-windowed) -│ │ │ ├── depth.py # Max delegation depth check (pure function) -│ │ │ ├── guard.py # DelegationGuard (orchestrates all mechanisms) -│ │ │ ├── models.py # GuardCheckOutcome -│ │ │ └── rate_limit.py # DelegationRateLimiter (per-pair) -│ │ ├── message.py # Message model -│ │ ├── meeting/ # Meeting protocol subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _parsing.py # Shared helpers for parsing decisions and action items -│ │ │ ├── _prompts.py # LLM prompt templates for meeting phases -│ │ │ ├── _token_tracker.py # TokenTracker for duration_tokens enforcement -│ │ │ ├── config.py # MeetingProtocolConfig, protocol-specific config models -│ │ │ ├── enums.py # MeetingProtocolType, MeetingPhase enums -│ │ │ ├── errors.py # Meeting error hierarchy -│ │ │ ├── models.py # MeetingRecord, MeetingAgendaItem, ActionItem, etc. -│ │ │ ├── orchestrator.py # MeetingOrchestrator (runs meetings end-to-end) -│ │ │ ├── position_papers.py # PositionPapersProtocol implementation -│ │ │ ├── protocol.py # MeetingProtocol protocol interface -│ │ │ ├── round_robin.py # RoundRobinProtocol implementation -│ │ │ └── structured_phases.py # StructuredPhasesProtocol implementation -│ │ ├── messenger.py # AgentMessenger per-agent facade -│ │ └── subscription.py # Subscription + DeliveryEnvelope models -│ ├── memory/ # Agent memory system — protocols, models, config, factory, retrieval pipeline (ranking, injection, context formatting, non-inferable filtering) -│ │ ├── __init__.py # Re-exports -│ │ ├── capabilities.py # MemoryCapabilities protocol -│ │ ├── config.py # CompanyMemoryConfig, MemoryStorageConfig, MemoryOptionsConfig -│ │ ├── errors.py # Memory error hierarchy (MemoryError and subclasses) -│ │ ├── factory.py # create_memory_backend() factory -│ │ ├── formatter.py # format_memory_context() — ranked memories to ChatMessage(s) -│ │ ├── injection.py # MemoryInjectionStrategy protocol, InjectionStrategy enum, TokenEstimator -│ │ ├── models.py # MemoryEntry, MemoryMetadata, MemoryQuery, MemoryStoreRequest -│ │ ├── protocol.py # MemoryBackend protocol -│ │ ├── ranking.py # ScoredMemory model, rank_memories(), scoring functions -│ │ ├── retrieval_config.py # MemoryRetrievalConfig (weights, thresholds, strategy selection) -│ │ ├── filter.py # MemoryFilterStrategy protocol, TagBasedMemoryFilter, PassthroughMemoryFilter -│ │ ├── retriever.py # ContextInjectionStrategy (full retrieval → rank → format pipeline) -│ │ ├── store_guard.py # Advisory non-inferable tag enforcement at store boundary -│ │ ├── shared.py # SharedKnowledgeStore protocol -│ │ ├── consolidation/ # Memory consolidation — strategies, retention, archival -│ │ │ ├── __init__.py -│ │ │ ├── archival.py # ArchivalStore protocol -│ │ │ ├── config.py # ConsolidationConfig, ArchivalConfig, RetentionConfig -│ │ │ ├── models.py # ConsolidationResult, ArchivalEntry, RetentionRule -│ │ │ ├── retention.py # RetentionEnforcer -│ │ │ ├── service.py # MemoryConsolidationService -│ │ │ ├── simple_strategy.py # SimpleConsolidationStrategy -│ │ │ └── strategy.py # ConsolidationStrategy protocol -│ │ └── org/ # Shared organizational memory (§7.4) -│ │ ├── __init__.py -│ │ ├── access_control.py # Write access control -│ │ ├── config.py # OrgMemoryConfig -│ │ ├── errors.py # OrgMemory error hierarchy -│ │ ├── factory.py # create_org_memory_backend() -│ │ ├── hybrid_backend.py # HybridPromptRetrievalBackend -│ │ ├── models.py # OrgFact, OrgFactAuthor, OrgMemoryQuery -│ │ ├── protocol.py # OrgMemoryBackend protocol -│ │ └── store.py # OrgFactStore protocol, SQLiteOrgFactStore -│ ├── persistence/ # Operational data persistence (§7.6) -│ │ ├── __init__.py # Package exports -│ │ ├── protocol.py # PersistenceBackend protocol -│ │ ├── repositories.py # Repository protocols: TaskRepository, CostRecordRepository, MessageRepository, ParkedContextRepository, AuditRepository, UserRepository, ApiKeyRepository -│ │ ├── config.py # PersistenceConfig model -│ │ ├── errors.py # Persistence error hierarchy -│ │ ├── factory.py # create_backend() factory -│ │ └── sqlite/ # SQLite backend (initial) -│ │ ├── __init__.py # Package exports -│ │ ├── backend.py # SQLitePersistenceBackend -│ │ ├── repositories.py # SQLite repository implementations -│ │ ├── hr_repositories.py # SQLite HR repositories (LifecycleEvent, TaskMetricRecord, CollaborationMetricRecord) -│ │ ├── parked_context_repo.py # SQLiteParkedContextRepository (park/resume serialized agent state) -│ │ ├── audit_repository.py # SQLiteAuditRepository (append-only audit entry persistence) -│ │ ├── user_repo.py # SQLiteUserRepository + SQLiteApiKeyRepository -│ │ └── migrations.py # Schema migrations (user_version pragma, v1–v5) -│ ├── observability/ # Structured logging & correlation -│ │ ├── __init__.py # get_logger() entry point -│ │ ├── _logger.py # Logger configuration -│ │ ├── config.py # Observability config -│ │ ├── correlation.py # Correlation ID tracking -│ │ ├── enums.py # Log-related enums -│ │ ├── events/ # Per-domain event constants -│ │ │ ├── __init__.py # Package marker with usage docs; no re-exports -│ │ │ ├── api.py # API_* event constants -│ │ │ ├── autonomy.py # AUTONOMY_* constants -│ │ │ ├── budget.py # BUDGET_* constants -│ │ │ ├── cfo.py # CFO_* constants -│ │ │ ├── classification.py # CLASSIFICATION_* constants -│ │ │ ├── consolidation.py # CONSOLIDATION_* and RETENTION_* constants -│ │ │ ├── company.py # COMPANY_* constants -│ │ │ ├── communication.py # COMM_* constants -│ │ │ ├── conflict.py # CONFLICT_* constants -│ │ │ ├── config.py # CONFIG_* constants -│ │ │ ├── delegation.py # DELEGATION_* constants -│ │ │ ├── correlation.py # CORRELATION_* constants -│ │ │ ├── decomposition.py # DECOMPOSITION_* constants -│ │ │ ├── execution.py # EXECUTION_* constants -│ │ │ ├── git.py # GIT_* constants -│ │ │ ├── hr.py # HR_* constants -│ │ │ ├── meeting.py # MEETING_* constants -│ │ │ ├── memory.py # MEMORY_* constants -│ │ │ ├── org_memory.py # ORG_MEMORY_* constants -│ │ │ ├── parallel.py # PARALLEL_* constants -│ │ │ ├── performance.py # PERF_* constants -│ │ │ ├── persistence.py # PERSISTENCE_* constants -│ │ │ ├── personality.py # PERSONALITY_* constants -│ │ │ ├── prompt.py # PROMPT_* constants -│ │ │ ├── quota.py # QUOTA_* event constants -│ │ │ ├── provider.py # PROVIDER_* constants -│ │ │ ├── role.py # ROLE_* constants -│ │ │ ├── routing.py # ROUTING_* constants -│ │ │ ├── sandbox.py # SANDBOX_* constants -│ │ │ ├── security.py # SECURITY_* constants -│ │ │ ├── task.py # TASK_* constants -│ │ │ ├── task_assignment.py # TASK_ASSIGNMENT_* constants -│ │ │ ├── task_routing.py # TASK_ROUTING_* constants -│ │ │ ├── template.py # TEMPLATE_* constants -│ │ │ ├── timeout.py # TIMEOUT_* constants -│ │ │ ├── tool.py # TOOL_* constants -│ │ │ ├── workspace.py # WORKSPACE_* constants -│ │ │ ├── code_runner.py # CODE_RUNNER_* constants -│ │ │ ├── docker.py # DOCKER_* constants -│ │ │ ├── mcp.py # MCP_* constants -│ │ │ ├── trust.py # Trust event constants -│ │ │ └── promotion.py # Promotion event constants -│ │ ├── processors.py # Log processors -│ │ ├── setup.py # Logging setup -│ │ └── sinks.py # Log output backends -│ ├── providers/ # LLM provider abstraction -│ │ ├── base.py # BaseCompletionProvider (retry + rate limiting) -│ │ ├── protocol.py # Provider protocol (abstract interface) -│ │ ├── models.py # CompletionConfig/Response, TokenUsage, ToolCall/Result -│ │ ├── capabilities.py # Provider capability registry -│ │ ├── registry.py # Provider registry -│ │ ├── enums.py # Provider enumerations -│ │ ├── errors.py # Provider error hierarchy -│ │ ├── drivers/ # Provider driver implementations -│ │ │ ├── litellm_driver.py # LiteLLM adapter -│ │ │ └── mappers.py # Request/response mappers -│ │ ├── routing/ # Model routing (5 strategies) -│ │ │ ├── _strategy_helpers.py # Shared routing helper functions -│ │ │ ├── errors.py # Routing errors -│ │ │ ├── models.py # Routing models (candidates, results) -│ │ │ ├── resolver.py # Model resolver -│ │ │ ├── router.py # Router orchestrator -│ │ │ └── strategies.py # Routing strategies -│ │ └── resilience/ # Resilience patterns -│ │ ├── errors.py # RetryExhaustedError -│ │ ├── rate_limiter.py # Token bucket rate limiter -│ │ └── retry.py # RetryHandler with backoff -│ ├── tools/ # Tool/capability system -│ │ ├── base.py # BaseTool ABC, ToolExecutionResult -│ │ ├── registry.py # Immutable tool registry (MappingProxyType) -│ │ ├── invoker.py # Tool invocation (concurrent via TaskGroup) -│ │ ├── permissions.py # ToolPermissionChecker (access-level gating) -│ │ ├── errors.py # Tool error hierarchy (incl. ToolPermissionDeniedError) -│ │ ├── examples/ # Example tool implementations -│ │ │ ├── __init__.py # Package exports -│ │ │ └── echo.py # Echo tool (for testing) -│ │ ├── file_system/ # Built-in file system tools -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _base_fs_tool.py # BaseFileSystemTool ABC -│ │ │ ├── _path_validator.py # Workspace path validation -│ │ │ ├── delete_file.py # DeleteFileTool -│ │ │ ├── edit_file.py # EditFileTool -│ │ │ ├── list_directory.py # ListDirectoryTool -│ │ │ ├── read_file.py # ReadFileTool -│ │ │ └── write_file.py # WriteFileTool -│ │ ├── _git_base.py # Base class for git tools (workspace, subprocess, sandbox integration) -│ │ ├── _process_cleanup.py # Subprocess transport cleanup utility (Windows ResourceWarning prevention) -│ │ ├── git_tools.py # Git operations — 6 built-in tools (sandbox-aware) -│ │ ├── code_runner.py # Code execution tool -│ │ ├── web_tools.py # HTTP, search (planned) -│ │ ├── sandbox/ # Sandbox backends subpackage -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── config.py # Subprocess sandbox configuration -│ │ │ ├── docker_config.py # Docker sandbox configuration -│ │ │ ├── docker_sandbox.py # DockerSandbox backend (aiodocker) -│ │ │ ├── errors.py # Sandbox error hierarchy -│ │ │ ├── protocol.py # SandboxBackend protocol -│ │ │ ├── result.py # SandboxResult model -│ │ │ ├── sandboxing_config.py # Top-level sandboxing config -│ │ │ └── subprocess_sandbox.py # SubprocessSandbox backend -│ │ └── mcp/ # MCP bridge subpackage -│ │ ├── __init__.py # Package exports -│ │ ├── bridge_tool.py # MCPBridgeTool (BaseTool integration) -│ │ ├── cache.py # MCP result cache (TTL + LRU) -│ │ ├── client.py # MCP client wrapper -│ │ ├── config.py # MCP server/bridge config models -│ │ ├── errors.py # MCP error hierarchy -│ │ ├── factory.py # MCPToolFactory (parallel connect) -│ │ ├── models.py # MCP domain models -│ │ └── result_mapper.py # MCP result → ToolExecutionResult mapping -│ ├── security/ # Security & approval -│ │ ├── action_type_mapping.py # Default ToolCategory → ActionType mapping -│ │ ├── action_types.py # ActionTypeCategory registry and validation -│ │ ├── audit.py # Append-only AuditLog with configurable eviction -│ │ ├── config.py # SecurityConfig, SecurityPolicyRule, RuleEngineConfig, OutputScanPolicyType -│ │ ├── models.py # SecurityVerdict, SecurityContext, AuditEntry, OutputScanResult -│ │ ├── output_scan_policy.py # Output scan response policies (redact/withhold/log-only/autonomy-tiered) -│ │ ├── output_scan_policy_factory.py # build_output_scan_policy() factory -│ │ ├── output_scanner.py # Post-tool output scanning (regex-based redaction) -│ │ ├── protocol.py # SecurityInterceptionStrategy protocol -│ │ ├── service.py # SecOpsService — meta-agent coordinating security -│ │ ├── autonomy/ # Autonomy levels, presets, resolver, change strategy (§12.2) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # AutonomyLevel enum, AutonomyPreset, AutonomyConfig, AutonomyChangeEvent -│ │ │ ├── protocol.py # AutonomyChangeStrategy protocol -│ │ │ ├── change_strategy.py # Rule-based auto-downgrade + human-only promotion strategy -│ │ │ └── resolver.py # AutonomyResolver (agent → department → company chain) -│ │ ├── timeout/ # Approval timeout policies, park/resume, risk tier classifier (§12.4) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── config.py # TimeoutPolicyConfig -│ │ │ ├── factory.py # build_timeout_policy() factory -│ │ │ ├── models.py # TimeoutDecision, RiskTier -│ │ │ ├── park_service.py # ParkResumeService (park/resume blocked tasks) -│ │ │ ├── parked_context.py # ParkedContext model (serialized agent state) -│ │ │ ├── policies.py # WaitForeverPolicy, AutoDenyPolicy, TieredPolicy, EscalationChainPolicy -│ │ │ ├── protocol.py # TimeoutPolicy protocol -│ │ │ ├── risk_tier_classifier.py # RiskTierClassifier (ActionType → RiskTier) -│ │ │ └── timeout_checker.py # TimeoutChecker (polls pending approvals) -│ │ └── rules/ # Rule engine and detectors -│ │ ├── engine.py # RuleEngine (soft-allow + hard-deny, fail-closed) -│ │ ├── protocol.py # SecurityRule protocol -│ │ ├── policy_validator.py # Policy list validation rule (hard-deny/auto-approve) -│ │ ├── risk_classifier.py # RiskClassifier (ActionType → ApprovalRiskLevel) -│ │ ├── credential_detector.py # Credential/secret pattern detection (API keys, tokens) -│ │ ├── data_leak_detector.py # Data leak detection (PII, sensitive file paths) -│ │ ├── destructive_op_detector.py # Destructive operation detection (rm -rf, DROP TABLE) -│ │ ├── path_traversal_detector.py # Path traversal attack detection (../, null bytes) -│ │ └── _utils.py # walk_string_values utility (recursive argument scanning) -│ │ └── trust/ # Progressive trust subsystem (§11.3) -│ │ ├── config.py # TrustConfig, strategy-specific sub-configs -│ │ ├── enums.py # TrustStrategyType, TrustChangeReason -│ │ ├── errors.py # TrustEvaluationError -│ │ ├── levels.py # Shared trust level ordering and transition constants -│ │ ├── models.py # TrustState, TrustEvaluationResult, TrustChangeRecord -│ │ ├── protocol.py # TrustStrategy protocol -│ │ ├── service.py # TrustService orchestrator (state, evaluation, decay, approval) -│ │ ├── disabled_strategy.py # DisabledTrustStrategy (passthrough) -│ │ ├── weighted_strategy.py # WeightedTrustStrategy (weighted score → thresholds) -│ │ ├── per_category_strategy.py # PerCategoryTrustStrategy (per-tool-category tracks) -│ │ └── milestone_strategy.py # MilestoneTrustStrategy (milestone gates + decay) -│ ├── budget/ # Cost management -│ │ ├── _optimizer_helpers.py # CostOptimizer shared helper functions -│ │ ├── config.py # Budget configuration models -│ │ ├── cost_record.py # CostRecord model (frozen) -│ │ ├── cost_tiers.py # Cost tier definitions, classification, and built-in tiers -│ │ ├── call_category.py # LLM call category enums (productive, coordination, system) -│ │ ├── category_analytics.py # Per-category cost breakdown + orchestration ratio -│ │ ├── coordination_config.py # Coordination metrics config models -│ │ ├── coordination_metrics.py # Five coordination metric models + computation -│ │ ├── tracker.py # CostTracker service (records + queries) -│ │ ├── spending_summary.py # _SpendingTotals base + spending summary models -│ │ ├── hierarchy.py # BudgetHierarchy, BudgetConfig -│ │ ├── enums.py # Budget-related enums -│ │ ├── billing.py # Billing period computation utilities -│ │ ├── enforcer.py # BudgetEnforcer service (pre-flight, in-flight, auto-downgrade) -│ │ ├── errors.py # BudgetExhaustedError, DailyLimitExceededError, QuotaExhaustedError -│ │ ├── optimizer.py # CostOptimizer service — anomaly detection, efficiency analysis, downgrade recommendations, approval decisions -│ │ ├── optimizer_models.py # CostOptimizer domain models — anomaly, efficiency, downgrade, approval, config -│ │ ├── quota.py # Quota/subscription models, degradation config, quota snapshots -│ │ ├── quota_tracker.py # QuotaTracker service: per-provider request/token quota enforcement -│ │ └── reports.py # Spending reports -│ ├── api/ # REST + WebSocket API -│ │ ├── app.py # Litestar application factory, lifecycle hooks -│ │ ├── approval_store.py # In-memory approval queue storage -│ │ ├── auth/ # JWT + API key authentication subsystem -│ │ │ ├── config.py # AuthConfig (frozen Pydantic, JWT HMAC algorithm, exclude paths) -│ │ │ ├── controller.py # AuthController (setup, login, change-password, me) -│ │ │ ├── middleware.py # ApiAuthMiddleware (JWT-first, API key fallback) -│ │ │ ├── models.py # User, ApiKey, AuthenticatedUser, AuthMethod -│ │ │ ├── secret.py # JWT secret resolution (env var → persistence → auto-generate) -│ │ │ └── service.py # AuthService (Argon2id password hashing, JWT ops, HMAC-SHA256 API key hashing) -│ │ ├── bus_bridge.py # Message-bus → WebSocket bridge -│ │ ├── channels.py # WebSocket channel definitions -│ │ ├── config.py # API configuration models (ServerConfig, CorsConfig) -│ │ ├── controllers/ # 14 class-based controllers + 1 WebSocket handler (15 route modules) -│ │ ├── dto.py # Request/response DTOs and envelopes -│ │ ├── errors.py # API error hierarchy (ApiError, NotFoundError, UnauthorizedError, etc.) -│ │ ├── exception_handlers.py # Litestar exception handler registration -│ │ ├── guards.py # Route guards — role-based read/write access control (HumanRole enum) -│ │ ├── middleware.py # Request logging, CSP middleware -│ │ ├── pagination.py # Cursor-free offset/limit pagination -│ │ ├── server.py # Uvicorn server runner -│ │ ├── state.py # Typed AppState container with service access (deferred auth init) -│ │ └── ws_models.py # WebSocket event models (WsEvent, WsEventType) -│ ├── cli/ # CLI interface (future, if needed) -│ │ ├── __init__.py -│ │ └── commands/ -│ │ └── __init__.py -│ └── templates/ # Company templates -│ ├── schema.py # Template schema models -│ ├── loader.py # Template loader -│ ├── renderer.py # Template renderer -│ ├── merge.py # Template config merging for inheritance -│ ├── presets.py # Personality presets + auto-name generation -│ ├── errors.py # Template errors -│ └── builtins/ # Pre-built company templates -│ ├── agency.yaml -│ ├── dev_shop.yaml -│ ├── full_company.yaml -│ ├── product_team.yaml -│ ├── research_lab.yaml -│ ├── solo_founder.yaml -│ └── startup.yaml -├── tests/ -│ ├── unit/ -│ ├── integration/ -│ └── e2e/ -├── mkdocs.yml # MkDocs configuration -├── docs/ -│ ├── index.md # Documentation landing page -│ ├── getting_started.md -│ ├── overrides/ # MkDocs theme overrides -│ ├── architecture/ -│ │ ├── index.md # Architecture overview -│ │ └── decisions.md # ADR index -│ ├── api/ # Auto-generated API reference (mkdocstrings) -│ │ ├── index.md # API reference landing -│ │ ├── core.md, engine.md, providers.md, budget.md, ... -│ │ └── tools.md -│ └── decisions/ -│ ├── ADR-001-memory-layer.md -│ ├── ADR-002-design-decisions-batch-1.md -│ └── ADR-003-documentation-architecture.md -├── site/ # Astro landing page (synthorg.io root) -│ ├── astro.config.mjs -│ ├── package.json -│ ├── tsconfig.json -│ ├── public/ -│ │ └── favicon.svg -│ └── src/ -│ ├── layouts/Base.astro -│ └── pages/index.astro -├── docker/ -│ ├── backend/ -│ │ └── Dockerfile # 3-stage: python:3.14-slim → chainguard/python-dev → chainguard/python (distroless) -│ ├── sandbox/ -│ │ └── Dockerfile # Code execution sandbox (Python + Node.js, non-root) -│ ├── web/ -│ │ └── Dockerfile # nginxinc/nginx-unprivileged (non-root) -│ ├── compose.yml # CIS-hardened orchestration -│ ├── compose.override.yml # Local dev overrides (debug logging) -│ └── .env.example # Environment variable reference -├── web/ -│ ├── app.js # Dashboard JavaScript -│ ├── index.html # Placeholder dashboard with health check -│ ├── nginx.conf # SPA routing + API/WebSocket proxy -│ └── style.css # Dashboard styles -├── .github/ -│ ├── workflows/ -│ │ ├── ci.yml # Lint + type-check + test (parallel) -│ │ ├── docker.yml # Build → scan → push → sign (GHCR) -│ │ ├── dependency-review.yml # License allow-list on PRs -│ │ ├── release.yml # Release Please (automated versioning + GitHub Releases) -│ │ ├── secret-scan.yml # Gitleaks on push/PR + weekly -│ │ ├── pages.yml # Build Astro + MkDocs → deploy GitHub Pages -│ │ ├── pages-preview.yml # PR preview → Cloudflare Pages -│ │ └── zizmor.yml # Workflow security analysis (zizmor) -│ ├── actions/ -│ │ └── setup-python-uv/ # Composite action: Python + uv install -│ ├── dependabot.yml # uv + github-actions + docker updates -│ ├── CHANGELOG.md # Release changelog (managed by Release Please) -│ ├── CONTRIBUTING.md -│ ├── SECURITY.md -│ ├── .grype.yaml # Grype CVE ignore list (synced with .trivyignore.yaml) -│ └── .trivyignore.yaml # Trivy CVE ignore list (structured YAML format) -├── .dockerignore # Consolidated Docker build context exclusions -├── .gitleaks.toml # Gitleaks config (test file allowlist) -├── DESIGN_SPEC.md # This document -├── README.md -├── pyproject.toml -└── CLAUDE.md -``` - -### 15.4 Key Design Decisions (Preliminary - Subject to Research) - -| Decision | Choice | Alternatives Considered | Rationale | -|----------|--------|------------------------|-----------| -| Language | Python 3.14+ | TypeScript, Go, Rust | AI ecosystem, LiteLLM/MCP and memory layer candidates are Python-native, PEP 649 lazy annotations, PEP 758 except syntax | -| API | Litestar | FastAPI, Flask, Django, aiohttp | Built-in channels (pub/sub WebSocket), class-based controllers, native route guards, middleware (rate limiting, CSRF, compression), explicit DI. FastAPI considered but Litestar offers more batteries-included for less custom code — see rationale below | -| LLM Layer | LiteLLM | Direct APIs, OpenRouter only | 100+ providers, cost tracking, fallbacks, load balancing built-in | -| Memory | Mem0 (initial) → custom stack (future) + SQLite | Graphiti, Letta, Cognee, custom | Mem0 in-process as initial backend behind pluggable `MemoryBackend` protocol ([ADR-001](docs/decisions/ADR-001-memory-layer.md)). Custom stack (Neo4j + Qdrant) as future upgrade. Must support episodic, semantic, procedural memory types (§7.1–7.3). Org memory served via `OrgMemoryBackend` protocol (§7.4) | -| Message Bus | asyncio queues → Redis | Kafka, RabbitMQ, NATS | Start simple, Redis well-supported, Kafka overkill for local | -| Config | YAML + Pydantic | JSON, TOML, Python dicts | Human-friendly, strict validation, good IDE support | -| CLI | Deferred (TBD) | Typer, Click, argparse | Thin API wrapper if needed. Scalar interactive docs at `/docs/api` + `curl`/`httpie` may suffice | -| Web UI | Vue 3 | React, Svelte, HTMX | Simpler than React for dashboards | -| Persistence | Pluggable protocol + repository protocols | ORM (SQLAlchemy), raw SQL, hybrid | Same frozen Pydantic models in and out (no DTOs), async throughout, backend-swappable via config. Repository protocols decouple app code from storage engine. See §7.6 | -| Sandboxing | Layered: subprocess + Docker | Docker-only, subprocess-only, WASM | Risk-proportionate: fast subprocess for file/git, Docker isolation for code execution. Pluggable `SandboxBackend` protocol enables K8s migration later | -| Container Packaging | Chainguard distroless + GHCR | Alpine, Debian-slim, scratch, Docker Hub | Chainguard Python distroless: no shell/package-manager (minimal attack surface), non-root by default, continuously scanned in CI. GHCR over Docker Hub: tighter GitHub integration, no rate limits for public images, native OIDC token auth. cosign keyless signing for supply-chain integrity. Trivy + Grype dual scanning for comprehensive CVE coverage | - -### 15.5 Engineering Conventions - -These conventions are used throughout the codebase. **Adopted** conventions are already in use. **Planned** conventions are approved design decisions not yet implemented. - -| Convention | Status | Decision | Rationale | -|------------|--------|----------|-----------| -| **Immutability strategy** | Adopted | `copy.deepcopy()` at construction + `MappingProxyType` wrapping for non-Pydantic internal collections (registries, `BaseTool`). For Pydantic frozen models: `frozen=True` prevents field reassignment; `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization) prevents nested mutation. No MappingProxyType inside Pydantic models (serialization friction). | Deep-copy at construction fully isolates nested structures; `MappingProxyType` enforces read-only access. Boundary-copy for Pydantic models is simple, centralized, and Pydantic-native. A future CPython built-in immutable mapping type (e.g. `frozendict`) would provide zero-friction field-level immutability when available. | -| **Config vs runtime split** | Adopted | Frozen models for config/identity; `model_copy(update=...)` for runtime state transitions | `TaskExecution` and `AgentContext` (in `engine/`) are frozen Pydantic models that use `model_copy(update=...)` for copy-on-write state transitions without re-running validators (per Pydantic `model_copy` semantics). Config layer (`AgentIdentity`, `Task`) remains unchanged. | -| **Derived fields** | Adopted | `@computed_field` instead of stored + validated | Eliminates redundant storage and impossible-to-fail validators. `TokenUsage.total_tokens` migrated from stored `Field` + `@model_validator` to `@computed_field` property. | -| **String validation** | Adopted | `NotBlankStr` type from `core.types` for all identifiers | Eliminates per-model `@model_validator` boilerplate for whitespace checks. All identifier/name fields use `NotBlankStr`; optional identifiers use `NotBlankStr \| None`; tuple fields use `tuple[NotBlankStr, ...]` for per-element validation. | -| **Shared field groups** | Adopted | Extracted common field sets into base models (e.g. `_SpendingTotals`) | Prevents field duplication across spending summary models. `_SpendingTotals` provides shared aggregation fields; `AgentSpending`, `DepartmentSpending`, `PeriodSpending` extend it. | -| **Event constants** | Adopted (per-domain) | Per-domain submodules under `events/` package (e.g. `events.provider`, `events.budget`). Import directly: `from ai_company.observability.events. import CONSTANT` | Split by domain for discoverability, co-location with domain logic, and reduced merge conflicts as constants grow. `__init__.py` serves as package marker with usage documentation; no re-exports. | -| **Parallel tool execution** | Adopted | `asyncio.TaskGroup` in `ToolInvoker.invoke_all` with optional `max_concurrency` semaphore | Structured concurrency with proper cancellation semantics. Fatal errors collected via guarded wrapper and re-raised after all tasks complete. | -| **Parallel agent execution** | Adopted | `ParallelExecutor` coordinates concurrent `AgentEngine.run()` calls via `asyncio.TaskGroup` + optional `Semaphore` concurrency limit + `_run_guarded()` error isolation. `ResourceLock` protocol with `InMemoryResourceLock` for exclusive file-path claims. Progress tracking via `ProgressCallback`. Shutdown-aware via `ShutdownManager` task registration. Fail-fast mode cancels sibling tasks on first failure; all errors are surfaced via `ParallelExecutionResult` outcomes. | Follows the `ToolInvoker.invoke_all()` pattern (parallel tool execution above). Composition over inheritance — wraps `AgentEngine`. Structured concurrency with proper cancellation. See §6.3 Parallel Execution. | -| **Tool permission checking** | Adopted | `ToolPermissionChecker` enforces category-level gating based on `ToolAccessLevel` (sandboxed → restricted → standard → elevated, plus custom). Priority-based resolution: denied list → allowed list → level categories → deny. Case-insensitive name matching. `ToolInvoker` filters definitions for prompt and checks at invocation time. | Defense-in-depth: agents only see permitted tools in the LLM prompt, and invocations are re-checked at execution time. Explicit allow/deny lists provide per-agent overrides. See §11.1.1. | -| **Tool sandboxing** | Adopted (incremental) | File system tools use in-process `PathValidator` for workspace-scoped path validation (symlink resolution + containment check). `BaseFileSystemTool` ABC provides shared `ToolCategory.FILE_SYSTEM` and `PathValidator` integration — all file system tools extend this base. `SandboxBackend` protocol with `SubprocessSandbox` implemented — git tools accept optional `SandboxBackend` injection and delegate subprocess management to it (env filtering, workspace enforcement, timeout + process-group kill). `DockerSandbox` planned for code_runner, terminal, web, and database tools. `K8sSandbox` planned for future container deployments. Config-driven per-category backend selection planned for engine wiring. | File system tools use defence-in-depth path validation; subprocess sandbox provides lightweight isolation for git tools; heavier Docker/K8s isolation reserved for higher-risk tool categories (code execution, network). See §11.1.2. | -| **Crash recovery** | Adopted | Pluggable `RecoveryStrategy` protocol. Current: `FailAndReassignStrategy` (catch at engine boundary, log snapshot, mark FAILED / eligible for reassignment). Planned: `CheckpointStrategy` (persist `AgentContext` per turn, resume from last checkpoint). | Immutable `model_copy` pattern makes checkpoint serialization trivial to add later. Fail-and-reassign is sufficient for short tasks. See §6.6. | -| **Personality compatibility scoring** | Adopted | Weighted composite: 60% Big Five similarity (openness, conscientiousness, agreeableness, stress_response → 1−\|diff\|; extraversion → tent-function peaking at 0.3 diff), 20% collaboration alignment (ordinal adjacency: INDEPENDENT↔PAIR↔TEAM), 20% conflict approach (constructive pairs score 1.0, destructive pairs 0.2, mixed 0.4–0.6). `itertools.combinations` for team-level averaging. Result clamped to [0, 1]. | Covers behavioral diversity (extraversion complement), task alignment (conscientiousness similarity), and interpersonal friction (conflict approach). Weights are configurable module constants. | -| **Agent behavior testing** | Planned | Scripted `FakeProvider` for unit tests (deterministic turn sequences); behavioral outcome assertions for integration tests (task completed, tools called, cost within budget). | Leverages existing `FakeProvider` and `CompletionResponseFactory` fixtures. Precise engine testing without brittle response-matching at integration level. | -| **LLM call analytics** | Adopted (incremental) | Proxy metrics (`turns_per_task`, `tokens_per_task`) — adopted. Data models for call categorization (`productive`, `coordination`, `system`), category analytics, coordination metrics, orchestration ratio — adopted. Runtime collection pipeline and full analytics: planned. | Append-only, never blocks execution. Builds on existing `CostRecord` infrastructure. Detects orchestration overhead early. See §10.5. | -| **Cost tiers & quota tracking** | Adopted | Configurable `CostTierDefinition` definitions with merge/override semantics via `resolve_tiers(config: CostTiersConfig)`. `SubscriptionConfig` + `QuotaLimit` model per-provider subscription plans. `QuotaTracker` enforces per-provider request/token quotas with window-based rotation. `DegradationConfig` controls behavior when quotas are exhausted (default: `ALERT` — raise error; `FALLBACK` and `QUEUE` strategies defined but not yet implemented). | Enables cost classification without hardcoding vendor tiers. Quota tracking prevents surprise overages at the provider level. Window-based rotation aligns quota resets with billing periods. See §10.4. | -| **Shared org memory** | Adopted | `OrgMemoryBackend` protocol (pluggable) with `HybridPromptRetrievalBackend` (Backend 1). `OrgFactStore` protocol with `SQLiteOrgFactStore` for persistent fact storage. Seniority-based write access control via `CategoryWriteRule`. Core policies injected into system prompts; extended facts retrieved on demand via `OrgMemoryQuery`. `OrgFact` model with `OrgFactAuthor` provenance tracking. Config-driven via `OrgMemoryConfig`. | Pluggable backend mirrors `MemoryBackend` pattern. Hybrid prompt+retrieval balances always-available core policies with on-demand extended knowledge. Seniority-based access control prevents junior agents from overwriting organizational knowledge. See §7.4. | -| **Memory consolidation** | Adopted | `ConsolidationStrategy` protocol with `SimpleConsolidationStrategy` (deduplication + summarization). `RetentionEnforcer` for per-category age-based cleanup via `RetentionRule` policies. `ArchivalStore` protocol for cold storage before deletion. `MemoryConsolidationService` orchestrates retention → consolidation → max-memories enforcement pipeline. `ConsolidationResult` tracks statistics. Config-driven via `ConsolidationConfig` + `RetentionConfig` + `ArchivalConfig`. | Prevents unbounded memory growth. Pluggable strategy enables different consolidation approaches (simple dedup now, LLM-based summarization later). Retention + archival ensures compliance with data lifecycle policies. See §7.4. | -| **State coordination** | Planned | Centralized single-writer: `TaskEngine` owns all task/project mutations via `asyncio.Queue`. Agents submit requests, engine applies `model_copy(update=...)` sequentially and publishes snapshots. `version: int` field on state models for future optimistic concurrency if multi-process scaling is needed. | Prevents lost updates by design. Trivial in single-threaded asyncio (no locks). Perfect audit trail. Industry consensus: MetaGPT, CrewAI, AutoGen all use prevention-by-design, not conflict resolution. See §6.8 State Coordination table. | -| **Workspace isolation** | Adopted | Pluggable `WorkspaceIsolationStrategy` protocol. Default: planner + git worktrees. Each agent works in an isolated worktree; sequential merge on completion. Textual conflicts detected by git; semantic conflicts reviewed by agent or human. Runtime multi-agent coordination wiring is planned. | Industry standard (Codex, Cursor, Claude Code, VS Code). Maximum parallelism. Leverages mature git infrastructure. See §6.8. | -| **Graceful shutdown** | Adopted | Pluggable `ShutdownStrategy` protocol. Default: cooperative with 30s timeout. Agents check shutdown event at turn boundaries. Force-cancel after timeout. `INTERRUPTED` status for force-cancelled tasks. Planned: upgrade to checkpoint-and-stop. | Cross-platform (Windows `signal.signal()` fallback). Bounded shutdown time. Mirrors cooperative shutdown in §6.7. | -| **Template inheritance** | Adopted | `extends` field on `CompanyTemplate` triggers parent resolution at render time. `merge.py` merges configs by field type: scalars (child wins), config dicts (deep merge), agents (by `(role, department, merge_id)` key with `_remove` support), departments (by name). `_ParentEntry` dataclass tracks merge state. `DEFAULT_MERGE_DEPARTMENT = "engineering"` shared between merge and renderer. Circular chains detected via `frozenset` tracking; max depth = 10. | Enables template composition without copy-paste. Merge-by-key preserves parent order. `_remove` directive enables clean agent removal without workarounds. | -| **Pydantic alias for YAML directives** | Adopted | `Field(alias="_remove")` in `TemplateAgentConfig` — YAML uses `_remove: true`, Python accesses `agent.remove`. Keeps the YAML-facing name (underscore prefix signals internal directive) separate from the Python attribute name. | Underscore-prefixed YAML keys signal merge directives vs regular fields. Pydantic alias bridges the naming convention gap cleanly. | -| **Communication foundation** | Adopted | `MessageBus` protocol with `InMemoryMessageBus` backend (asyncio queues, pull-model `receive()` with shutdown signaling via `asyncio.Event`). `MessageDispatcher` routes to concurrent handlers via `asyncio.TaskGroup` with pre-allocated error collection. `AgentMessenger` per-agent facade auto-fills sender/timestamp/ID; deterministic direct-channel naming `@{sorted_a}:{sorted_b}`. `DeliveryEnvelope` for delivery tracking. `NotBlankStr` validation on all protocol boundary identifiers. | Pull-model avoids callback complexity and enables agents to consume at their own pace. Protocol + backend split enables future persistent/distributed bus implementations. Deterministic DM channel names prevent duplicates. See §5. | -| **Delegation & loop prevention** | Adopted | `HierarchyResolver` resolves org hierarchy from `Company` at construction (cycle-detected, `MappingProxyType`-frozen). `AuthorityValidator` checks chain-of-command + role permissions. `DelegationGuard` orchestrates five mechanisms (ancestry, depth, dedup, rate limit, circuit breaker) in sequence, short-circuiting on first rejection. `DelegationService` is synchronous (CPU-only); messaging integration deferred. Stateful mechanisms use injectable clock for deterministic testing. Task model extended with `parent_task_id` and `delegation_chain` fields. | Synchronous delegation avoids async complexity for CPU-only validation. Five-mechanism guard provides defence-in-depth against all loop patterns. Injectable clocks enable deterministic testing. See §5.4, §5.5. | -| **Task assignment** | Adopted | `TaskAssignmentStrategy` protocol with six concrete strategies: Manual (pre-designated), RoleBased (capability scoring via `AgentTaskScorer`), LoadBalanced (workload-aware with score tiebreaker), CostOptimized (cheapest-agent with score tiebreaker), Hierarchical (subordinate delegation via `HierarchyResolver`), Auction (bid = score × availability). `TaskAssignmentService` orchestrates with status validation, structured logging, and `STRATEGY_MAP` registry (`MappingProxyType`-wrapped singletons; five strategies — Hierarchical requires `build_strategy_map(hierarchy=...)`). Inactive agents filtered during scoring. | Pluggable strategies behind a protocol mirror the execution loop and conflict resolution patterns. Reuses `AgentTaskScorer` from routing subsystem. `MappingProxyType` registry matches existing immutability conventions. See §6.4. | -| **Conflict resolution** | Adopted | `ConflictResolver` protocol with async `resolve()` + sync `build_dissent_records()` split (resolve may call LLM, dissent record is pure construction). Four strategies: `AuthorityResolver` (seniority comparison iterating all N positions, hierarchy proximity tiebreaker via `get_lowest_common_manager`), `DebateResolver` (LLM judge via `JudgeEvaluator` protocol, authority fallback when absent), `HumanEscalationResolver` (stub, returns `ESCALATED_TO_HUMAN`), `HybridResolver` (LLM review + ambiguity escalation/authority fallback). `ConflictResolutionService` follows `DelegationService` pattern (`__slots__`, keyword-only constructor, `MappingProxyType`-wrapped resolver mapping, audit trail). `DissentRecord` preserves losing agent's reasoning. `Conflict.is_cross_department` is a `@computed_field` derived from positions. `HierarchyResolver` extended with `get_lowest_common_manager()` and `get_delegation_depth()`. | Protocol + strategy pattern enables adding new resolution approaches without modifying existing code. Async resolve accommodates LLM calls; sync dissent record avoids unnecessary async overhead. Shared `find_losers` utility prevents code duplication across strategies. See §5.6. | - ---- - -## 16. Research & Prior Art - -### 16.1 Existing Frameworks Comparison - -| Framework | Stars | Architecture | Roles | Models | Memory | Custom Roles | Production Ready | -|-----------|-------|-------------|-------|--------|--------|-------------|-----------------| -| **MetaGPT** | 64.5k | SOP-driven pipeline | PM, Architect, Engineer, QA | OpenAI, Ollama, Groq, Azure | Limited | Partial | Research → MGX commercial | -| **ChatDev 2.0** | 31.2k | Zero-code visual workflows | CEO, CTO, Programmer, Tester, Designer | Multiple via config | Limited | Yes (YAML) | Improving (v2.0 Jan 2026) | -| **CrewAI** | ~50k+ | Role-based crews + flows | Fully custom | Multi-provider | Basic (crew memory) | Yes | Yes (100k+ developers) | -| **AutoGen** | ~40k+ | Conversation-driven async | Custom agents | OpenAI primary, others | Session-based | Yes | Transitioning to MS Agent Framework | -| **LangGraph** | Large | Graph-based DAG | Custom nodes | LangChain ecosystem | Stateful graphs | Yes (nodes) | Yes | -| **Smolagents** | Growing | Code-centric minimal | Code agent | HuggingFace ecosystem | Minimal | Yes | Rapid prototyping | - -### 16.2 What Exists vs What We Need - -| Feature | MetaGPT | ChatDev | CrewAI | **SynthOrg (Ours)** | -|---------|---------|---------|--------|----------------------| -| Full company simulation | Partial | Partial | No | **Yes - complete** | -| HR (hiring/firing) | No | No | No | **Yes** | -| Budget management (CFO) | No | No | No | **Yes** | -| Persistent agent memory | No | No | Basic | **Yes (Mem0 initial, custom stack future — ADR-001)** | -| Agent personalities | Basic | Basic | Basic | **Deep - traits, styles, evolution** | -| Dynamic team scaling | No | No | Manual | **Yes - auto + manual** | -| Multiple company types | No | No | Manual | **Yes - templates + builder** | -| Security ops agent | No | No | No | **Yes** | -| Configurable autonomy | No | No | Limited | **Yes - full spectrum** | -| Local + cloud providers | Partial | Partial | Partial | **Yes - unified abstraction (LiteLLM candidate)** | -| Cost tracking per agent | No | No | No | **Yes - full budget system** | -| Progressive trust | No | No | No | **Yes** | -| Performance metrics | No | No | No | **Yes** | -| MCP tool integration | No | No | Partial | **Yes** | -| A2A protocol support | No | No | No | **Planned** | -| Community marketplace | MGX (commercial) | No | No | **Planned (backlog)** | - -### 16.3 Agent Scaling Research - -[Kim et al., "Towards a Science of Scaling Agent Systems" (2025)](https://arxiv.org/abs/2512.08296) — 180 controlled experiments across 3 LLM families (OpenAI, Google, Anthropic), 4 agentic benchmarks, 5 coordination topologies. Key findings informing our design: - -- **Task decomposability is the #1 predictor** of multi-agent success. Parallelizable tasks gain up to +81%, sequential tasks degrade -39% to -70% under all MAS variants. Informs §6.9. -- **Coordination metrics suite** (efficiency, overhead, error amplification, message density, redundancy) explains 52.4% of performance variance (R²=0.524). Adopted in §10.5. -- **Tiered coordination overhead** (`O%`): optimal band 200–300%, over-coordination above 400%. Informs §10.5 interpretation of the `O%` metric. Note: the `orchestration_ratio` tiered alerts (info/warn/critical) measure a different ratio (coordination tokens / total tokens). -- **Error taxonomy** (logical contradiction, numerical drift, context omission, coordination failure) with architecture-specific patterns. Adopted as opt-in classification in §10.5. -- **Auto topology selection** achieves 87% accuracy from measurable task properties. Informs §6.9 auto topology selector. -- **Centralized verification** contains error amplification to 4.4× vs 17.2× for independent agents. Supports §6.9's centralized-topology guidance and §10.5's `Ae` metric interpretation. -- **Context:** Paper tested identical agents on individual tasks; our architecture uses role-differentiated agents in an organizational structure. Thresholds (e.g., 45% capability ceiling, 3–4 agent sweet spot) are directional — to be validated empirically in our context. - -### 16.4 Build vs Fork Decision - -**Recommendation: Build from scratch, leverage libraries.** - -Rationale: -- No existing framework covers even 50% of our requirements -- Our core differentiators (HR, budget, security ops, deep personalities, progressive trust) don't exist in any framework -- Forking MetaGPT or CrewAI would mean fighting their architecture while adding our features -- **LiteLLM**, **Litestar**, **MCP**, and **Mem0** (memory layer — ADR-001) give us battle-tested components for the hard parts -- The "company simulation" layer on top is our unique value and must be purpose-built - -What we **plan to leverage** (not fork) — subject to evaluation: -- **LiteLLM** (selected) - Provider abstraction -- **Mem0** (selected, ADR-001) - Agent memory (initial backend; custom stack future) -- **Litestar** (selected) - API layer (see §15.4 rationale) -- **MCP** - Tool integration standard (strong candidate, emerging industry standard) -- **Pydantic** (selected) - Config validation and data models -- **Web UI framework** - TBD (Vue 3, React, Svelte, HTMX all under consideration) - -> **Why Litestar over FastAPI?** Both are async-native Python frameworks with auto-generated OpenAPI docs and Pydantic support. FastAPI has a larger ecosystem and more community resources. However, Litestar provides significantly more built-in functionality that we would otherwise need to write and maintain ourselves: -> -> 1. **Channels plugin** — pub/sub WebSocket broadcasting with per-channel subscriptions, backpressure management, and subscriber backlog. FastAPI requires hand-rolling all WebSocket connection management. -> 2. **Class-based controllers** — group routes with shared guards, middleware, and configuration. Our 13 route groups map naturally to controllers. FastAPI only supports loose functions on routers. -> 3. **Native route guards** — declarative authorization at controller/route level. Essential for the approval queue and future security features. FastAPI requires `Depends()` on every route. -> 4. **Built-in middleware** — rate limiting, CSRF protection, GZip/Brotli compression, session handling, request logging. FastAPI requires third-party packages or custom code for each. -> 5. **Explicit dependency injection** — pytest-style named dependencies with scope control. Matches our testing approach. FastAPI's DI is implicit (function parameter magic). -> -> The ecosystem size gap is acceptable: our API is an internal orchestration interface, not a public web service. The bottleneck is LLM latency (seconds), not framework overhead (microseconds). Litestar's ~2x performance advantage in micro-benchmarks is a bonus, not the deciding factor. Python 3.14 is supported by both. - ---- - -## 17. Open Questions & Risks - -### 17.1 Open Questions - -| # | Question | Impact | Status | Notes | -|---|----------|--------|--------|-------| -| 1 | How deep should agent personality affect output? | Medium | Open | Too deep = inconsistent, too shallow = all agents feel the same | -| 2 | What is the optimal meeting format for multi-agent? | High | **Resolved** | Multiple configurable protocols — see §5.7 Meeting Protocol | -| 3 | How to handle context window limits for long tasks? | High | Open | Agents may lose track of complex multi-file changes | -| 4 | Should agents be able to create/modify other agents? | Medium | Open | CTO "hires" a dev by creating a new agent config | -| 5 | How to handle conflicting agent opinions? | High | **Resolved** | Multiple configurable strategies — see §5.6 Conflict Resolution Protocol | -| 6 | What metrics define "good" agent performance? | Medium | Open | Needed for HR/hiring/firing decisions | -| 7 | How to prevent agent communication loops? | High | **Resolved** | Implemented in §5.5 Loop Prevention | -| 8 | Optimal message bus for local-first architecture? | Medium | Open | asyncio queues vs Redis vs embedded broker | -| 9 | How to handle code execution safely? | High | **Resolved** | Layered sandboxing behind `SandboxBackend` protocol — see §11.1.2 Tool Sandboxing | -| 10 | What's the minimum viable meeting set? | Low | Open | Standup + planning + review as minimum? | -| 11 | What is the agent execution loop architecture? | High | **Resolved** | Multiple configurable loops — see §6.5 Agent Execution Loop | -| 12 | How should shared organizational memory work? | High | **Resolved** | Modular backends behind protocol — see §7.4 Shared Organizational Memory | -| 13 | What happens when humans don't respond to approvals? | High | **Resolved** | Configurable timeout policies with task suspension — see §12.4 Approval Timeout | -| 14 | Which memory layer library to use? | Medium | **Resolved** | Mem0 (initial) → custom stack (future) behind pluggable `MemoryBackend` protocol — see [ADR-001](docs/decisions/ADR-001-memory-layer.md) | -| 15 | How to handle agent crashes mid-task? | High | **Resolved** | Pluggable `RecoveryStrategy` protocol — see §6.6 Agent Crash Recovery | -| 16 | How to test non-deterministic agent behavior? | High | **Resolved** | Scripted providers for unit tests + behavioral assertions for integration — see §15.5 Engineering Conventions | -| 17 | How to detect orchestration overhead? | Medium | **Resolved** | Incremental LLM call analytics with proxy metrics → full categorization — see §10.5 | - -### 17.2 Technical Risks - -| Risk | Severity | Mitigation | -|------|----------|------------| -| Context window exhaustion on complex tasks | High | Memory summarization, task decomposition, working memory management | -| Cost explosion from agent loops | High | Budget hard stops, loop detection, max iterations per task | -| Agent quality degradation with cheap models | Medium | Quality gates, minimum model requirements per task type | -| Third-party library breaking changes | Medium | Pin versions, integration tests, abstraction layers | -| Memory retrieval quality | Medium | Mem0 selected as initial backend (ADR-001). Protocol layer enables backend swap if retrieval quality insufficient. Pin version, test 3.14 compat in CI | -| Agent personality inconsistency | Low | Strong system prompts, few-shot examples, personality tests | -| WebSocket scaling | Low | Start local, add Redis pub/sub when needed | - -### 17.3 Architecture Risks - -| Risk | Severity | Mitigation | -|------|----------|------------| -| Over-engineering the MVP | High | Start with minimal viable company (3-5 agents), add complexity iteratively | -| Config format becoming unwieldy | Medium | Good defaults, layered config (base + overrides), validation | -| Agent execution bottlenecks | Medium | Async execution, parallel agent processing, queue-based | -| Data loss on crash | Medium | WAL mode SQLite, `RecoveryStrategy` protocol (§6.6): fail-and-reassign implemented, checkpoint recovery planned | -| Orchestration overhead exceeds productive work | Medium | LLM call analytics (§10.5): proxy metrics implemented, call categorization + orchestration ratio alerts planned | - ---- - -## 18. Backlog & Future Vision - -### 18.1 Future Features (Not for MVP) - -| Feature | Priority | Description | -|---------|----------|-------------| -| Community marketplace | Medium | Share/download company templates, roles, workflows | -| Network hosting | Medium | Expose on LAN/internet, multi-user access | -| Agent evolution | Medium | Agents improve over time based on feedback | -| Inter-company communication | Low | Two AI companies collaborating on a project | -| Voice interface | Low | Talk to your AI company via voice | -| Mobile app | Low | Monitor your company from phone | -| Plugin system | High | Third-party plugins for new tools, roles, providers | -| Benchmarking suite | Medium | Compare company configurations on standard tasks | -| Visual workflow editor | Medium | Drag-and-drop workflow design in Web UI | -| Multi-project support | High | Company handles multiple projects simultaneously | -| Client simulation | Low | AI "clients" that give requirements and review output | -| Training mode | Medium | New agents learn from senior agents' past work | -| ~~Conflict resolution protocol~~ | ~~High~~ | ~~Moved to core — see §5.6~~ | -| Agent promotions | Medium | Junior → Mid → Senior based on performance | -| Shift system | Low | Agents "work" in shifts, different agents for different hours | -| Reporting system | Medium | Weekly/monthly automated company reports | -| Integration APIs | Medium | Connect to real Slack, GitHub, Jira, Linear | -| Self-improving company | High | The AI company developing AI company (meta!) | - -### 18.2 Scaling Path - -```text -Phase 1: Local Single-Process - └── Async runtime, embedded DB, in-memory bus, 1-10 agents - -Phase 2: Local Multi-Process - └── External message bus, production DB, sandboxed execution, 10-30 agents - -Phase 3: Network/Server - └── Full API, multi-user, distributed agents, 30-100 agents - -Phase 4: Cloud/Hosted - └── Container orchestration, horizontal scaling, marketplace, 100+ agents -``` - ---- - -## Appendix A: Industry Standards Reference - -| Standard | Owner | Purpose | Our Usage | -|----------|-------|---------|-----------| -| **MCP** (Model Context Protocol) | Anthropic → Linux Foundation (AAIF) | LLM ↔ Tool integration | Tool system backbone | -| **A2A** (Agent-to-Agent Protocol) | Google → Linux Foundation | Agent ↔ Agent communication | Future agent interop | -| **OpenAI API format** | OpenAI (de facto standard) | LLM API interface | Via provider abstraction layer (LiteLLM candidate) | - -## Appendix B: Research Sources - -- [MetaGPT](https://github.com/FoundationAgents/MetaGPT) - Multi-agent SOP framework (64.5k stars) -- [ChatDev 2.0](https://github.com/openbmb/ChatDev) - Zero-code multi-agent platform (31.2k stars) -- [CrewAI](https://github.com/crewAIInc/crewAI) - Role-based agent collaboration framework -- [AutoGen](https://github.com/microsoft/autogen) - Microsoft async multi-agent framework -- [LiteLLM](https://github.com/BerriAI/litellm) - Unified LLM API gateway (100+ providers) -- [Mem0](https://github.com/mem0ai/mem0) - Universal memory layer for AI agents -- [A2A Protocol](https://github.com/a2aproject/A2A) - Agent-to-Agent protocol (Linux Foundation) -- [MCP Specification](https://modelcontextprotocol.io/specification/2025-11-25) - Model Context Protocol -- [Langfuse Agent Comparison](https://langfuse.com/blog/2025-03-19-ai-agent-comparison) - Framework comparison -- [Confluent Event-Driven Patterns](https://www.confluent.io/blog/event-driven-multi-agent-systems/) - Multi-agent architecture patterns -- [Microsoft Multi-Agent Reference Architecture](https://microsoft.github.io/multi-agent-reference-architecture/) - Enterprise patterns -- [OpenRouter](https://openrouter.ai/) - Multi-model API gateway -- [Kim et al., "Towards a Science of Scaling Agent Systems" (2025)](https://arxiv.org/abs/2512.08296) - Empirical agent scaling research (180 experiments, 3 LLM families) -- [Cemri et al., "Multi-Agent System Failure Taxonomy (MAST)" (2025)] - MAS coordination error classification -- [Gloaguen et al., "Evaluating AGENTS.md" (2026)](https://arxiv.org/abs/2602.11988) - Context files reduce success rates; non-inferable-only principle for system prompts +The design specification has been split into focused documentation pages for better navigation and maintainability. Each page covers a cohesive domain of the framework's design. + +## Design Pages + +| Page | Sections | Description | +|------|----------|-------------| +| [Design Overview](docs/design/index.md) | Vision, Core Concepts | What SynthOrg is, design principles, glossary | +| [Agents & HR](docs/design/agents.md) | Agent System, HR | Agent identity, roles, hiring, performance, promotions | +| [Organization & Templates](docs/design/organization.md) | Company Structure, Templates | Company types, hierarchy, departments, template system | +| [Communication](docs/design/communication.md) | Communication Architecture | Message bus, delegation, conflict resolution, meetings | +| [Task & Workflow Engine](docs/design/engine.md) | Task Engine | Task lifecycle, execution loops, routing, recovery, shutdown | +| [Memory & Persistence](docs/design/memory.md) | Memory & Persistence | Memory types, backends, retrieval, operational data | +| [Operations](docs/design/operations.md) | Providers, Budget, Tools, Security, Human Interaction | Provider layer, cost management, sandboxing, security, API | + +## Supporting Pages + +| Page | Description | +|------|-------------| +| [Tech Stack](docs/architecture/tech-stack.md) | Technology choices and engineering conventions | +| [Decision Log](docs/architecture/decisions.md) | All design decisions, organized by domain | +| [Research & Prior Art](docs/reference/research.md) | Framework comparison and scaling research | +| [Industry Standards](docs/reference/standards.md) | MCP, A2A, and other standards | +| [Open Questions & Risks](docs/roadmap/open-questions.md) | Unresolved questions and risk mitigations | +| [Future Vision](docs/roadmap/future-vision.md) | Backlog features and scaling path | diff --git a/README.md b/README.md index 4bf594b118..12b77ce2ce 100644 --- a/README.md +++ b/README.md @@ -1,146 +1,132 @@ -# SynthOrg +

+ SynthOrg +

-[![CI](https://github.com/Aureliolo/synthorg/actions/workflows/ci.yml/badge.svg)](https://github.com/Aureliolo/synthorg/actions/workflows/ci.yml) +

+ A framework for building synthetic organizations — autonomous AI agents orchestrated as a virtual company. +

-A framework for building synthetic organizations — autonomous AI agents orchestrated as a virtual company. +

+ CI + Coverage + License + Python + Docs +

-## Concept +--- -SynthOrg lets you spin up a synthetic organization staffed entirely by AI agents. Each agent has a role (CEO, developer, designer, QA, etc.), a personality, persistent memory, and access to real tools. Agents collaborate through structured communication, follow workflows, and produce real artifacts - code, documents, designs, and more. +## What is SynthOrg? -## What's Built +SynthOrg lets you define agents with roles, personalities, budgets, and tools, then orchestrate them to collaborate on complex tasks as a virtual organization. Each agent has a defined role (CEO, developer, designer, QA), persistent memory, and access to real tools. Agents collaborate through structured communication, follow workflows, and produce real artifacts — code, documents, designs, and more. -### Core Framework +The framework is provider-agnostic (any LLM via LiteLLM), configuration-driven (YAML + Pydantic), and designed for the full autonomy spectrum — from locked-down human approval of every action to fully autonomous operation. -- Company config + core models — Pydantic validation, immutable config, runtime state -- Provider layer — LiteLLM-based abstraction with routing, retry, rate limiting -- Templates — built-in templates, inheritance/merge, personality presets -- Persistence — pluggable `PersistenceBackend` protocol, SQLite backend, schema migrations +## Capabilities -### Agent Engine + + + + + + + + + + + +
-- Single-agent execution — ReAct/Plan-Execute loops, fail-and-reassign recovery, graceful shutdown -- Multi-agent orchestration — message bus, delegation, loop prevention, conflict resolution, meeting protocols -- Task intelligence — decomposition, routing, assignment strategies, workspace isolation (git worktrees) -- Coordination error taxonomy — post-execution classification (contradictions, drift, omissions) +**Agent Orchestration** -### Communication +Define agents with roles, models, and tools. The engine handles task decomposition, routing, execution loops (ReAct, Plan-and-Execute), and multi-agent coordination. -- Message bus with dispatcher and channels -- Delegation with loop prevention -- Conflict resolution (4 strategies: authority+dissent, debate+judge, human escalation, hybrid) -- Meeting protocols (round-robin, position papers, structured phases) + -### Budget & Cost Management +**Budget & Cost Management** -- Cost tracking — records, summaries, coordination analytics -- Budget enforcement — pre-flight/in-flight checks, auto-downgrade, cost tiers, quota tracking -- CFO optimization — anomaly detection, efficiency analysis, downgrade recommendations, spending reports +Per-agent cost limits, auto-downgrade to cheaper models at task boundaries, spending reports, CFO-level cost optimization with anomaly detection. -### Memory + -- Pluggable `MemoryBackend` protocol — capability discovery, retrieval pipeline (ranking, formatting, filtering) -- Shared org memory — `OrgMemoryBackend` with hybrid prompt+retrieval backend -- Consolidation/archival — pluggable strategies, retention enforcement +**Security & Trust** -### Tool System +SecOps agent with fail-closed rule engine, progressive trust (4 strategies), configurable autonomy levels, audit logging, and approval timeout policies. -- Built-in tools — file system, git, code runner -- Sandboxing — subprocess (file/git) + Docker (code execution) -- MCP bridge — Model Context Protocol integration -- Permission gating — role-based access, category-level enforcement +
-### API & Human Interaction +**Memory** -- REST API — Litestar, 15 controllers (company, agents, tasks, budget, approvals, analytics, messages, meetings, projects, departments, artifacts, providers, health, auth) -- WebSocket — channel-based subscriptions, per-channel filters, message-bus bridge -- Approval queue — submit/approve/reject, status filtering, WebSocket notifications -- Route guards — role-based access control, 5 human roles +Per-agent and shared organizational memory with retrieval pipeline, non-inferable filtering, consolidation, and archival. Pluggable backends via protocol. -### Security + -- Authentication — JWT + API key, Argon2id password hashing, HMAC-SHA256 API key hashing, first-run admin setup -- SecOps agent — rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, risk classifier -- Progressive trust — 4 strategies behind `TrustStrategy` protocol -- Autonomy levels — 5 tiers, presets, resolver, change strategies -- Approval timeout policies — wait-forever/auto-deny/tiered/escalation-chain, task park/resume +**Communication** -### HR +Message bus, hierarchical delegation with loop prevention, conflict resolution (4 strategies), and meeting protocols (round-robin, position papers, structured phases). -- Hiring pipeline — request, candidate generation, approval, instantiation -- Onboarding checklists, offboarding pipeline (reassign, archive, notify, terminate) -- Agent registry -- Performance tracking — task metrics, quality scoring, collaboration scoring, trend detection -- Promotion/demotion — criteria evaluation, approval strategies, model mapping + -### Planned +**Tools & Integration** -- Memory backend adapter — Mem0 initial ([ADR-001](docs/decisions/ADR-001-memory-layer.md)); GraphRAG, Temporal KG on roadmap -- Approval workflow gates — integration with engine execution flow -- CLI surface — `cli/` package is placeholder-only -- Web dashboard — Vue 3 (planned) +Built-in tools (file system, git, sandbox, code runner) plus MCP bridge for external tools. Layered sandboxing with subprocess and Docker backends. -## Status - -Core framework complete — agent engine, multi-agent coordination, API, security, HR, memory, and budget systems are implemented. Remaining: Mem0 adapter backend, approval workflow gates, CLI, web dashboard. See [DESIGN_SPEC.md](DESIGN_SPEC.md) for the full specification. - -## Tech Stack - -- **Python 3.14+** with Litestar, Pydantic -- **uv** as package manager, **Hatchling** as build backend -- **LiteLLM** for multi-provider LLM abstraction -- **structlog** for structured logging and observability -- **Mem0** for agent memory (initial backend; custom stack future — see [ADR-001](docs/decisions/ADR-001-memory-layer.md)) -- **MCP** for tool integration -- **Vue 3** for web dashboard (planned) -- **SQLite** (aiosqlite) → PostgreSQL for operational data persistence -- **Docker** with Chainguard Python distroless runtime (CIS-hardened, non-root) -- **nginx** (unprivileged) for web UI reverse proxy +
-## System Requirements +## Quick Start -- **Python 3.14+** -- **uv** — package manager ([install](https://docs.astral.sh/uv/getting-started/installation/)) -- **Git 2.x+** — required at runtime for built-in git tools (subprocess-based, not a Python binding) -- **Docker** (optional) — required for code execution sandbox, Docker-backed tool isolation, and running the full stack via Docker Compose. Install [Docker Desktop](https://docs.docker.com/get-docker/) or Docker Engine. File system and git tools work without Docker via subprocess isolation. - -## Getting Started - -### Development (local Python) +### Development ```bash git clone https://github.com/Aureliolo/synthorg.git cd synthorg -uv sync +uv sync # install dev + test deps +uv sync --group docs # install docs toolchain (mkdocs) ``` -See [docs/getting_started.md](docs/getting_started.md) for prerequisites, IDE setup, and the full walkthrough. - -### Docker Compose (full stack) +### Docker Compose ```bash -cp docker/.env.example docker/.env # configure env vars (set LLM_API_KEY) -docker compose -f docker/compose.yml build +cp docker/.env.example docker/.env docker compose -f docker/compose.yml up -d +curl http://localhost:8000/api/v1/health # verify +docker compose -f docker/compose.yml down # stop ``` -Services (default ports, configurable via `BACKEND_PORT` / `WEB_PORT` in `docker/.env`): -- **Backend API**: `http://localhost:8000` — Litestar REST + WebSocket -- **Web Dashboard**: `http://localhost:3000` — placeholder (proxies `/api/` and `/ws` to backend) - -```bash -curl http://localhost:8000/api/v1/health # health check (default port) -docker compose -f docker/compose.yml down # stop services +## Architecture + +```mermaid +graph TB + Config[Config & Templates] --> Engine[Agent Engine] + Engine --> Core[Core Models] + Engine --> Providers[LLM Providers] + Engine --> Communication[Communication] + Engine --> Tools[Tools & MCP] + Engine --> Memory[Memory] + Engine --> Security[Security & Trust] + Engine --> Budget[Budget & Cost] + Engine --> HR[HR Engine] + API[REST & WebSocket API] --> Engine + Observability[Observability] -.-> Engine + Persistence[Persistence] -.-> HR + Persistence -.-> Security ``` -See [docker/](docker/) for Dockerfiles, compose config, and environment variable reference. - ## Documentation -- [Getting Started](docs/getting_started.md) - Setup and installation guide -- [Contributing](.github/CONTRIBUTING.md) - Branch, commit, and PR workflow -- [CLAUDE.md](CLAUDE.md) - Code conventions and AI assistant reference -- [Design Specification](DESIGN_SPEC.md) - Full high-level design +| Section | Description | +|---------|-------------| +| [Design Specification](docs/design/index.md) | Vision, agents, communication, engine, memory, operations | +| [Architecture](docs/architecture/index.md) | System overview, tech stack, decision log | +| [API Reference](docs/api/index.md) | Auto-generated from docstrings | +| [Developer Setup](docs/getting_started.md) | Clone, test, lint, contribute | +| [User Guide](docs/user_guide.md) | Install, configure, run via Docker | + +> **Contributors:** Start with the [Design Overview](docs/design/index.md) before implementing any feature — it is the mandatory starting point for architecture, data models, and behavior. [`DESIGN_SPEC.md`](DESIGN_SPEC.md) serves as a pointer to the full design set. + +## Status + +Core framework complete — agent engine, multi-agent coordination, API, security, HR, memory, and budget systems are implemented. Remaining: Mem0 adapter backend, approval workflow gates, CLI, web dashboard. See the [roadmap](docs/roadmap/index.md) for details. ## License diff --git a/docker/.env.example b/docker/.env.example index b8c38c1634..97c2567a72 100644 --- a/docker/.env.example +++ b/docker/.env.example @@ -5,10 +5,6 @@ # cp docker/.env.example docker/.env # ============================================================================= -# --- LLM Provider ----------------------------------------------------------- -# API key for the LLM provider (required for agent execution) -LLM_API_KEY= - # --- Authentication ---------------------------------------------------------- # JWT signing secret (optional — auto-generated and persisted on first run). # Set explicitly only for multi-instance deployments sharing a common secret. @@ -18,9 +14,6 @@ LLM_API_KEY= # First-run: POST /api/v1/auth/setup to create admin account # --- Application ------------------------------------------------------------- -# Log level: debug, info, warning, error, critical -AI_COMPANY_LOG_LEVEL=info - # SQLite database path (inside container: /data/synthorg.db) AI_COMPANY_DB_PATH=/data/synthorg.db diff --git a/docker/compose.override.yml b/docker/compose.override.yml index 89d555af99..bf0db96679 100644 --- a/docker/compose.override.yml +++ b/docker/compose.override.yml @@ -3,8 +3,6 @@ # docker compose -f docker/compose.yml -f docker/compose.override.yml up services: backend: - environment: - AI_COMPANY_LOG_LEVEL: "debug" # Docker socket for agent code execution sandbox. # WARNING: Mounting the Docker socket gives the container full control # over the Docker daemon. Only enable in trusted development environments. diff --git a/docs/architecture/decisions.md b/docs/architecture/decisions.md index 4e01edf2ec..c979d987b8 100644 --- a/docs/architecture/decisions.md +++ b/docs/architecture/decisions.md @@ -1,21 +1,93 @@ -# Design Decisions +--- +description: All significant design and architecture decisions, organized by domain. +--- -Architectural Decision Records (ADRs) document significant technical decisions made during development. Each ADR captures the context, options considered, and rationale for the chosen approach. +# Decision Log -## Index +All significant design and architecture decisions, organized by domain. Each entry includes the decision, rationale, and key alternatives that were considered. -| ADR | Title | Status | Date | -|-----|-------|--------|------| -| [ADR-001](../decisions/ADR-001-memory-layer.md) | Memory Layer Selection | Accepted | 2026-03-08 | -| [ADR-002](../decisions/ADR-002-design-decisions-batch-1.md) | Design Decisions Batch 1 | Accepted | 2026-03-09 | -| [ADR-003](../decisions/ADR-003-documentation-architecture.md) | Documentation & Site Architecture | Accepted | 2026-03-11 | +## Memory Layer (2026-03-08) -## ADR Format +**Decision:** Mem0 as initial memory backend behind pluggable `MemoryBackend` protocol. Custom stack (Neo4j + Qdrant external) as planned future upgrade. -Each ADR follows this structure: +**Context:** 16+ agent memory solutions evaluated. After gate checks (local-first, license, Docker, Python 3.14+, per-agent isolation), three candidates passed: Mem0, Graphiti, and Custom Stack. -- **Status**: Proposed / Accepted / Deprecated / Superseded -- **Context**: The problem or decision that needs to be made -- **Options**: Alternatives considered with trade-offs -- **Decision**: The chosen approach and rationale -- **Consequences**: What changes as a result +| Candidate | Score | Why chosen / rejected | +|-----------|-------|----------------------| +| **Mem0** (chosen) | 70/100 | Production-ready (v1.0+, 49k stars). In-process deployment (Qdrant embedded + SQLite). Python 3.14 compatible (`>=3.9,<4.0`). Async client available. Low adapter overhead (~500-1k lines). Known gap: flat fact model doesn't natively map to 5-type memory taxonomy — acceptable for initial backend | +| Custom Stack | 80/100 | Best architectural fit but ~6-8k lines of custom code before any memory works. Deferred to future phase — build after Mem0 proves the protocol shape | +| Graphiti | 66/100 | Best temporal knowledge graph, but pre-1.0 stability (v0.28), extreme LLM ingestion costs (1000+ API calls per 10k chars), only covers 2-3 of 5 memory types | + +**Eliminated:** Letta (Python `<3.14`), Cognee (Python `<3.14`), memU (AGPL-3.0), Supermemory (hosted API only), Graphlit (cloud-only). Both Letta and Cognee are on the watch list for when they add Python 3.14 support. + +**Architecture:** Mem0 runs in-process inside the synthorg-backend Docker container. Qdrant embedded for vectors, SQLite for history, both persisting to mounted volumes. Graph memory (Neo4j) is optional, enabled via config. All behind the `MemoryBackend` protocol — swap backends via config without code changes. + +## Security & Trust + +| ID | Decision | Rationale | Alternatives considered | +|----|----------|-----------|------------------------| +| D1 | StrEnum + validated registry for action types; two-level `category:action` hierarchy; static tool metadata classification | Type safety + extensibility. Category shortcuts for simple config, fine-grained control when needed. No LLM in the security classification path | Closed enum (can't extend), open strings (typos = security hazard), LLM classification (non-deterministic, catastrophic for security). Precedents: AWS IAM, K8s RBAC, GitHub scopes | +| D4 | Hybrid SecOps: rule engine fast path (~95%) + LLM slow path (~5%) | Rules catch known patterns (sub-ms, deterministic). LLM handles uncertain cases. Hard safety rules never bypass regardless of autonomy level | Pure rules (can't handle novel situations), pure LLM (0.5-8.6s latency, non-deterministic, vulnerable to injection). Precedents: AWS GuardDuty, LlamaFirewall, NeMo Guardrails — all hybrid | +| D5 | SecOps intercepts before every tool invocation via `SecurityInterceptionStrategy` protocol | Maximum coverage. Sub-ms rule check is invisible against seconds of LLM inference. Policy strictness (not interception point) varies by autonomy level | Before task step (misses per-tool threats), before task assignment only (zero runtime security), configurable per autonomy (the point doesn't change, only policy does) | +| D6 | Three-level autonomy resolution: per-agent, per-department, company default | Matches real-world IAM systems (AWS, Azure, K8s). Seniority validation prevents Juniors from getting `full` autonomy | Company-wide only (too coarse), per-department (can't distinguish junior from lead). Precedents: CrewAI per-agent attributes, AutoGen per-agent `human_input_mode` | +| D7 | Human-only promotion + automatic downgrade via `AutonomyChangeStrategy` protocol | No real-world security system auto-grants higher privileges. Automatic downgrade on errors, budget exhaustion, or security incidents | Human only (too restrictive for downgrades), CEO agent can promote (prompt injection risk → privilege escalation), fully automatic (dangerous). Precedent: Azure Conditional Access only restricts, never loosens | + +## Agent & HR + +| ID | Decision | Rationale | Alternatives considered | +|----|----------|-----------|------------------------| +| D8 | Templates + LLM for candidate generation; persist to operational store; hot-pluggable | Reuses template system for common roles, LLM for novel roles. Operational store enables rehiring and audit. Hot-plug via dedicated registry service | Templates only (can't create novel roles), LLM only (risk of invalid configs), in-memory only (lost on restart), persist to YAML (race conditions). Precedents: AutoGen hot-pluggable, Letta DB-persisted | +| D9 | Pluggable `TaskReassignmentStrategy`; initial: queue-return | Tasks return to unassigned queue. Existing `TaskRoutingService` re-routes with priority boost for reassigned tasks | Same-department/lowest-load (ignores skill match), manager decides (LLM cost, blocks on availability), HR agent decides (expensive, bottleneck) | +| D10 | Pluggable `MemoryArchivalStrategy`; initial: full snapshot, read-only | Complete preservation. Selective promotion of semantic+procedural to org memory. Enables rehiring via archive restore | Full snapshot accessible (exposes personal reasoning), selective discard (irrecoverable if classification wrong) | + +## Performance Metrics + +| ID | Decision | Rationale | Alternatives considered | +|----|----------|-----------|------------------------| +| D2 | Pluggable `QualityScoringStrategy`; initial: layered (CI signals + LLM judge + human override) | Multiple independent signals, hardest to game. Start with Layer 1 (free CI signals), add layers incrementally | Human only (doesn't scale), LLM-as-judge only (12+ known biases), CI signals only (narrow view), peer ratings (reciprocity bias). Research: LLM judges >80% human alignment but biased (CALM framework) | +| D3 | Pluggable `CollaborationScoringStrategy`; initial: automated behavioral telemetry | Objective, zero token cost. Weighted average of delegation success, response latency, conflict constructiveness, meeting contribution, loop prevention, handoff completeness | LLM evaluation (expensive, circular — LLM judging LLM), peer ratings (reciprocity/collusion), human-provided (doesn't scale) | +| D11 | Pluggable `MetricsWindowStrategy`; initial: multiple windows (7d, 30d, 90d) | Industry standard (Google SRE Workbook prescribes multi-window alerting). Handles heterogeneous metric cadences. Min 5 data points per window | Fixed 30d (too rigid), configurable per-metric (added complexity without multi-resolution benefit) | +| D12 | Pluggable `TrendDetectionStrategy`; initial: Theil-Sen regression + thresholds | 29.3% outlier breakdown (tolerates ~1 in 3 bad data points). Classifies trends as improving/stable/declining. Min 5 data points | Period-over-period (statistically weak), OLS regression (0% outlier breakdown), threshold-only (not a trend detection method). EPA recommends Theil-Sen for noisy data | + +## Promotions + +| ID | Decision | Rationale | Alternatives considered | +|----|----------|-----------|------------------------| +| D13 | Pluggable `PromotionCriteriaStrategy`; initial: configurable threshold gates (N of M) | `min_criteria_met` setting covers AND, OR, and threshold logic. Default: junior-to-mid = 2/3, mid-to-senior = all | AND only (blocks strong agents with one weak metric), OR only (trivial task spam → auto-promote). Precedents: game progression systems, HR competency matrices | +| D14 | Pluggable `PromotionApprovalStrategy`; initial: senior+ requires human approval | Low-level auto-promotes (small cost impact: small→medium ~4x). Demotions auto-apply for cost-saving, human approval for authority reduction | All human-approved (bottleneck on mass promotions), configurable per-level (extra complexity without clear benefit) | +| D15 | Pluggable `ModelMappingStrategy`; initial: default ON, opt-out | Model follows seniority. Changes at task boundaries only. Per-agent `preferred_model` overrides. Smart routing still uses cheap models for simple tasks | Always applied (budget-constrained deployments can't promote without cost increase), opt-in only (seniority feels disconnected from capability) | + +## Tools & Sandbox + +| ID | Decision | Rationale | Alternatives considered | +|----|----------|-----------|------------------------| +| D16 | Docker MVP via `aiodocker`; `SandboxBackend` protocol for future backends | Docker cold start (1-2s) invisible against LLM latency (2-30s). Pre-built image + user config. Fail if Docker unavailable — no unsafe subprocess fallback. gVisor as config-level hardening upgrade | Docker + WASM (CPython can't run pip packages in WASM), Docker + Firecracker (Linux-only, requires KVM), docker-py (sync, no 3.14 support). Precedents: E2B, OpenAI, Daytona — none offer unsandboxed fallback | +| D17 | Official `mcp` Python SDK, pinned `>=1.25,<2`; `MCPBridgeTool` adapter | Used by every major framework (LangChain, CrewAI, OpenAI Agents, Pydantic AI). Python 3.14 compatible. Pydantic 2.12.5 compatible. Thin adapter isolates codebase from SDK changes | Custom MCP client (must implement protocol handshake, track spec changes manually) | +| D18 | MCP result mapping via adapter in `MCPBridgeTool` | Keep `ToolResult` as-is. Text concatenation for LLM path. Rich content in metadata. Zero disruption to existing codebase | Extend ToolResult for multi-modal (cascading changes across codebase; LLM providers consume as text anyway) | + +## Timeout & Approval + +| ID | Decision | Rationale | Alternatives considered | +|----|----------|-----------|------------------------| +| D19 | Pluggable `RiskTierClassifier`; initial: configurable YAML mapping | Predictable, hot-reloadable. Unknown action types default to HIGH (fail-safe) | Fixed per action type (rigid), SecOps assigns at runtime (non-deterministic, expensive), default + SecOps override (premature coupling). Precedent: OPA policy-as-config | +| D20 | Pydantic JSON via `PersistenceBackend`; `ParkedContextRepository` protocol | Pydantic handles serialization, SQLite handles durability. Conversation stored verbatim — summarization is a context window concern at resume time, not a persistence concern | Pydantic only (no durability), persistence only (still needs serialization format). Precedents: Temporal, LangGraph, SpiffWorkflow all store full state | +| D21 | Tool result injection for approval resume | Approval IS the tool's return value. Satisfies LLM conversation protocol (expects tool result after tool call). Fallback: system message for engine-initiated parking | System message (not for events, agent may not notice), context metadata flag (LLM doesn't see it). Precedent: LangGraph HITL pattern | + +## Engine & Prompts + +| ID | Decision | Rationale | Alternatives considered | +|----|----------|-----------|------------------------| +| D22 | Remove tools section from system prompt | API's `tools` parameter injects richer definitions (with JSON schemas). Eliminates 200-400+ token redundancy per call. Both Anthropic and OpenAI inject tool definitions internally | Keep as-is (wastes tokens, contradicts provider best practices), replace with behavioral guidance (requires per-tool-set crafting). Evidence: arXiv 2602.11988 shows redundant context increases cost 20%+ with minimal benefit | +| D23 | Pluggable `MemoryFilterStrategy`; initial: tag-based at write time | Zero retrieval cost. Uses existing `MemoryMetadata.tags`. Non-inferable tag convention enforced at `MemoryBackend.store()` boundary | LLM classification at retrieval (2K-10K extra tokens, adds latency, recursive problem), keyword heuristic (low accuracy), documentation only (no enforcement). Evidence: arXiv 2602.11988 confirms agents store inferable content without enforcement | + +## Documentation (2026-03-11) + +**Decision:** MkDocs + Material + mkdocstrings for docs; Astro for landing page; build output embedding for Vue dashboard; single domain with CI merge. + +**Rationale:** Best-in-class tools for each job. MkDocs with Griffe AST extraction is PEP 649 safe (no runtime imports). Astro produces zero-JS landing pages. Build output embedding means the same docs HTML serves both the public site and the Vue dashboard. + +**Alternatives:** Sphinx (poor landing pages), VitePress/Docusaurus (no Python API docs), shared markdown with dual renderers (breaks mkdocstrings directives), subdomain split (higher maintenance), iframe embedding (poor UX with double scrollbars). + +## Overarching Pattern + +Nearly every decision follows the same architecture: a pluggable protocol interface with one initial implementation shipped, and alternative strategies documented for future extension. This is consistent with the project's protocol-driven design philosophy. diff --git a/docs/architecture/index.md b/docs/architecture/index.md index fb84556afb..5cf4405b67 100644 --- a/docs/architecture/index.md +++ b/docs/architecture/index.md @@ -59,6 +59,7 @@ graph TB ## Further Reading -- [Design Specification](https://github.com/Aureliolo/synthorg/blob/main/DESIGN_SPEC.md) — Full high-level spec (~3500 lines, 18 sections) -- [Design Decisions](decisions.md) — Architectural Decision Records (ADRs) +- [Design Specification](../design/index.md) — Full design spec split into 7 focused pages +- [Tech Stack](tech-stack.md) — Technology choices and engineering conventions +- [Decision Log](decisions.md) — All design decisions, organized by domain - [API Reference](../api/index.md) — Auto-generated from source code diff --git a/docs/architecture/tech-stack.md b/docs/architecture/tech-stack.md new file mode 100644 index 0000000000..4580351e7c --- /dev/null +++ b/docs/architecture/tech-stack.md @@ -0,0 +1,130 @@ +# Tech Stack + +## High-Level Architecture + +```text +┌──────────────────────────────────────────────────────────────┐ +│ SynthOrg Engine │ +│ │ +│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ +│ │ Company Mgr │ │ Agent Engine │ │ Task/Workflow Eng. │ │ +│ │ (Config, │ │ (Lifecycle, │ │ (Queue, Routing, │ │ +│ │ Templates, │ │ Personality, │ │ Dependencies, │ │ +│ │ Hierarchy) │ │ Execution) │ │ Scheduling) │ │ +│ └──────────────┘ └──────────────┘ └────────────────────┘ │ +│ │ +│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ +│ │ Comms Layer │ │ Memory Layer │ │ Tool/Capability │ │ +│ │ (Message Bus,│ │ (Pluggable, │ │ System (MCP, │ │ +│ │ Meetings, │ │ Retrieval, │ │ Sandboxing, │ │ +│ │ A2A) │ │ Archive) │ │ Permissions) │ │ +│ └──────────────┘ └──────────────┘ └────────────────────┘ │ +│ │ +│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ +│ │ Provider Lyr │ │ Budget/Cost │ │ Security/Approval │ │ +│ │ (Unified, │ │ Engine │ │ System │ │ +│ │ Routing, │ │ (Tracking, │ │ (SecOps Agent, │ │ +│ │ Fallbacks) │ │ Limits, │ │ Audit Log, │ │ +│ │ │ │ CFO Agent) │ │ Human Queue) │ │ +│ └──────────────┘ └──────────────┘ └────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────┐ │ +│ │ API Layer (Async Framework + WebSocket) │ │ +│ └────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌──────────────────────┐ ┌─────────────────────────────┐ │ +│ │ Web UI (Local) │ │ CLI Tool │ │ +│ │ Web Dashboard │ │ synthorg │ │ +│ └──────────────────────┘ └─────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────┘ +``` + +The SynthOrg engine is structured as a set of loosely coupled subsystems. Each box represents a major component that communicates through well-defined protocol interfaces. The API layer sits below the engine, exposing REST and WebSocket endpoints to the Web UI and CLI. + +--- + +## Technology Stack + +| Component | Technology | Rationale | +|-----------|-----------|-----------| +| **Language** | Python 3.14+ | Best AI/ML ecosystem; all major frameworks use it. LiteLLM, MCP, and memory layer candidates are all Python-native. PEP 649 native lazy annotations, PEP 758 except syntax. | +| **API Framework** | Litestar | Async-native with built-in channels (pub/sub WebSocket), auto OpenAPI 3.1 docs, class-based controllers, native route guards, built-in rate limiting / CSRF / compression middleware, explicit DI, Pydantic v2 support via plugin. See the [design decision](#why-litestar-over-fastapi) below. | +| **LLM Abstraction** | LiteLLM | 100+ providers, unified API, built-in cost tracking, retries/fallbacks. | +| **Agent Memory** | Mem0 (Qdrant + SQLite) initially, custom stack (Neo4j + Qdrant) planned | Mem0 runs in-process as the initial backend behind a pluggable `MemoryBackend` protocol ([Decision Log](decisions.md)). Qdrant embedded + SQLite for persistence. Custom stack as a future upgrade. Config-driven backend selection. | +| **Message Bus** | Internal (async queues), Redis planned | Start with Python asyncio queues; upgrade to Redis for multi-process/distributed deployments. | +| **Task Queue** | Internal, Celery/Redis planned | Start simple, scale with Celery when needed. | +| **Database** | SQLite (aiosqlite), PostgreSQL / MariaDB planned | Pluggable `PersistenceBackend` protocol. SQLite ships first via aiosqlite async driver. PostgreSQL and MariaDB as future backends -- swap via config, no app code changes. | +| **Web UI** | Vue 3 + Vite | Modern, fast, good ecosystem. Simpler than React for dashboards. | +| **Real-time** | WebSocket (Litestar channels plugin) | Built-in pub/sub broadcasting, per-channel history, backpressure management. Real-time agent activity, task updates, chat feed. | +| **Containerization** | Docker + Docker Compose | Chainguard Python distroless runtime (non-root, CIS Docker Benchmark v1.6.0 hardened, minimal attack surface, continuously scanned in CI). `nginxinc/nginx-unprivileged` web tier. GHCR registry, cosign image signing, Trivy + Grype vulnerability scanning, SBOM + SLSA provenance. Also used for isolated code execution sandboxing. | +| **Docker API** | aiodocker | Async-native Docker API client for the `DockerSandbox` backend. | +| **Tool Integration** | MCP SDK (`mcp`) | Industry standard for LLM-to-tool integration. See [Industry Standards](../reference/standards.md). | +| **Agent Communication** | A2A Protocol compatible | Future-proof inter-agent communication. See [Industry Standards](../reference/standards.md). | +| **Authentication** | PyJWT + argon2-cffi | JWT (HMAC HS256/384/512) for session tokens, Argon2id for password hashing, HMAC-SHA256 for API key storage (keyed with server secret). | +| **Config Format** | YAML + Pydantic validation | Human-readable config with strict validation. | +| **CLI** | TBD | Thin wrapper around the REST API for terminal use. Interactive Scalar docs at `/docs/api` and `curl`/`httpie` may suffice. | + +--- + +## Key Design Decisions + +| Decision | Choice | Alternatives Considered | Rationale | +|----------|--------|------------------------|-----------| +| Language | Python 3.14+ | TypeScript, Go, Rust | AI ecosystem; LiteLLM, MCP, and memory layer candidates are Python-native. PEP 649 lazy annotations, PEP 758 except syntax. | +| API | Litestar | FastAPI, Flask, Django, aiohttp | Built-in channels (pub/sub WebSocket), class-based controllers, native route guards, middleware (rate limiting, CSRF, compression), explicit DI. FastAPI considered but Litestar provides more batteries-included for less custom code. | +| LLM Layer | LiteLLM | Direct APIs, OpenRouter only | 100+ providers, cost tracking, fallbacks, load balancing built-in. | +| Memory | Mem0 (initial), custom stack (future) + SQLite | Graphiti, Letta, Cognee, custom | Mem0 in-process as initial backend behind a pluggable `MemoryBackend` protocol ([Decision Log](decisions.md)). Custom stack (Neo4j + Qdrant) as a future upgrade. Must support episodic, semantic, and procedural memory types. | +| Message Bus | asyncio queues, Redis planned | Kafka, RabbitMQ, NATS | Start simple; Redis is well-supported; Kafka is overkill for local deployments. | +| Config | YAML + Pydantic | JSON, TOML, Python dicts | Human-friendly, strict validation, good IDE support. | +| Web UI | Vue 3 | React, Svelte, HTMX | Simpler than React for dashboards. | +| Persistence | Pluggable protocol + repository protocols | ORM (SQLAlchemy), raw SQL, hybrid | Same frozen Pydantic models in and out (no DTOs), async throughout, backend-swappable via config. Repository protocols decouple app code from storage engine. | +| Sandboxing | Layered: subprocess + Docker | Docker-only, subprocess-only, WASM | Risk-proportionate: fast subprocess for file/git, Docker isolation for code execution. Pluggable `SandboxBackend` protocol enables K8s migration later. | +| Container Packaging | Chainguard distroless + GHCR | Alpine, Debian-slim, scratch, Docker Hub | Minimal attack surface, non-root by default, continuously scanned in CI. GHCR for tighter GitHub integration. cosign keyless signing for supply-chain integrity. Trivy + Grype dual scanning. | + + +!!! info "Design Decision: Why Litestar over FastAPI?" + + Both are async-native Python frameworks with auto-generated OpenAPI docs and Pydantic support. FastAPI has a larger ecosystem and more community resources. However, Litestar provides significantly more built-in functionality that would otherwise need to be written and maintained separately: + + 1. **Channels plugin** -- pub/sub WebSocket broadcasting with per-channel subscriptions, backpressure management, and subscriber backlog. FastAPI requires hand-rolling all WebSocket connection management. + 2. **Class-based controllers** -- group routes with shared guards, middleware, and configuration. The 13 route groups map naturally to controllers. FastAPI only supports loose functions on routers. + 3. **Native route guards** -- declarative authorization at controller/route level. Essential for the approval queue and security features. FastAPI requires `Depends()` on every route. + 4. **Built-in middleware** -- rate limiting, CSRF protection, GZip/Brotli compression, session handling, request logging. FastAPI requires third-party packages or custom code for each. + 5. **Explicit dependency injection** -- pytest-style named dependencies with scope control. Matches the project's testing approach. FastAPI's DI is implicit (function parameter magic). + + The ecosystem size gap is acceptable: the API is an internal orchestration interface, not a public web service. The bottleneck is LLM latency (seconds), not framework overhead (microseconds). Litestar's approximately 2x performance advantage in micro-benchmarks is a bonus, not the deciding factor. Python 3.14 is supported by both. + +--- + +## Engineering Conventions + +These conventions are used throughout the codebase. For full details on each, see the relevant design documentation. + +| Convention | Status | Summary | +|------------|--------|---------| +| **Immutability strategy** | Adopted | `copy.deepcopy()` at construction + `MappingProxyType` wrapping for non-Pydantic collections. `frozen=True` + boundary `deepcopy()` for Pydantic models. | +| **Config vs runtime split** | Adopted | Frozen models for config/identity; `model_copy(update=...)` for runtime state transitions (e.g., `TaskExecution`, `AgentContext`). | +| **Derived fields** | Adopted | `@computed_field` instead of stored + validated redundant fields. | +| **String validation** | Adopted | `NotBlankStr` type from `core.types` for all identifier/name fields, eliminating per-model validator boilerplate. | +| **Shared field groups** | Adopted | Common field sets extracted into base models (e.g., `_SpendingTotals`) to prevent duplication. | +| **Event constants** | Adopted | Per-domain submodules under `observability/events/`. Direct imports: `from ai_company.observability.events. import CONSTANT`. | +| **Parallel tool execution** | Adopted | `asyncio.TaskGroup` in `ToolInvoker.invoke_all` with optional `max_concurrency` semaphore and structured error collection. | +| **Parallel agent execution** | Adopted | `ParallelExecutor` with `TaskGroup` + `Semaphore` concurrency limits, `ResourceLock` for exclusive file-path claims, progress tracking, and shutdown awareness. | +| **Tool permission checking** | Adopted | Category-level gating based on `ToolAccessLevel`. Priority-based resolution: denied list, allowed list, level categories, then deny. | +| **Tool sandboxing** | Adopted | Layered: in-process path validation for file system tools, `SubprocessSandbox` for git tools, `DockerSandbox` planned for code execution. | +| **Crash recovery** | Adopted | Pluggable `RecoveryStrategy` protocol. Current: `FailAndReassignStrategy`. Planned: `CheckpointStrategy` for per-turn state persistence. | +| **Personality compatibility** | Adopted | Weighted composite scoring: 60% Big Five similarity, 20% collaboration alignment, 20% conflict approach. | +| **Agent behavior testing** | Planned | Scripted `FakeProvider` for unit tests; behavioral outcome assertions for integration tests. | +| **LLM call analytics** | Adopted | Proxy metrics (`turns_per_task`, `tokens_per_task`) and data models for call categorization, coordination metrics, and orchestration ratio. | +| **Cost tiers and quota tracking** | Adopted | Configurable `CostTierDefinition` with merge/override semantics. `QuotaTracker` enforces per-provider request/token quotas with window-based rotation. | +| **Shared org memory** | Adopted | `OrgMemoryBackend` protocol with `HybridPromptRetrievalBackend`. Seniority-based write access control. Core policies in system prompts; extended facts retrieved on demand. | +| **Memory consolidation** | Adopted | `ConsolidationStrategy` protocol with deduplication + summarization. `RetentionEnforcer` for age-based cleanup. `ArchivalStore` for cold storage. | +| **State coordination** | Planned | Centralized single-writer `TaskEngine` with `asyncio.Queue`. Agents submit requests; engine applies `model_copy(update=...)` sequentially and publishes snapshots. | +| **Workspace isolation** | Adopted | Pluggable `WorkspaceIsolationStrategy` protocol. Default: git worktrees with sequential merge on completion. | +| **Graceful shutdown** | Adopted | Pluggable `ShutdownStrategy` protocol with cooperative 30-second timeout. Force-cancel after timeout with `INTERRUPTED` status. | +| **Template inheritance** | Adopted | `extends` field triggers parent resolution at render time with deep merge by field type. Circular chain detection included. | +| **Communication foundation** | Adopted | `MessageBus` protocol with pull-model `receive()`, `MessageDispatcher` for concurrent handler routing, `AgentMessenger` per-agent facade. | +| **Delegation and loop prevention** | Adopted | `DelegationGuard` orchestrates five mechanisms (ancestry, depth, dedup, rate limit, circuit breaker) in sequence with short-circuit on first rejection. | +| **Task assignment** | Adopted | `TaskAssignmentStrategy` protocol with six strategies: Manual, RoleBased, LoadBalanced, CostOptimized, Hierarchical, and Auction. | +| **Conflict resolution** | Adopted | `ConflictResolver` protocol with four strategies: Authority, Debate, Human Escalation, and Hybrid. | +| **Pydantic alias for YAML directives** | Adopted | `Field(alias="_remove")` in `TemplateAgentConfig` -- YAML uses `_remove: true`, Python accesses `agent.remove`. Keeps YAML human-readable while avoiding leading-underscore attributes. | diff --git a/docs/decisions/ADR-001-memory-layer.md b/docs/decisions/ADR-001-memory-layer.md deleted file mode 100644 index 6f795436ff..0000000000 --- a/docs/decisions/ADR-001-memory-layer.md +++ /dev/null @@ -1,636 +0,0 @@ -# ADR-001: Memory Layer Selection - -## Status - -Accepted - -## Date - -2026-03-08 - -## Context - -The `memory/` module in `DESIGN_SPEC.md` (sections 7.1-7.4, 15.2) lists the memory -layer as "TBD — candidates: Mem0, Zep, Letta, Cognee, custom." (Note: Zep pivoted to -**Graphiti** as their open-source temporal knowledge graph offering; the standalone Zep -product is now a cloud-only service. This evaluation covers Graphiti as Zep's -successor.) This decision blocks -the memory subsystem implementation: - -- **#32** Memory interface design -- **#36** Persistence layer -- **#41** Retrieval and injection -- **#125** Shared organizational memory -- **#48** Consolidation - -### Key Architecture Constraints (from user) - -1. **Target architecture**: memory/storage runs in **separate container(s)** from the - main Python app. **MVP exception**: an in-process deployment (e.g., Mem0 inside the - `synthorg-backend` container) is acceptable as long as it preserves the same protocol - boundary and can be moved out-of-process without refactors. -2. Does NOT have to be Python — any technology, containerized -3. Main app uses a **thin async Python client** behind a **pluggable protocol**, which - must work for both in-process libraries and remote services so storage can - transparently move to separate container(s) later. -4. **Capability discovery** — protocol exposes what each backend supports -5. Multiple containers are fine (e.g., graph DB + vector store) -6. **Graph DB**: both Neo4j (server) and embedded options should be evaluated -7. **Embeddings**: implementation detail of the memory layer — just verify - configurable providers (local + cloud) - -### Requirements from Design Spec - -- **5 memory types**: Working, Episodic, Semantic, Procedural, Social (§7.2) -- **4 persistence levels**: none, session, project, full (§7.3) -- **Per-agent isolation**: namespace/tenant support -- **3 org memory backends** behind `OrgMemoryBackend` protocol (§7.4): - - Backend 1: Hybrid Prompt + Retrieval (MVP) - - Backend 2: GraphRAG Knowledge Graph (research) - - Backend 3: Temporal Knowledge Graph (research) -- **Python 3.14+** compatibility (project requirement) - ---- - -## Discovery Phase: Candidates Found - -An exhaustive search (web, GitHub, community forums, awesome-lists) identified **16+ -agent memory solutions** as of March 2026. The field is expanding rapidly — most -projects launched or matured significantly in 2025. - -### Long List - -| # | Candidate | Stars | License | Graph | Vector | Temporal | Local | -|---|-----------|-------|---------|-------|--------|----------|-------| -| 1 | Mem0 | ~49k | Apache 2.0 | Yes (optional) | Yes | Basic | Yes | -| 2 | Supermemory | ~33k | MIT (SDK only) | No | Yes | No | No (proprietary engine) | -| 3 | Graphiti (Zep) | ~23k | Apache 2.0 | Yes (primary) | Yes | Bi-temporal | Yes | -| 4 | Letta (MemGPT) | ~21.5k | Apache 2.0 | No | Yes | No | Yes | -| 5 | memU | ~10.5k | AGPL-3.0 | No | Yes | No | Yes | -| 6 | Cognee | ~8.2k | Apache 2.0 | Yes | Yes | Partial | Yes | -| 7 | MemOS | ~5.9k | Apache 2.0 | Yes | Yes | Yes | Yes | -| 8 | Memori (Gibson) | ~4.9k | Apache 2.0 | No (SQL) | No | No | Yes | -| 9 | MemMachine | ~4.6k | Apache 2.0 | No | Yes | No | Yes | -| 10 | OpenMemory | ~3.5k | MIT | Yes | Yes | Yes | Yes | -| 11 | Memary | ~2.5k | MIT | Yes | Yes | No | Partial | -| 12 | LangMem | ~1.3k | MIT | No | Yes | No | Yes | -| 13 | SimpleMem | ~1.3k | MIT | No | Yes | No | Yes | -| 14 | A-MEM | ~835 | MIT | Yes | Yes | No | Yes | -| 15 | memsearch | ~227 | MIT | No | Yes | No | Yes | -| 16 | Graphlit | N/A | Cloud-only | Yes | Yes | No | No | -| -- | Custom Stack | -- | -- | Full control | Full control | Full control | Yes | - ---- - -## Gate Check Results - -### Gate Definitions - -| Gate | Requirement | Method | -|------|-------------|--------| -| G1 | Runs fully local (no mandatory cloud) | Docs review, offline deploy test | -| G2 | License compatible with BUSL-1.1 | LICENSE file review | -| G3 | Containerizable as Docker service | Dockerfile/compose review | -| G4 | Active maintenance (release in last 6 months) | GitHub releases, commits | -| G5 | Per-agent memory isolation | API docs review | -| G6 | Configurable embedding provider (local + cloud) | Docs review | -| G7 | **Python 3.14+ compatible** | PyPI `requires-python` review | - -### Gate Results - -| Candidate | G1 | G2 | G3 | G4 | G5 | G6 | G7 | Result | -|-----------|----|----|----|----|----|----|----|----| -| **Mem0** | PASS | PASS (Apache 2.0) | PASS (in-process or 3 containers) | PASS (v1.0.5, Mar 2026) | PASS (user/agent/app/run_id) | PASS (11+ providers) | PASS (`>=3.9,<4.0`) | **PASS** | -| **Graphiti** | PASS | PASS (Apache 2.0) | PASS (compose) | PASS (v0.28.1, Feb 2026) | PASS (group_id) | PASS (4 providers) | PASS (`>=3.10,<4`) | **PASS** | -| **Letta** | PASS | PASS (Apache 2.0) | PASS | PASS (v0.16.6) | PASS (inherent) | PASS | **FAIL (`<3.14`)** | **ELIMINATED** | -| **Cognee** | PASS | PASS (Apache 2.0) | PASS (compose) | PASS (v0.5.3) | PASS (multi-tenant) | PASS (8+) | **FAIL (`<3.14`)** | **ELIMINATED** | -| **memU** | PASS | **FAIL (AGPL-3.0)** | -- | -- | -- | -- | -- | **ELIMINATED** | -| **Supermemory** | **FAIL (hosted API)** | -- | -- | -- | -- | -- | -- | **ELIMINATED** | -| **Graphlit** | **FAIL (cloud-only)** | -- | -- | -- | -- | -- | -- | **ELIMINATED** | -| **MemOS** | PASS | PASS (Apache 2.0) | PASS | PASS | Unclear | PASS | PASS (`>=3.10`) | Viable but immature | -| **Custom Stack** | PASS | PASS | PASS | PASS | PASS | PASS | PASS | **PASS** | - -### Gate Elimination Details - -- **Letta**: `requires-python = "<3.14,>=3.11"`. Conservative upper bound (no known - technical blocker), but no upstream issue/PR requesting 3.14 support. Also: Letta is - a full agent platform, NOT a standalone memory layer — the memory component cannot be - used independently. - -- **Cognee**: `requires-python = ">=3.10,<3.14"`. Conservative bound. Latest dev - release (0.5.4.dev1, 2026-03-05) still has `<3.14`. Kuzu dependency itself supports - 3.14, so the constraint is from Cognee's own build/test matrix. Also early-stage - maturity (v0.5.x). - -- **memU**: AGPL-3.0 is copyleft and incompatible with BUSL-1.1 project licensing - without careful isolation. - -- **Supermemory**: The MIT-licensed repo contains only client SDKs and a web console - — zero engine code. The actual memory engine (fact extraction, contradiction - resolution, vector search) is proprietary. "Self-hosting" exists only for enterprise - customers deploying a compiled proprietary bundle to Cloudflare Workers — not Docker. - No Dockerfile exists in the repo. - -- **Graphlit**: Cloud-native only by design. No self-hosting option. - -### Candidates Passing All Gates - -1. **Mem0** (Apache 2.0, `>=3.9,<4.0`, 49k stars) -2. **Graphiti** (Apache 2.0, `>=3.10,<4`, 23k stars) -3. **Custom Stack** (full control, all components Python 3.14 verified) - ---- - -## Scored Evaluation - -### Scoring Criteria (100 points) - -| # | Criterion | Weight | Description | -|---|-----------|--------|-------------| -| S1 | Memory type coverage | 15 | How naturally 5 types map to candidate abstractions | -| S2 | Retrieval quality reputation | 15 | Benchmarks, reviews, user reports | -| S3 | Graph/relational capability | 12 | GraphRAG, temporal KG support (§7.4) | -| S4 | Stability and maturity | 12 | Version history, production usage, breaking changes | -| S5 | Protocol compatibility | 10 | Impedance with our `@runtime_checkable` protocol pattern | -| S6 | Operational persistence overlap | 8 | Can store tasks/costs/messages too? (#36) | -| S7 | Async support | 8 | Native async client quality | -| S8 | Consolidation built-in | 5 | Memory compression/summarization (#48) | -| S9 | Community, docs, ecosystem | 5 | Stars, contributors, doc quality | -| S10 | Resource footprint | 5 | RAM/disk/CPU requirements | -| S11 | Operational complexity | 5 | Container count, config, maintenance | - -### Comparison Table - -| Criterion | Mem0 | Graphiti | Custom Stack | -|-----------|------|---------|-------------| -| **S1** Memory types (15) | **9** — episodic+semantic+procedural+short-term. No explicit social/working. Flat fact model needs wrapping | **7** — episodic+semantic+social via graph. No procedural/working | **15** — full control, maps directly to all 5 types | -| **S2** Retrieval quality (15) | **12** — +26% vs. OpenAI Memory on LOCOMO. Well benchmarked | **11** — +18.5% accuracy, 90% latency reduction. Graph traversal powerful | **10** — depends on implementation. Qdrant + Neo4j individually excellent | -| **S3** Graph capability (12) | **8** — graph is supplementary to vector. Neo4j/Kuzu/FalkorDB. Enriches results but doesn't drive retrieval | **12** — graph IS the primary store. Bi-temporal model. Neo4j/FalkorDB/Kuzu | **11** — Neo4j is best-in-class. Full Cypher. Must implement temporal tracking | -| **S4** Stability (12) | **11** — v1.0+, 49k stars, YC-backed. v1.0.0 had breaking changes but migration guide exists | **7** — pre-1.0 (v0.28), fast-moving API. Docker image freshness issues. Hallucination bugs reported | **10** — each component individually very mature. No unified project risk | -| **S5** Protocol compat (10) | **6** — factory-based, opinionated memory structure. Needs adapter layer that fights its own abstractions | **7** — GraphDriver ABC is protocol-like. Async-native. Cleaner wrapping than Mem0 | **10** — built from scratch to match our protocol pattern exactly | -| **S6** Persistence overlap (8) | **3** — memory-focused only. No tasks/costs/messages | **2** — knowledge graph only | **5** — can add SQLite/Postgres for operational data naturally | -| **S7** Async support (8) | **6** — AsyncMemory added after community request (#2495). Works but secondary path | **8** — fully async throughout. Native design | **8** — Neo4j async driver + Qdrant async client both confirmed | -| **S8** Consolidation (5) | **4** — built-in memory compression engine | **3** — community detection, entity deduplication | **1** — must implement from scratch | -| **S9** Community/docs (5) | **5** — largest community (49k stars), good docs, YC backing | **4** — 23k stars, growing fast, good docs | **3** — components have great docs but no unified project | -| **S10** Resource footprint (5) | **3** — full graph stack: 3 containers (FastAPI + PostgreSQL + Neo4j); in-process mode lighter (Qdrant embedded + SQLite) | **3** — graph DB container + heavy LLM usage during ingestion | **3** — 2 containers (Neo4j + Qdrant) + embedded FastEmbed | -| **S11** Operational complexity (5) | **3** — full graph stack: 3 containers, OpenAI defaults need reconfiguration for local; in-process mode simpler | **2** — graph DB + high LLM cost per episode ingestion (1000+ API calls per 10k chars reported) | **4** — 2 well-understood containers, standard config | -| | | | | -| **TOTAL** | **70/100** | **66/100** | **80/100** | - -### Analysis - -**Mem0 (70/100)** — Most mature and well-adopted. Best retrieval benchmarks. -However, its flat "memory as facts" model does not naturally map to our 5-type -taxonomy. Graph memory is optional and supplementary, not primary. Would need a -significant adapter layer that fights Mem0's opinionated architecture. Note: graph DB -is entirely optional (disabled by default) and supports Neo4j, FalkorDB, Memgraph, -and Kuzu as backends — Kuzu-specific concurrency bugs do not apply when using other -graph backends. - -**Graphiti (66/100)** — Best temporal knowledge graph capabilities, which maps -perfectly to §7.4 Backend 3. However, only covers 2-3 of 5 memory types (no -procedural, no working memory). Pre-1.0 stability concerns. The biggest risk is **LLM -ingestion cost** — users reported 1,000+ API requests for 10k chars and 24,000 API -calls / 41M tokens for processing a documentation set. This conflicts with our -cost-aware design principles. - -**Custom Stack (80/100)** — Best architectural fit. Perfect protocol compatibility. -Full control over 5-type memory model. Python 3.14 verified per-component. Main -trade-off: ~6,000-8,000 lines of custom code (plus ~2,500-3,000 lines of tests) and -no built-in consolidation or memory extraction. However, we need the protocol layer -regardless, and the extraction logic can leverage our existing LiteLLM provider layer. - ---- - -## Decision - -### Initial Backend: Mem0 (in-process, persistent) - -**Mem0** as the initial `MemoryBackend` implementation — get working memory fast, -build a proper custom backend later. - -| Component | Technology | License | Role | -|-----------|-----------|---------|------| -| **Memory engine** | Mem0 (`mem0ai`) | Apache 2.0 | Memory extraction, storage, retrieval | -| **Vector store** | Qdrant (embedded, in-process) | Apache 2.0 | Persists to configurable path on mounted volume | -| **History store** | SQLite (in-process) | Public domain | Memory history, persists to configurable path | -| **Embeddings** | Configurable (FastEmbed for local, LiteLLM for cloud) | Apache 2.0 / MIT | Mem0 supports 11+ embedding providers | -| **Graph memory** | Optional (Neo4j when needed) | Apache 2.0 (driver) | Enable via config when graph capabilities needed | -| **Working memory** | In-process Python | N/A | Ephemeral per-task context | - -### Why Mem0 as Initial - -1. **Production-ready now**: v1.0+, 49k stars, YC-backed. `pip install mem0ai` and go. -2. **In-process deployment**: Qdrant embedded + SQLite — runs inside the synthorg-backend - Docker container. No external services needed. Persists to mounted volumes. -3. **Python 3.14 compatible**: `>=3.9,<4.0`. -4. **Configurable everything**: embedding provider, vector store, graph store, LLM - provider, storage paths — all via config dict. -5. **Async client available**: `AsyncMemory` class with full method parity. -6. **Graph is optional**: Start without graph, enable via config flag when needed. -7. **Low adapter overhead**: Thin wrapper (~500-1k lines) behind our protocol. - -### What Mem0 Does NOT Cover (known gaps, accepted for now) - -- **5-type taxonomy**: Mem0 treats memories as flat facts. No native distinction - between episodic/semantic/procedural/social. Adapter maps memory types via metadata - tags — imperfect but functional. -- **Social memory**: No graph-native relationship modeling (unless graph is enabled). -- **Consolidation control**: Mem0 has built-in compression but limited fine-tuning. -- **Full temporal model**: Basic timestamps, not bi-temporal tracking. - -These gaps are accepted for the initial backend. The protocol architecture ensures -they can be addressed by a future custom backend without any consumer-side changes. - -### Future Backend: Custom Stack (target architecture) - -When the Mem0 adapter's limitations become blocking, build a custom backend: - -| Component | Technology | License | Role | -|-----------|-----------|---------|------| -| **Graph DB** | Neo4j CE (Docker) | GPLv3 (server) / Apache 2.0 (driver) | Semantic + social memory, org knowledge graph | -| **Vector DB** | Qdrant (Docker) | Apache 2.0 | Episodic + procedural memory, similarity retrieval | -| **Embeddings (local)** | FastEmbed | Apache 2.0 | ONNX-based, Python 3.14 ready | -| **Embeddings (cloud)** | LiteLLM (existing dep) | MIT | Route to any cloud provider | -| **Metadata** | SQLite → PostgreSQL | Public domain | Structured metadata, operational data | - -This moves storage to external containers with full 5-type coverage, bi-temporal -tracking, and graph-native social/semantic memory. Same `MemoryBackend` protocol — -swap via config. - -### Why Not Graphiti (as initial)? - -1. **Pre-1.0 stability**: v0.28, fast-moving API, Docker image freshness issues. -2. **LLM cost**: Episode ingestion is extremely LLM-heavy (1,000+ API calls per 10k - chars). Conflicts with our cost-aware design. -3. **Partial coverage**: Only 2-3 of 5 memory types. -4. **Heavier setup**: Requires external graph DB container even for basic usage. - ---- - -## Architecture - -### Initial: Mem0 In-Process - -Everything runs inside the synthorg-backend Docker container. Persistent data written to -configurable paths on mounted Docker volumes. - -```text -┌─────────────────────────────────────────────────────────────────┐ -│ synthorg-backend Docker container │ -│ │ -│ ┌───────────────────────────────────────────────────────────┐ │ -│ │ Memory Protocol Layer │ │ -│ │ │ │ -│ │ ┌─────────────────┐ ┌─────────────────────────────┐ │ │ -│ │ │ MemoryBackend │ │ MemoryCapabilities │ │ │ -│ │ │ (protocol) │ │ (capability discovery) │ │ │ -│ │ └────────┬────────┘ └─────────────────────────────┘ │ │ -│ │ │ │ │ -│ │ ┌────────┴───────────────────────────────────────────┐ │ │ -│ │ │ Mem0MemoryBackend (adapter) │ │ │ -│ │ │ │ │ │ -│ │ │ ┌─────────────┐ ┌────────────┐ ┌────────────┐ │ │ │ -│ │ │ │ Mem0 Memory │ │ Qdrant │ │ SQLite │ │ │ │ -│ │ │ │ (engine) │ │ (embedded) │ │ (history) │ │ │ │ -│ │ │ │ │ │ │ │ │ │ │ │ -│ │ │ │ extraction, │ │ vectors → │ │ history → │ │ │ │ -│ │ │ │ compression │ │ /data/mem/ │ │ /data/mem/ │ │ │ │ -│ │ │ └─────────────┘ └────────────┘ └────────────┘ │ │ │ -│ │ └────────────────────────────────────────────────────┘ │ │ -│ └───────────────────────────────────────────────────────────┘ │ -│ │ -│ Mounted volume: /data/memory/ (configurable) │ -└─────────────────────────────────────────────────────────────────┘ -``` - -### Future: Custom Stack with External Services - -When Mem0's limitations become blocking, swap to custom backend via config: - -```text -┌──────────────────────────────────────────────────────────────────┐ -│ synthorg-backend Docker container │ -│ │ -│ ┌────────────────────────────────────────────────────────────┐ │ -│ │ Memory Protocol Layer │ │ -│ │ │ │ -│ │ ┌─────────────────┐ │ │ -│ │ │ MemoryBackend │ (same protocol, different impl) │ │ -│ │ └────────┬────────┘ │ │ -│ │ │ │ │ -│ │ ┌────────┴──────────────────────────────────────────┐ │ │ -│ │ │ CustomMemoryBackend │ │ │ -│ │ │ working → in-process │ │ │ -│ │ │ episodic → Qdrant (external) │ │ │ -│ │ │ semantic → Neo4j (external) + Qdrant │ │ │ -│ │ │ procedural → Qdrant (external) │ │ │ -│ │ │ social → Neo4j (external) │ │ │ -│ │ └───────┬──────────────────────┬─────────────────────┘ │ │ -│ └──────────┼──────────────────────┼──────────────────────────┘ │ -│ │ bolt:// │ http/grpc │ -└─────────────┼──────────────────────┼─────────────────────────────┘ - │ │ -┌─────────────┴────────┐ ┌─────────┴────────┐ -│ Neo4j CE (Docker) │ │ Qdrant (Docker) │ -│ Port 7687 │ │ Port 6333/6334 │ -└───────────────────────┘ └──────────────────┘ -``` - -### Configuration - -Memory config lives in the **same config schema** as all other settings -(`RootConfig` in `config/schema.py`), following the same Pydantic validation and -YAML loading patterns. Per-agent overrides via `AgentConfig.memory` (already exists -as a raw dict field). When the dynamic config system is built, memory config -participates like every other config section. - -> **Note:** The `RootConfig.memory` field exists with `CompanyMemoryConfig` -> defaults (see `config/schema.py`). The Mem0 adapter (#41) will connect the -> config values to an actual backend instance during startup. - -```yaml -# Company-wide defaults (in RootConfig) -memory: - backend: "mem0" # mem0, custom, cognee, graphiti (future) - level: "persistent" # none, session, project, persistent - storage: - data_dir: "/data/memory" # mounted Docker volume path - vector_store: "qdrant" # qdrant (embedded), qdrant-external, etc. - history_store: "sqlite" # sqlite, postgresql - embeddings: - provider: "fastembed" # fastembed (local), openai, litellm, ollama - model: "BAAI/bge-small-en-v1.5" - graph: - enabled: false # enable graph memory (requires graph_store) - store: "neo4j" # neo4j, falkordb - uri: "bolt://neo4j:7687" - options: - retention_days: null # null = forever - max_memories_per_agent: 10000 - consolidation_interval: "daily" - -# Per-agent overrides (in AgentConfig — list of agent objects) -agents: - - name: "senior_dev" - memory: - level: "persistent" - graph: - enabled: true # this agent gets graph memory - - name: "intern" - memory: - level: "session" # this agent only keeps session memory -``` - -### Per-Agent Isolation - -Mem0 provides four-level scoping out of the box: -- `user_id` — maps to agent identity -- `agent_id` — per-agent namespace within a user -- `app_id` — multi-tenant isolation (maps to company) -- `run_id` — ephemeral session/task scope - ---- - -## Consequences - -### Impact on #32 (Memory Interface Design) - -The protocol will follow our established `@runtime_checkable` pattern: - -```python -@runtime_checkable -class MemoryBackend(Protocol): - """Structural interface for agent memory storage backends.""" - - async def connect(self) -> None: ... - async def disconnect(self) -> None: ... - async def health_check(self) -> bool: ... - - @property - def is_connected(self) -> bool: ... - @property - def backend_name(self) -> NotBlankStr: ... - - async def store(self, agent_id: NotBlankStr, request: MemoryStoreRequest) -> NotBlankStr: ... - async def retrieve(self, agent_id: NotBlankStr, query: MemoryQuery) -> tuple[MemoryEntry, ...]: ... - async def get(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> MemoryEntry | None: ... - async def delete(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> bool: ... - async def count(self, agent_id: NotBlankStr, *, category: MemoryCategory | None = None) -> int: ... - -@runtime_checkable -class MemoryCapabilities(Protocol): - """Capability discovery — what this backend supports.""" - - @property - def supported_categories(self) -> frozenset[MemoryCategory]: ... - @property - def supports_graph(self) -> bool: ... - @property - def supports_temporal(self) -> bool: ... - @property - def supports_vector_search(self) -> bool: ... - @property - def supports_shared_access(self) -> bool: ... - @property - def max_memories_per_agent(self) -> int | None: ... -``` - -Initial concrete implementation: `Mem0MemoryBackend` (wraps Mem0 `AsyncMemory`). -Future: `CustomMemoryBackend` (Neo4j + Qdrant), or `CogneeMemoryBackend`, -`GraphitiMemoryBackend` — all behind the same protocol. - -### Impact on #36 (Persistence) - -The memory layer handles **agent memory persistence** only. Operational data (tasks, -costs, messages, audit logs) remains in **SQLite** (upgrading to PostgreSQL later), -managed by separate repositories. This clean separation means: - -- `memory/` module: agent memories via Mem0 (initial) or custom backend (future) -- `budget/tracker.py`: cost records via SQLite/Postgres repository -- `engine/`: task state via SQLite/Postgres repository -- `communication/`: message history via SQLite/Postgres repository - -Memory data persists to a configurable directory on a mounted Docker volume. - -### Impact on #125 (Org Memory Backends) - -The `OrgMemoryBackend` protocol (§7.4) mapping: - -| Backend | Initial (Mem0) | Future (custom) | -|---------|---------------|-----------------| -| Backend 1: Hybrid Prompt + Retrieval (MVP) | Mem0 vector search for extended knowledge + SQLite for core policies | Same, or Qdrant external | -| Backend 2: GraphRAG Knowledge Graph | Mem0 with graph enabled (Neo4j) | Neo4j with custom entity extraction | -| Backend 3: Temporal Knowledge Graph | Not supported by Mem0 (basic timestamps only) | Neo4j with temporal properties, or Graphiti if stable | - -### Embedding Provider Strategy - -Mem0 natively supports 11+ embedding providers — configurable via memory config: - -- **Local** (default, cost-free): FastEmbed, HuggingFace, Ollama, LM Studio -- **Cloud** (higher quality): OpenAI, Azure OpenAI, Vertex AI, Gemini, Together, - AWS Bedrock -- **Abstraction**: LangChain embeddings - -Configuration determines which provider to use. Set via YAML config. - -### Graph DB Strategy - -- **Initially**: Graph disabled by default. Enable via config when needed. -- **When enabled**: Mem0 supports Neo4j (recommended), Memgraph, FalkorDB. -- **Kuzu NOT recommended**: Archived October 10, 2025. Its architectural - concurrency model (single `Database` per process with `Connection` reuse) is not - suited for Mem0's multi-threaded context. Use Neo4j or FalkorDB instead, both of - which handle concurrent access patterns out of the box. -- **Future custom backend**: Neo4j as primary graph DB, behind a `GraphDriver` - protocol for pluggability. - -### Incremental Build Path - -| Phase | What | External Containers | Notes | -|-------|------|-------------------|-------| -| **Phase 1** | Mem0 in-process (Qdrant embedded + SQLite) | None | All memory inside synthorg-backend container. Persists to mounted volume | -| **Phase 2** | Enable Mem0 graph (Neo4j) | 1 (Neo4j) | Optional, for semantic/social memory and org knowledge graph | -| **Phase 3** | Custom backend OR swap to Cognee/Graphiti | 2 (Neo4j + Qdrant) | When Mem0 limitations become blocking, or when alternatives add 3.14 support | - ---- - -## Risks and Mitigations - -| Risk | Likelihood | Impact | Mitigation | -|------|-----------|--------|------------| -| Mem0 flat fact model limits 5-type taxonomy | Certain | Low (initial) | Acceptable for initial backend. Metadata tagging provides partial typing. Custom backend replaces when needed | -| Mem0 API breaking changes | Possible | Medium | Pin `mem0ai` version. Adapter layer isolates changes from protocol consumers | -| Mem0 Python 3.14 untested (range allows it) | Possible | High | `>=3.9,<4.0` allows 3.14 but no explicit classifier. Test early in CI. Fallback: custom backend | -| Neo4j CE resource footprint (JVM, ~512 MB+ RAM) | Likely | Low | Deferred to Phase 2. Not needed initially. FalkorDB as lighter alternative | -| Kuzu ecosystem fragmentation | Likely | None | Archived Oct 2025. Not recommended. All candidates support Neo4j/FalkorDB. Not a factor | - ---- - -## Alternatives Considered - -### Custom Stack as Initial Backend - -Highest score (80/100) on architectural fit, but ~6-8k lines of custom code before -any memory works. Deferred to future phase — build after Mem0 proves the protocol -shape and reveals real-world requirements. - -### Graphiti for Temporal KG + Custom for Rest - -Appealing for §7.4 Backend 3, but: -- Pre-1.0 stability risk for a core subsystem -- Extreme LLM ingestion costs conflict with cost-aware design -- Would still need Qdrant for episodic/procedural memory -- Temporal tracking can be implemented with Neo4j temporal properties - -### Cognee (Best Backend Flexibility) - -Most flexible backend support (4+ graph DBs, 3+ vector stores), but: -- **Python `<3.14` not yet supported** — conservative upper bound, no known technical - blocker. Check future releases. -- Early-stage maturity (v0.5.x) -- Could become an alternative backend behind our protocol once 3.14 lands. - -### Letta (OS-Inspired Memory) - -Architecturally unique self-editing memory paradigm, but: -- **Python `<3.14` not yet supported** — conservative upper bound. Check future releases. -- Full agent platform, not a standalone memory layer -- Cannot use memory component independently -- Opinionated architecture conflicts with our pluggable protocol design. - ---- - -## Backend Swappability (Key Design Principle) - -The protocol-based architecture means **the memory layer decision is never final**. -Any backend that satisfies the `MemoryBackend` protocol can be added as an alternative -implementation. **Mem0 is the initial backend**, not the only one. - -Future backends can be added without modifying existing code: - -| Candidate | Trigger to Revisit | Role in the Architecture | -|-----------|-------------------|--------------------------| -| **Custom Stack** | Mem0 adapter limitations become blocking | Full 5-type coverage, bi-temporal tracking, graph-native memory | -| **Cognee** | Adds Python 3.14 support | Could provide a unified graph+vector pipeline behind the memory protocol | -| **Letta** | Adds Python 3.14 support + standalone memory extraction | Could power self-editing memory for advanced agents | -| **Graphiti** | Reaches v1.0 + reduces LLM ingestion costs | Could power §7.4 Backend 3 (temporal KG) specifically | - -Capability discovery flags (`supports_graph`, `supports_temporal`, etc.) enable -backends with different feature sets to coexist. An agent configured for graph memory -will use a backend that supports it; one that doesn't need graph memory can use a -simpler backend. - -### Watch List (check periodically) - -- [ ] **Cognee** `requires-python` — currently `<3.14`. Monitor releases for bump. -- [ ] **Letta** `requires-python` — currently `<3.14`. Monitor releases for bump. -- [ ] **Mem0** typed memory support — currently flat facts. Monitor for richer taxonomy. -- [ ] **Graphiti** v1.0 — currently v0.28. Monitor for API stabilization + cost reduction. - ---- - -## Component Version References - -| Component | Version | PyPI / Docker | Python 3.14 | -|-----------|---------|---------------|-------------| -| Neo4j CE | 5.x | `neo4j:community` (Docker) | N/A (JVM) | -| neo4j (driver) | 6.1.0 | `neo4j` (PyPI) | Confirmed (classifier) | -| Qdrant | 1.13.x | `qdrant/qdrant` (Docker) | N/A (Rust) | -| qdrant-client | 1.17.0 | `qdrant-client` (PyPI) | Confirmed (classifier) | -| FastEmbed | 0.7.4 | `fastembed` (PyPI) | Confirmed (classifier) | -| mem0ai | 1.0.5 | `mem0ai` (PyPI) | PASS (`>=3.9,<4.0`) | -| LiteLLM | (existing dep) | `litellm` (PyPI) | In use | -| SQLite | (stdlib) | Built-in | Yes | - ---- - -## Appendix: Eliminated Candidates Detail - -### Letta — NOT YET COMPATIBLE (G7: Python `<3.14`) - -- `requires-python = "<3.14,>=3.11"` — likely conservative bound, not technical -- Full agent platform, not standalone memory library -- OS-inspired memory hierarchy (core/archival/recall) is powerful but inflexible -- No graph memory capabilities -- Memory component cannot be extracted from the platform -- **Watch**: revisit when/if 3.14 support is added - -### Cognee — NOT YET COMPATIBLE (G7: Python `<3.14`) - -- `requires-python = ">=3.10,<3.14"` — likely conservative bound, not technical -- Best multi-backend flexibility (Kuzu, Neo4j, FalkorDB, LanceDB, Qdrant, etc.) -- `memify()` self-improving memory is unique -- 14 search modes including graph completion and temporal -- Would be a strong contender if Python 3.14 constraint is lifted -- **Watch**: revisit when/if 3.14 support is added — strongest alternative backend candidate. - -### memU — ELIMINATED (G2: AGPL-3.0) - -- Copyleft license incompatible with BUSL-1.1 -- Interesting hierarchical file-system memory design -- 92% LOCOMO accuracy, ~1/10 token cost - -### Supermemory — ELIMINATED (G1: Hosted API only) - -- Python SDK is just an API client for their cloud service -- Not a self-hosted memory framework -- #1 on several benchmarks but requires cloud dependency - -### Graphlit — ELIMINATED (G1: Cloud-native only) - -- No self-hosting option at all -- SDKs are MIT but the service is cloud-only - -### MemOS — DID NOT ADVANCE (Immature) - -- Passed gates but ~5.9k stars, early v2.0 -- Heavy dependency footprint (Transformers, scikit-learn) -- Unclear multi-tenancy support -- Not enough production usage data to recommend for core subsystem - -### Other Tier 2-3 candidates - -OpenMemory, Memary, A-MEM, SimpleMem, LangMem, memsearch — all interesting but -either too small/immature, research-oriented, or missing critical features (Docker -support, multi-tenancy, graph capabilities) for our requirements. diff --git a/docs/decisions/ADR-002-design-decisions-batch-1.md b/docs/decisions/ADR-002-design-decisions-batch-1.md deleted file mode 100644 index 19d1bb102f..0000000000 --- a/docs/decisions/ADR-002-design-decisions-batch-1.md +++ /dev/null @@ -1,547 +0,0 @@ -# ADR-002: Design Decisions Batch 1 (D1–D23) - -> **Status:** DECIDED (2026-03-09) -> **Generated:** 2026-03-09 by 11 parallel research agents (one per issue group, plus one cross-cutting coordinator). -> **Decided:** 2026-03-09 — all 23 decisions finalized by user. -> -> Each decision includes options, pros/cons, real-world precedents, and the chosen approach. - ---- - -## Overarching Pattern - -**Nearly every decision follows the same architecture:** a pluggable protocol interface with one initial implementation shipped, and alternative strategies documented in DESIGN_SPEC.md for future. This is consistent with the project's protocol-everywhere design philosophy. - ---- - -## Cross-Cutting Decisions (D1–D3) - -### D1: Action Type Taxonomy - -**Unblocks:** #40, #42, #126 - -**Context:** The autonomy presets reference action types informally (code_changes, tests, docs, deployment, hiring, etc.) but there's no formal enum, no definition of what each covers, and no registry. These action types are used by autonomy presets, SecOps validation, tiered timeout policies, and progressive trust. - -**Sub-question 1 (D1.1): Fixed enum vs open/extensible registry?** - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Closed enum** | Type safety, autocomplete, typos caught at compile time | Cannot extend for custom company templates; violates "Configuration over Code" principle | -| **(b) Open string** | Unlimited extensibility | Typos silently accepted — security hazard in approval system (typo = skip approval); no discoverability | -| **(c) Enum core + validated registry (CHOSEN)** | Built-in types have type safety + autocomplete; custom types supported via explicit registration; typos caught at config validation time | Slightly more complex than pure enum | - -**Precedents:** AWS IAM uses open namespaced strings (`s3:GetObject`). Kubernetes RBAC uses semi-open verbs. GitHub uses closed scopes. OPA/Rego uses open policy strings. Every production security system validates action strings against a known set. - -**Decision:** **(c) Enum core + validated registry.** StrEnum for built-in types (~25), plus an `ActionTypeRegistry` that accepts custom strings only if explicitly registered. Unknown strings rejected at config load time. Critical for security — a typo in `human_approval` list silently means "skip approval." - -**Sub-question 2 (D1.2): Granularity — two-level hierarchy?** - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Flat list (~15 types)** | Simple config | Can't distinguish file_edit from file_create (supervised preset needs this) | -| **(b) Two-level hierarchy `category:action` (CHOSEN)** | Simple config via category shortcuts (`auto_approve: ["code"]`) AND fine-grained control (`human_approval: ["code:create"]`); matches AWS/GCP pattern | Slightly more complex parsing | -| **(c) Three+ levels** | Maximum granularity | Overkill; no one gates by language or sub-sub-type | - -**Proposed taxonomy (~25 leaf types):** - -```text -code:read, code:write, code:create, code:delete, code:refactor -test:write, test:run -docs:write -vcs:commit, vcs:push, vcs:branch -deploy:staging, deploy:production -comms:internal, comms:external -budget:spend, budget:exceed -org:hire, org:fire, org:promote -db:query, db:mutate, db:admin -arch:decide -``` - -**Decision:** **(b) Two-level `category:action` hierarchy** with category shortcuts. `auto_approve: ["code"]` expands to all code:* actions. Keeps simple configs simple, power configs powerful. - -**Sub-question 3 (D1.3): Who classifies an action into a type?** - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Static tool metadata (CHOSEN)** | Deterministic, zero overhead, predictable for users; matches AWS/K8s/GCP pattern | Cannot consider arguments (writing to /deploy/ vs /src/) | -| **(b) Runtime pattern matching** | Argument-aware (file path determines type) | Complex pattern maintenance; still deterministic | -| **(c) LLM classification** | Handles novel tools | Non-deterministic — catastrophic for security; adds latency + cost; vulnerable to prompt injection | -| **(d) Static primary + optional enrichment** | Best of (a) + (b) | Slightly more complex | - -**Decision:** **(a) Static tool metadata primary**, with optional deterministic enrichment layer for advanced use. Each `BaseTool` declares its `action_type`. Default mapping from `ToolCategory` → action type. Non-tool action types (org:hire, budget:spend) triggered by engine-level operations. No LLM in the security classification path. - ---- - -### D2: Quality Scoring Mechanism - -**Unblocks:** #47, #43, #49 - -**Context:** The spec says `average_quality_score: 8.5` — "from code reviews, peer feedback" — but defines no mechanism. This score gates trust promotions (`quality_score_min: 7.0`) and hiring/firing decisions. - -| Option | Pros | Cons | Cost | -|--------|------|------|------| -| **(a) Human inputs via API** | Highest fidelity, no gaming | Doesn't scale; bottleneck for promotions | Human time | -| **(b) LLM-as-judge** | Scales to any throughput; consistent rubric; captures qualitative dimensions | 12+ known biases (verbosity, self-enhancement, position); costs tokens | ~$1-5/day | -| **(c) Automated objective signals** | Zero token cost; completely objective (test pass rate, lint, coverage) | Only works for code tasks; narrow quality view; gameable | Free | -| **(d) Peer agent ratings** | Captures collaboration dimensions | Reciprocity bias, collusion, strategic manipulation; LLMs rating LLMs is conceptually suspect | Minimal | -| **(e) Combination: objective baseline + LLM judge + human override (CHOSEN)** | Multiple independent signals; hardest to game; scales with human oversight for edge cases | Most complex to implement | ~$1-5/day | - -**Research highlights:** -- LLM judges align with human preferences >80% of time but exhibit 12+ biases (CALM framework) -- SWE-bench uses pure test-pass evaluation (option c) successfully -- LangSmith uses combination: automated LLM-as-judge + human annotation queues -- Peer ratings show severe reciprocity bias even in human systems (Caltech research) - -**Decision:** **(e) Combination** — three layers: -1. **Layer 1 (free):** Objective CI signals — test pass/fail, lint errors, coverage delta → `objective_quality` sub-score -2. **Layer 2 (~$1/day):** Small-model LLM judge (different model family than agent) evaluates output against acceptance criteria → `assessed_quality` sub-score -3. **Layer 3 (on-demand):** Human override via REST API, highest weight when present - -Start with Layer 1 only (free, sufficient for initial trust gates). Add layers incrementally. - ---- - -### D3: Collaboration Scoring Mechanism - -**Unblocks:** #47, #43, #49 - -**Context:** The spec says `collaboration_score: 7.8` — "peer ratings" — but agents don't currently have a mechanism to rate each other. - -| Option | Pros | Cons | Cost | -|--------|------|------|------| -| **(a) Automated from communication patterns (CHOSEN)** | Completely objective; zero token cost; derived from existing telemetry (delegation success, response latency, conflict outcomes, meeting participation, loop triggers) | Measures behavior, not quality; context-dependent | Free | -| **(b) LLM evaluation of message quality** | Captures nuanced helpfulness | Expensive at scale (thousands of messages); circular (LLM judging LLM communication to LLM) | High | -| **(c) Peer agent ratings** | Captures firsthand interaction quality | Same reciprocity/collusion problems as D2(d); LLMs have no genuine opinions | Minimal | -| **(d) Human-provided periodically** | Highest fidelity; cannot be gamed | Doesn't scale; too infrequent for real-time decisions | Human time | - -**Decision:** **(a) Automated behavioral telemetry** as primary signal: - -```text -collaboration_score = weighted_average( - delegation_success_rate, - delegation_response_latency, - conflict_resolution_constructiveness, - meeting_contribution_rate, - loop_prevention_score, # penalty for causing loops - handoff_completeness, -) -``` - -Weights configurable per-role. Optional: periodic LLM sampling (1% of interactions) for calibration. Human override via REST API. - ---- - -## SecOps Decisions (D4–D5) - -### D4: SecOps — LLM-based or Rule-based? - -**Unblocks:** #40 - -| Option | Pros | Cons | Latency | -|--------|------|------|---------| -| **(a) Pure rule engine** | Fast, deterministic, zero LLM cost; catches 80-90% of predictable threats (credentials, path traversal, destructive ops) | Can't handle novel situations or semantic reasoning | Sub-ms | -| **(b) Pure LLM agent** | Flexible, reasons about novel actions and intent | 0.5-8.6s per evaluation; non-deterministic; costs tokens on every action; itself vulnerable to prompt injection | 0.5-8.6s | -| **(c) Hybrid: rule engine fast path + LLM slow path (CHOSEN)** | Rules catch known patterns deterministically; LLM handles uncertain cases; rules serve as backstop if LLM fails | Two systems to maintain; handoff logic needs tuning | Sub-ms (est. 95%), 0.5-2s (est. 5%) | - -**Precedents:** AWS GuardDuty (YARA rules + ML anomaly detection), LlamaFirewall (PromptGuard + AlignmentCheck + CodeShield), Google ADK (in-tool guardrails + callback hooks), NeMo Guardrails (Colang DSL + LLM classification). **Every production security system uses a hybrid approach.** - -**Sub-decision: SecOps in "full autonomy" mode?** -- **Always run rules + audit logging** regardless of autonomy level (even root has auditd) -- LLM slow path and human escalation disabled in full mode -- Hard safety rules (credential exposure, data destruction) never bypass - -**Decision:** **(c) Hybrid.** Rule engine for known patterns (sub-ms). LLM fallback only for uncertain cases (estimated ~5% of actions). Full autonomy mode: rules + audit only, no LLM path. - ---- - -### D5: SecOps — Integration Point in Pipeline - -**Unblocks:** #40 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Before every tool invocation (CHOSEN)** | Maximum security coverage; catches per-action threats; Google ADK, LlamaFirewall, Snyk all use this | Highest number of checks (but sub-ms with rule engine) | -| **(b) Before task step execution (batch level)** | Can see tool combinations; fewer checks | Cannot stop individual tools mid-batch; misses threats within batches | -| **(c) Before task assignment only** | Minimal overhead | Zero runtime security; just access control (already have ToolPermissionChecker) | -| **(d) Configurable per autonomy level** | Maximum flexibility | The interception POINT doesn't actually change — only the POLICY strictness does | - -**Performance reality:** Our bottleneck is LLM inference (seconds). A sub-ms rule check per tool call is invisible. Even OPA sidecar evaluations are 1-5ms. Total security overhead: milliseconds against minutes of LLM time. - -**Decision:** **(a) Before every tool invocation**, with policy strictness (not interception point) configurable per autonomy level. Implement behind a pluggable `SecurityInterceptionStrategy` protocol. Slots naturally into existing `ToolInvoker` between permission check and tool execution. Add post-tool-call checking for result scanning (detect sensitive data in outputs). - ---- - -## Autonomy Decisions (D6–D7) - -### D6: Autonomy — Per-Agent or Company-Wide? - -**Unblocks:** #42 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Company-wide only** | Simplest; no misconfiguration risk | Too coarse; contradicts existing seniority system; can't give trusted CEO more freedom than new intern | -| **(b) Per-agent override with company default (CHOSEN)** | Matches every real-world IAM system (AWS, Azure, K8s); all AI frameworks use per-agent (CrewAI, AutoGen, LangGraph); aligns with existing per-agent seniority instructions and tool access | Risk of misconfiguration (mitigated by seniority-based validation rules) | -| **(c) Per-department** | Middle ground | Still can't distinguish junior from lead within same department; no real-world system stops at group-only | - -**Precedents:** CrewAI has 24 per-agent attributes. AutoGen has per-agent `human_input_mode`. LangGraph has per-node `interrupt_before`/`interrupt_after`. CSA Agentic Trust Framework requires per-agent identity and trust level. - -**Decision:** **(b) Per-agent override.** Optional `autonomy_level` on `AgentIdentity` and department config (default: None = use next level's default). Resolution: `agent.autonomy_level or department.autonomy_level or company.autonomy.level`. Add seniority-based validation (Juniors/Interns cannot be set to `full`). - ---- - -### D7: Autonomy — Who Can Change Levels at Runtime? - -**Unblocks:** #42 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Human only** | Most secure; all changes auditable; no privilege escalation risk | Potentially too restrictive (but API provides instant control) | -| **(b) Human + CEO agent** | Fits company metaphor | Severe security risk: prompt injection → CEO manipulated into escalating; cascading privilege escalation; accountability gap; unprecedented in any AI framework | -| **(c) Automatic based on conditions** | Adapts to context (error rate, budget, time) | Automatic PROMOTION is dangerous and unprecedented; automatic RESTRICTION is safe and well-precedented | -| **(a+c hybrid) Human-only promotion + automatic downgrade (CHOSEN)** | Asymmetric trust: gaining trust is hard, losing it is easy; matches Azure Conditional Access (only restricts, never loosens) | Two code paths | - -**Key insight:** No real-world security system automatically grants higher privileges. Conditional access only steps UP requirements, never DOWN. The SEAgent MAC framework explicitly prevents agents from self-modifying policies. - -**Decision:** **(a+c hybrid)** Human-only for promotion. Automatic downgrade on: high error rate → downgrade one level, budget exhausted → supervised, security incident → locked. Recovery from auto-downgrade: human-only. - ---- - -## HR Decisions (D8–D10) - -### D8: HR — Runtime Agent Instantiation - -**Unblocks:** #45 - -**Sub-decision 1 (D8.1): Source** - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Templates only** | Predictable, validated, reuses existing template system | Can't create novel roles | -| **(b) LLM-generated only** | Maximum flexibility for novel roles | Risk of invalid configs; non-deterministic | -| **(c) Both: template primary + LLM customization (CHOSEN)** | Templates for common cases; LLM customization for gaps; approval gate catches bad configs | Slightly more complex API surface | - -**Sub-decision 2 (D8.2): Persistence** - -| Option | Pros | Cons | -|--------|------|------| -| **(a) In-memory only** | Simplest | Lost on restart; can't rehire; can't audit | -| **(b) Persist to YAML** | Config is source of truth | YAML mutation at runtime is error-prone; race conditions | -| **(c) Operational store via PersistenceBackend (CHOSEN)** | Survives restart; auditable; enables rehiring; YAML stays as bootstrap seed | Need reconciliation strategy (operational store wins for runtime) | - -**Sub-decision 3 (D8.3): Hot-plug** - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Restart required** | Simplest | Unacceptable for running company; all agents stop | -| **(b) Hot-pluggable (CHOSEN)** | Matches company metaphor; enables auto-scaling; async architecture supports it | Need thread-safe registry; wire into message bus, tools, budget | - -**Precedents:** AutoGen is hot-pluggable by design (`register()` at any time). Letta persists everything to database. No serious framework requires restart for agent changes. - -**Decision:** **(c) Both sources**, **(c) operational store**, **(b) hot-pluggable**. Template-based MVP. `HiringRequest` model carries template reference + overrides or custom config. Operational store via existing `PersistenceBackend`. Hot-plug via dedicated company/registry service (not `AgentEngine`, which remains the per-agent task runner). - ---- - -### D9: HR — Task Reassignment on Offboarding - -**Unblocks:** #45 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Same-department, lowest load** | Fast, automatic, no LLM calls | Ignores skill match | -| **(b) Manager decides** | Matches real-world practice | LLM cost per task; blocks on manager availability | -| **(c) HR agent decides (LLM matching)** | Best skill-task matching | Most expensive; HR becomes bottleneck | -| **(d) Tasks return to unassigned queue** | Simplest; zero coupling; existing TaskRoutingService handles re-routing | Risk of orphaned tasks if queue processing slow | -| **(e) Configurable `TaskReassignmentStrategy` protocol (CHOSEN)** | Matches project's protocol-everywhere pattern; different strategies for different situations | More code to write | - -**Decision:** **(e) Configurable protocol** with **(d) queue-return** as default MVP. Existing `TaskRoutingService` + `AgentTaskScorer` already handle skill-based routing. Add priority boost for reassigned tasks. Manager-decides as first non-trivial strategy upgrade. - ---- - -### D10: HR — Memory Archival Semantics - -**Unblocks:** #45 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Full snapshot, accessible** | Complete preservation; enables forensic analysis | Storage grows; personal reasoning exposed to others | -| **(b) Selective: org-relevant promoted, personal discarded** | Clean; high-quality org memory | "Org-relevant" requires classification (LLM cost); irrecoverable if wrong | -| **(c) Full snapshot, read-only, restorable (CHOSEN)** | Everything preserved but frozen; enables rehiring; selective promotion as separate non-destructive step; existing `ArchivalStore` protocol supports it directly | Stores everything (but storage is cheap) | - -**Decision:** **(c) Full snapshot, read-only.** Pipeline: retrieve all → archive to `ArchivalStore` → selectively promote semantic+procedural to `OrgMemoryBackend` (rule-based auto) → clean hot store → mark TERMINATED. Rehiring = restore archived memories into new `AgentIdentity`. - ---- - -## Performance Metrics (D11–D12) - -### D11: Rolling Average Window - -**Unblocks:** #47 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Fixed 30 days** | Simplest | Too rigid for heterogeneous metrics; cost can shift in hours, quality arrives weekly | -| **(b) Configurable per metric** | Flexibility | Adds config complexity; still single-resolution per metric | -| **(c) Multiple windows: 7d, 30d, 90d (CHOSEN)** | Industry standard (Google SRE, Prometheus, Datadog); handles heterogeneous cadences; sparse data resilience (fallback to longer windows); enables multi-window alerting | 3x computation (negligible at agent scale) | - -**Key evidence:** Google SRE Workbook prescribes multi-window, multi-burn-rate alerting as "the most appropriate approach." Every major monitoring platform uses this. - -**Decision:** **(c) Multiple windows** — 7d (acute regressions), 30d (sustained patterns), 90d (baseline/drift). Minimum 5 data points per window; below that, report "insufficient data." - ---- - -### D12: Trend Detection Approach - -**Unblocks:** #47 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Period-over-period comparison** | Simplest (O(1) after averages) | Statistically weak; sensitive to window boundaries; no significance measure; high false positive rate | -| **(b) Linear regression slope** | Statistically principled; gives direction + magnitude + significance | Assumes linear trend; OLS has 0% outlier breakdown | -| **(c) Threshold-based flagging** | Filters noise into actionable categories | Not a trend detection method — only answers "crossed boundary?" not "trending?" | -| **(b+c hybrid) Theil-Sen regression + thresholds (CHOSEN)** | Theil-Sen: 29.3% outlier breakdown (tolerates ~1 in 3 bad points); thresholds filter noise into improving/stable/declining; minimum data point guard | Slightly more complex than simple comparison | - -**Key evidence:** Theil-Sen estimator is 91% as efficient as OLS on normal data but dramatically better on heavy-tailed data. EPA recommends it for environmental trend detection. Perfect for agent metrics with occasional catastrophic task failures. - -**Decision:** **(b+c hybrid)** — Theil-Sen slope per window, thresholds per metric to classify as improving/stable/declining. Minimum 5 data points per window. - ---- - -## Promotion Decisions (D13–D15) - -### D13: Promotion Criteria Logic (AND/OR) - -**Unblocks:** #49 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) All (AND)** | Strictest; prevents gaming via one strong metric | Agent with quality=9.5 but collaboration=6.9 blocked forever from junior→mid | -| **(b) Any (OR)** | Most lenient | Agent completing 100 trivial tasks auto-promotes to senior | -| **(c) Configurable per level: threshold gates (CHOSEN)** | Lower levels: lenient (2 of 3 criteria). Higher levels: strict (all). Single `ThresholdEvaluator` covers AND/OR/threshold | More configuration | - -**Precedents:** Game progression systems predominantly use threshold gates ("complete any 3 of 5 challenges"). HR competency matrices use weighted composite scoring with per-dimension minimums. - -**Decision:** **(c) Configurable per level.** `ThresholdEvaluator` with `min_criteria_met: int` + `required_criteria: list[str]`. Setting min=total gives AND. Setting min=1 gives OR. Default: junior→mid = 2 of 3; mid→senior = all. - ---- - -### D14: Promotion Approval Requirements - -**Unblocks:** #49 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) All promotions human-approved** | Safest for budget | Bottleneck; queue floods on mass promotion events | -| **(b) Only senior+ requires human (CHOSEN)** | Low levels auto-promote (small budget impact: small→medium ~4x); high levels human-gated (large budget impact: medium→large ~5-10x) | Accidental auto-promotion possible for junior/mid | -| **(c) Configurable per level** | Maximum flexibility | Extra config complexity without clear benefit over (b) | - -**Additional:** Demotions should auto-apply for cost-saving (model downgrade) but require human approval for authority-reducing demotions. - -**Decision:** **(b) Senior+ requires human.** Mirrors industry graduated-autonomy patterns (CSA, Anthropic, AWS). Junior→mid is low-risk/low-cost. The existing `standard_to_elevated` tool access invariant already establishes this pattern. - ---- - -### D15: Promotion — Seniority-to-Model Mapping - -**Unblocks:** #49 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Always applied (promotion auto-changes model)** | Simple, predictable | Budget-constrained deployments can't promote for authority without cost increase | -| **(b) Opt-in (promotion = seniority only, model unchanged)** | Budget-friendly | Seniority system feels disconnected from agent capability | -| **(c) Default ON, configurable opt-out (CHOSEN)** | Existing `SeniorityInfo.typical_model_tier` already implemented; model changes at task boundaries (consistent with auto-downgrade §10.4); per-agent overrides take priority; `smart` routing cascade still routes simple tasks to cheap models | One more config flag | - -**Current catalog mapping:** Junior→small, Mid→medium, Senior→medium, Lead+→large. The big cost jump is at Lead, not Senior — budget-conservative by design. - -**Decision:** **(c) Default ON, configurable.** `hr.promotions.model_follows_seniority: true` (default). Model changes at task boundaries only (never mid-execution). Per-agent `preferred_model` overrides seniority default. Smart routing still uses cheap models for simple tasks regardless of seniority. - ---- - -## Sandbox Decision (D16) - -### D16: Sandbox Backend Choice - -**Unblocks:** #50 - -**Main decision:** - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Docker only** | Simplest; covers all use cases; widest familiarity | 1-2s cold start (mitigatable) | -| **(b) Docker + WASM optional** | WASM gives microsecond starts | CPython-in-WASM can't run pip packages or C extensions — disqualifying | -| **(c) Docker + Firecracker optional** | Strongest isolation (hardware VM) | Linux-only (requires KVM); not available on macOS/Windows; complex setup; overkill for single-tenant | -| **(d) Docker MVP, evaluate later (CHOSEN)** | Ships minimum viable sandbox; `SandboxBackend` protocol makes adding backends trivial later; gVisor upgrade is config-level only | Defers optimization | - -**Key performance insight:** LLM calls take 2-30s. Docker cold start (1-2s, sub-second with `--network none` + warm pool) is invisible in agent execution flow. - -**Sub-decision 1: Docker image** -- **Pre-built default** (Python 3.14 + Node.js LTS + basic utils) + **user-configurable** via `docker.image` config -- Keep under 500MB; users add Go/Rust via custom images - -**Sub-decision 2: Python library** -- **aiodocker (CHOSEN)** — async-native (matches our stack), explicit Python 3.14 support, aio-libs ecosystem, sufficient API coverage -- docker-py — sync (requires `asyncio.to_thread()` wrapping), no declared 3.14 support, sluggish maintenance - -**Sub-decision 3: Docker unavailable fallback** -- **Fail with clear error (CHOSEN)** — no subprocess fallback for code execution (security anti-pattern) -- File/git tools already use SubprocessSandbox (no Docker needed) -- Industry consensus: E2B, OpenAI, Daytona — none offer unsandboxed fallback - -**Decision:** **(d) Docker MVP.** aiodocker library. Pre-built image + user config. Fail if Docker unavailable. gVisor (`--runtime=runsc`) as free config-level hardening upgrade. Firecracker belongs in future K8s path. - ---- - -## MCP Decisions (D17–D18) - -### D17: MCP SDK Choice - -**Unblocks:** #53 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Official `mcp` Python SDK (CHOSEN)** | Every major framework uses it (LangChain, CrewAI, OpenAI Agents, Pydantic AI); Python 3.14 compatible (tested, build issue resolved); Pydantic 2.12.5 compatible; all transports (stdio, Streamable HTTP); dependency overlap with Litestar stack | v2 migration upcoming (pin to `>=1.25,<2`); beta classification | -| **(b) Custom MCP client** | Zero new deps; full control | Must implement protocol handshake, capability negotiation, transport; must track spec changes manually; reinventing the wheel | - -**Transport:** Support both **stdio** (local/dev) and **Streamable HTTP** (remote/production). Skip deprecated SSE. - -**Test servers:** Everything (comprehensive reference) + Filesystem (realistic integration). - -**Decision:** **(a) Official SDK**, pinned `mcp>=1.25,<2`. Thin `MCPBridgeTool` adapter layer isolates rest of codebase from SDK API changes. - ---- - -### D18: MCP Tool Result Mapping - -**Unblocks:** #53 - -MCP `CallToolResult` has: `content: list[ContentBlock]` (text/image/audio/resource), `structuredContent: dict | None`, `isError: bool`. Our `ToolResult` has: `content: str`, `is_error: bool`. - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Extend ToolResult to support multi-modal** | Native support for images/resources | Cascading changes across entire codebase; LLM providers consume tool results as text anyway | -| **(b) Adapter in MCPBridgeTool; keep ToolResult as-is (CHOSEN)** | Zero disruption; text concatenation for LLM path; rich content stored in `ToolExecutionResult.metadata` (not `ToolResult`, which has no metadata field); MCP spec requires TextContent block alongside structured content | Non-text content requires metadata extraction | - -**Mapping:** -- Text blocks → concatenate into `content: str` -- Image/audio → `[image: {mimeType}]` placeholder in content; base64 in `metadata["attachments"]` -- `structuredContent` → `metadata["structured_content"]` -- `isError` → `is_error` (direct 1:1) -- `tool_call_id` assigned by our framework, associated back after MCP response - -**Decision:** **(b) Adapter in MCPBridgeTool.** Keep `ToolResult` as-is. Handle complexity in the bridge. Future: extend `ToolResult` with optional `attachments` when multi-modal LLM tool results are needed. - ---- - -## Timeout Decisions (D19–D21) - -### D19: Timeout — Risk Tier Classification Source - -**Unblocks:** #126 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Fixed per action type** | Simplest | Rigid; `git_push` might be low-risk for internal team but high-risk for production | -| **(b) SecOps assigns at runtime** | Context-aware | Requires SecOps running; non-deterministic; expensive; blocks timeout on SecOps | -| **(c) Configurable YAML mapping (CHOSEN)** | Follows "Configuration over Code" principle; predictable; matches spec §12.4 examples; hot-reloadable; OPA best practice | Configuration burden (mitigated by sensible defaults) | -| **(d) Default mapping + SecOps override** | Best of both | Premature coupling to SecOps; non-deterministic when SecOps active | - -**Decision:** **(c) Configurable YAML mapping.** `RiskTierMapping` config model with `dict[str, ApprovalRiskLevel]`. Sensible defaults matching spec examples. Unknown action types default to HIGH (fail-safe). Leaves door open for future SecOps override. - ---- - -### D20: Timeout — Context Serialization Format - -**Unblocks:** #126 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Pydantic `model_dump_json()` only** | Natural fit; fast (Rust-based); round-trip fidelity | No queryability; no durability beyond memory | -| **(b) Persistence backend (SQLite) only** | Queryable, durable, transactional | Still needs serialization format | -| **(c) Pydantic JSON via persistence backend (CHOSEN)** | Pydantic handles serialization fidelity; SQLite handles durability + queryability; matches Temporal, LangGraph, SpiffWorkflow patterns | Two serialization boundaries; new repository + migration needed | - -**Sub-decision: Verbatim vs summarized conversation?** -- **Verbatim (CHOSEN)** — Every major workflow engine (Temporal, LangGraph) stores full state. Summarization is a context window management concern at resume time, not a persistence concern. No information loss. - -**Decision:** **(c) Pydantic JSON via persistence backend.** `ParkedContext` model with metadata columns (execution_id, agent_id, task_id, parked_at) + `context_json` blob. `ParkedContextRepository` protocol. Conversation stored verbatim. - ---- - -### D21: Timeout — Resume Injection - -**Unblocks:** #126 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) System message injection** | Simple to implement | System messages are for instructions, not events; agent may not "notice" mid-conversation; no structured data | -| **(b) Tool result injection (CHOSEN)** | Semantically correct (approval IS the tool's return value); LLM conversation protocol requires tool result after tool call; matches LangGraph HITL pattern; structured data | Requires approval request to be modeled as tool call | -| **(c) Context metadata flag** | Clean separation | LLM doesn't see it — must still inject something into conversation; incomplete on its own | - -**Key insight:** If the agent requested approval via a tool call (`request_human_approval`), then the approval decision IS the tool's return value. The LLM expects a tool result before the next assistant turn. Injecting it as a tool result satisfies the protocol and reads naturally in the conversation. - -**Decision:** **(b) Tool result injection.** Model approval requests as tool calls. Approval decision returned as `ToolResult`. Fallback for engine-initiated parking: system message (option a) as exception path. - ---- - -## Non-Inferable Prompt Decisions (D22–D23) - -### D22: Tools Section Redundancy in System Prompt - -**Unblocks:** #188 - -| Option | Pros | Cons | -|--------|------|------| -| **(a) Remove tools section from system prompt (CHOSEN)** | Eliminates pure duplication (200-400+ tokens per call); both Anthropic and OpenAI inject tool definitions internally via `tools` parameter; the API version is RICHER (includes schemas); saves 20%+ cost per arXiv 2602.11988 | Very minor risk model benefits from seeing tools in different format | -| **(b) Keep as-is** | Belt-and-suspenders | Wastes tokens; increases context rot; contradicts provider best practices | -| **(c) Replace with behavioral guidance** | Non-redundant value ("when searching, prefer grep over reading files") | Requires per-tool-set crafting | - -**Evidence:** -- Anthropic docs: "we construct a special system prompt from the tool definitions" via `tools` parameter -- OpenAI staff: "you don't need to repeat the same information in the prompt" -- arXiv 2602.11988: redundant context increases cost 20%+ with minimal or negative success impact -- Chroma "Context Rot" research: performance degrades as input length increases, even below context limit - -**Decision:** **(a) Remove.** The tools section provides LESS information than the API injection (no schemas), making it strictly inferior. Later, consider option (c) for behavioral guidance (when to use, not what tools exist). - ---- - -### D23: Memory Filter Heuristic (Non-Inferable) - -**Unblocks:** #188 - -| Option | Pros | Cons | Cost | -|--------|------|------|------| -| **(a) LLM classification at retrieval** | Potentially highest accuracy | 2K-10K extra input tokens per retrieval; adds 0.5-2s latency; classifier itself can hallucinate; recursive problem | Very high | -| **(b) Keyword/pattern heuristic** | Zero token cost | Low accuracy; "auth module at src/auth/ uses JWT" mentions a path but the DECISION is non-inferable; brittle | Free | -| **(c) Tag-based: tagged at write time, filtered at read time (CHOSEN)** | Zero retrieval cost; infrastructure already exists (`MemoryMetadata.tags`, `MemoryQuery.tags`); write-time classification is most accurate (creator has richest context) | Depends on tagging discipline (enforceable at store boundary) | Free | -| **(d) Documentation only** | Zero implementation | Without enforcement, agents WILL store inferable content (arXiv confirms) | Free | - -**Decision:** **(c) Tag-based** with **(d) documentation** as complement. Define `non-inferable` tag convention. Enforce at `MemoryBackend.store()` boundary. System prompt instructs agents what qualifies: design rationale, team decisions, "why not X", cross-repo knowledge = non-inferable. Code structure, API signatures, file contents = inferable. Existing `MemoryMetadata.tags` and `MemoryQuery.tags` require zero new models. - ---- - -## Summary of All Decisions - -| ID | Decision | Chosen Approach | Protocol | Initial Impl | Unblocks | -|----|----------|----------------|----------|-------------|----------| -| D1 | Action type taxonomy | Enum core + validated registry; two-level `category:action`; static tool metadata | — | StrEnum + `ActionTypeRegistry` | #40, #42, #126 | -| D2 | Quality scoring | Pluggable strategy | `QualityScoringStrategy` | Layered: CI signals + LLM judge + human override (start Layer 1) | #47, #43, #49 | -| D3 | Collaboration scoring | Pluggable strategy | `CollaborationScoringStrategy` | Automated behavioral telemetry | #47, #43, #49 | -| D4 | SecOps: LLM vs rules | Hybrid: rule engine + LLM | — | Rule fast path (~95%) + LLM slow path (~5%) | #40 | -| D5 | SecOps: integration point | Pluggable + configurable | `SecurityInterceptionStrategy` | Before every tool invocation | #40 | -| D6 | Autonomy: scope | Three-level chain | — | Agent → department → company default | #42 | -| D7 | Autonomy: who changes | Pluggable strategy | `AutonomyChangeStrategy` | Human-only promotion | #42 | -| D8 | HR: instantiation | Templates + LLM; persist to DB; hot-plug | — | All three as decided | #45 | -| D9 | HR: task reassignment | Pluggable strategy | `TaskReassignmentStrategy` | Queue-return + priority boost | #45 | -| D10 | HR: memory archival | Pluggable strategy | `MemoryArchivalStrategy` | Full snapshot, read-only | #45 | -| D11 | Perf: rolling window | Pluggable strategy | `MetricsWindowStrategy` | Multiple: 7d, 30d, 90d | #47 | -| D12 | Perf: trend detection | Pluggable strategy | `TrendDetectionStrategy` | Theil-Sen + thresholds | #47 | -| D13 | Promotion: criteria logic | Pluggable strategy | `PromotionCriteriaStrategy` | Threshold gates (N of M) | #49 | -| D14 | Promotion: approval | Pluggable strategy | `PromotionApprovalStrategy` | Senior+ requires human | #49 | -| D15 | Promotion: model mapping | Pluggable strategy | `ModelMappingStrategy` | Default ON, opt-out | #49 | -| D16 | Sandbox: backend | Pluggable (existing) | `SandboxBackend` | Docker only (aiodocker) | #50 | -| D17 | MCP: SDK | Pluggable adapter layer | — | Official `mcp` SDK `>=1.25,<2` | #53 | -| D18 | MCP: result mapping | Pluggable adapter | — | `MCPBridgeTool` adapter | #53 | -| D19 | Timeout: risk tiers | Pluggable strategy | `RiskTierClassifier` | YAML mapping, unknown→HIGH | #126 | -| D20 | Timeout: serialization | Pydantic JSON via persistence | `ParkedContextRepository` | `ParkedContext` model + verbatim | #126 | -| D21 | Timeout: resume | Tool result injection | — | Approval = tool's return value | #126 | -| D22 | Non-inferable: tools | Remove tools from system prompt | — | API injects richer definitions | #188 | -| D23 | Non-inferable: memory | Pluggable strategy | `MemoryFilterStrategy` | Tag-based at write time | #188 | diff --git a/docs/decisions/ADR-003-documentation-architecture.md b/docs/decisions/ADR-003-documentation-architecture.md deleted file mode 100644 index d922ee5234..0000000000 --- a/docs/decisions/ADR-003-documentation-architecture.md +++ /dev/null @@ -1,125 +0,0 @@ -# ADR-003: Documentation & Site Architecture - -- **Status**: Accepted -- **Date**: 2026-03-11 -- **Decision Makers**: Aurelio - -## Context - -SynthOrg needs a public-facing website with: - -1. A landing page that communicates the project's value proposition -2. Auto-generated API reference documentation from Python docstrings -3. Architecture documentation and guides -4. The same docs available inside the Vue 3 web dashboard at `/docs` - -Key constraints: -- Python 3.14+ project with Google-style docstrings and Pydantic v2 models -- Docs must stay in sync with code (auto-generated API reference) -- Minimal maintenance overhead -- Custom domain (`synthorg.io`) already purchased - -## Options Considered - -### Documentation Engine - -| Option | Python API Docs | Landing Page | Vue 3 Integration | -|--------|----------------|-------------|-------------------| -| **MkDocs + Material + mkdocstrings** | Excellent (Griffe AST) | Good (custom overrides) | Good (static embed) | -| Sphinx + sphinx-immaterial | Excellent (autodoc) | Poor | Poor | -| VitePress | Poor (no Python) | Good | Excellent (native Vue) | -| Docusaurus | Poor (no Python) | Excellent | Poor (React) | -| Starlight (Astro) | Poor (no Python) | Excellent | Fair | - -### Landing Page SSG - -| Option | Interactivity | Performance | Ecosystem | -|--------|--------------|-------------|-----------| -| **Astro** | Islands architecture | Best (zero JS default) | 48k stars, active | -| Next.js | Full React | Good | Most popular | -| Plain HTML + Tailwind | Manual JS | Fastest | No framework | - -### Docs Sharing (Public Site ↔ Web Dashboard) - -| Approach | Maintenance | UX Quality | API Docs | -|----------|------------|------------|----------| -| **Build output embedding** | Minimal | Good (sub-site) | Works perfectly | -| Shared markdown + dual renderers | High | Best | Broken (mkdocstrings directives) | -| Iframe embedding | Minimal | Poor (double scrollbars) | Works | -| VitePress for both | Medium | Best | Needs griffe2md | -| Micro-frontend | High | Good | Over-engineered | - -### Domain Structure - -| Option | SEO | Maintenance | Flexibility | -|--------|-----|------------|-------------| -| **Single domain, multi-tool CI merge** | Best | Medium | Best of both tools | -| Subdomain split | Good | Higher (2 repos) | Full independence | -| Single domain, single tool | Best | Lowest | Compromised quality | - -## Decision - -1. **MkDocs + Material + mkdocstrings** for documentation - - Griffe AST-based extraction (PEP 649 safe, no runtime imports) - - Griffe Pydantic extension for model field documentation - - Google-style docstring support (native) - - Same toolchain as Pydantic, FastAPI, LiteLLM - -2. **Astro** for landing page (Concept C: Hybrid) - - Zero JS by default, islands for interactive components - - Dark-to-light-to-dark gradient, vivid violet + teal palette - - Provocative hero: "What if your company had infinite, tireless employees?" - - Scroll-synced code panel, expandable architecture (future) - -3. **Build output embedding** for docs in Vue dashboard - - `mkdocs build` → copy static HTML into Vue app's `public/docs/` - - Nginx serves at `/docs/` with location block - - Same HTML in both locations, zero rendering discrepancies - -4. **Single domain** (`synthorg.io`) with multi-tool CI merge - - Astro landing page at `/` - - MkDocs docs at `/docs/` - - Single GitHub Pages deployment - - CI merges both build outputs into one artifact - -5. **Same repo** — all in `Aureliolo/synthorg` - - `docs/` for MkDocs markdown source - - `site/` for Astro landing page source - - `mkdocs.yml` at repo root - - Docs always match code version - -## Documentation Content - -- **Getting Started / Guides**: Hand-written tutorials (deferred) -- **Python API Reference**: Auto-generated from docstrings via mkdocstrings (implemented) -- **Architecture / Design Decisions**: ADRs + architecture overview (implemented) -- **REST API Docs**: Link to running Scalar instance (Litestar built-in); static OpenAPI render deferred - -## Consequences - -### Positive - -- API docs auto-update on every push — always in sync with code -- Landing page and docs use best-in-class tools for their respective jobs -- Single domain simplifies DNS and maximizes SEO -- Python-native docs toolchain (no Node.js needed for docs) - -### Negative - -- Two build tools in CI (Python + Node.js) — slightly more complex pipeline -- Landing page changes trigger full CI (mitigated by path filters) -- MkDocs Material entering maintenance mode (mitigated by Zensical migration path) -- Docs in Vue dashboard feel like a sub-site (accepted trade-off) - -### Neutral - -- Astro landing page scaffold is minimal — content development is deferred -- MkDocs overrides directory reserved for future custom landing page within docs - -## Implementation - -- `mkdocs.yml` — MkDocs configuration -- `docs/` — documentation source (index, architecture, API reference pages) -- `site/` — Astro landing page source -- `.github/workflows/pages.yml` — multi-tool CI merge + GitHub Pages deployment -- `pyproject.toml` — `docs` dependency group added diff --git a/docs/design/agents.md b/docs/design/agents.md new file mode 100644 index 0000000000..19a4bc0f85 --- /dev/null +++ b/docs/design/agents.md @@ -0,0 +1,387 @@ +--- +title: Agents & HR +description: Agent identity system, seniority levels, role catalog, hiring, firing, performance tracking, and promotions in the SynthOrg framework. +--- + +# Agents & HR + +## Agent Identity Card + +Every agent has a comprehensive identity. At the design level, agent data splits into two +layers: + +Config (immutable) +: Identity, personality, skills, model preferences, tool permissions, and authority. + Defined at hire time, changed only by explicit reconfiguration. Represented as frozen + Pydantic models. + +Runtime state (mutable-via-copy) +: Current status, active task, conversation history, and execution metrics. Evolves during + agent operation. Represented as Pydantic models using `model_copy(update=...)` for state + transitions -- never mutated in place. + +### Personality Dimensions + +Personality is split into two tiers: + +=== "Big Five (OCEAN-variant)" + + Float values (0.0--1.0) used for **internal compatibility scoring only** (not injected + into prompts). `stress_response` replaces traditional neuroticism with inverted polarity + (1.0 = very calm). Scored by `core/personality.py`. + + | Dimension | Range | Description | + |-----------|-------|-------------| + | `openness` | 0.0--1.0 | Curiosity, creativity | + | `conscientiousness` | 0.0--1.0 | Thoroughness, reliability | + | `extraversion` | 0.0--1.0 | Assertiveness, sociability | + | `agreeableness` | 0.0--1.0 | Cooperation, empathy | + | `stress_response` | 0.0--1.0 | Emotional stability (1.0 = very calm) | + + **Compatibility scoring** (weighted composite, result clamped to [0, 1]): + + - **60% Big Five similarity:** `openness`, `conscientiousness`, `agreeableness`, + `stress_response` use `1 - |diff|`; `extraversion` uses a tent-function peaking + at 0.3 diff (complementary extraverts collaborate better than identical ones) + - **20% Collaboration alignment:** ordinal adjacency + (`INDEPENDENT` ↔ `PAIR` ↔ `TEAM`); scored 1.0 for same, 0.5 for adjacent, 0.0 + for opposite + - **20% Conflict approach:** constructive pairs score 1.0, destructive pairs 0.2, + mixed 0.4--0.6. Uses `itertools.combinations` for team-level averaging + +=== "Behavioral Enums" + + Injected into system prompts as natural-language labels that LLMs respond to: + + | Enum | Values | + |------|--------| + | `DecisionMakingStyle` | `analytical`, `intuitive`, `consultative`, `directive` | + | `CollaborationPreference` | `independent`, `pair`, `team` | + | `CommunicationVerbosity` | `terse`, `balanced`, `verbose` | + | `ConflictApproach` | `avoid`, `accommodate`, `compete`, `compromise`, `collaborate` (Thomas-Kilmann model) | + +### Agent Configuration Example + +???+ example "Full agent identity YAML" + + ```yaml + # --- Config layer — AgentIdentity (frozen) --- + agent: + id: "uuid" + name: "Sarah Chen" + role: "Senior Backend Developer" + department: "Engineering" + level: "Senior" + personality: + traits: + - analytical + - detail-oriented + - pragmatic + communication_style: "concise and technical" + risk_tolerance: "low" + creativity: "medium" + description: > + Sarah is a methodical backend developer who prioritizes clean + architecture and thorough testing. She pushes back on shortcuts + and advocates for proper error handling. Prefers Pythonic solutions. + # Big Five (OCEAN-variant) — internal scoring (0.0-1.0) + openness: 0.4 + conscientiousness: 0.9 + extraversion: 0.3 + agreeableness: 0.5 + stress_response: 0.75 + # Behavioral enums — injected into system prompts + decision_making: "analytical" + collaboration: "independent" + verbosity: "balanced" + conflict_approach: "compromise" + skills: + primary: + - python + - litestar + - postgresql + - system-design + secondary: + - docker + - redis + - testing + model: + provider: "example-provider" + model_id: "example-medium-001" + temperature: 0.3 + max_tokens: 8192 + fallback_model: "openrouter/example-medium-001" + memory: + type: "persistent" # persistent, project, session, none + retention_days: null # null = forever + tools: + access_level: "standard" # sandboxed | restricted | standard | elevated | custom + allowed: + - file_system + - git + - code_execution + - web_search + - terminal + denied: + - deployment + - database_admin + authority: + can_approve: ["junior_dev_tasks", "code_reviews"] + reports_to: "engineering_lead" + can_delegate_to: ["junior_developers"] + budget_limit: 5.00 + autonomy_level: null # full, semi, supervised, locked (overrides defaults) + hiring_date: "2026-02-27" + status: "active" # active, on_leave, terminated + ``` + +### Runtime State + +The runtime state layer (in `engine/`) tracks execution progress using frozen models +with `model_copy`: + +- **TaskExecution** wraps a Task with evolving execution state: status transitions, + accumulated cost (`TokenUsage`), turn count, and timestamps. +- **AgentContext** wraps `AgentIdentity` + `TaskExecution` with a unique execution ID, + conversation history, cost accumulation, turn limits, and timing. + +--- + +## Seniority & Authority Levels + +| Level | Authority | Typical Model | Cost Tier | +|-------|----------|---------------|-----------| +| Intern/Junior | Execute assigned tasks only | small / local | $ | +| Mid | Execute + suggest improvements | medium / local | $$ | +| Senior | Execute + design + review others | medium / large | $$$ | +| Lead | All above + approve + delegate | large / medium | $$$ | +| Principal/Staff | All above + architectural decisions | large | $$$$ | +| Director | Strategic decisions + budget authority | large | $$$$ | +| VP | Department-wide authority | large | $$$$ | +| C-Suite (CEO/CTO/CFO) | Company-wide authority + final approvals | large | $$$$ | + +--- + +## Role Catalog + +The role catalog is extensible -- users can add [custom roles](#dynamic-roles) via config. +The built-in catalog covers common organizational roles: + +=== "C-Suite / Executive" + + - **CEO** -- Overall strategy, final decision authority, cross-department coordination + - **CTO** -- Technical vision, architecture decisions, technology choices + - **CFO** -- Budget management, cost optimization, resource allocation + - **COO** -- Operations, process optimization, workflow management + - **CPO** -- Product strategy, roadmap, feature prioritization + +=== "Product & Design" + + - **Product Manager** -- Requirements, user stories, prioritization, stakeholder communication + - **UX Designer** -- User research, wireframes, user flows, usability + - **UI Designer** -- Visual design, component design, design systems + - **UX Researcher** -- User interviews, analytics, A/B test design + - **Technical Writer** -- Documentation, API docs, user guides + +=== "Engineering" + + - **Software Architect** -- System design, technology decisions, patterns + - **Frontend Developer** (Junior/Mid/Senior) -- UI implementation, components, state management + - **Backend Developer** (Junior/Mid/Senior) -- APIs, business logic, databases + - **Full-Stack Developer** (Junior/Mid/Senior) -- End-to-end implementation + - **DevOps/SRE Engineer** -- Infrastructure, CI/CD, monitoring, deployment + - **Database Engineer** -- Schema design, query optimization, migrations + - **Security Engineer** -- Security audits, vulnerability assessment, secure coding + +=== "Quality Assurance" + + - **QA Lead** -- Test strategy, quality gates, release readiness + - **QA Engineer** -- Test plans, manual testing, bug reporting + - **Automation Engineer** -- Test frameworks, CI integration, E2E tests + - **Performance Engineer** -- Load testing, profiling, optimization + +=== "Data & Analytics" + + - **Data Analyst** -- Metrics, dashboards, business intelligence + - **Data Engineer** -- Pipelines, ETL, data infrastructure + - **ML Engineer** -- Model training, inference, MLOps + +=== "Operations & Support" + + - **Project Manager** -- Timelines, dependencies, risk management, status tracking + - **Scrum Master** -- Agile ceremonies, impediment removal, team health + - **HR Manager** -- Hiring recommendations, team composition, performance tracking + - **Security Operations** -- Request validation, safety checks, approval workflows + +=== "Creative & Marketing" + + - **Content Writer** -- Blog posts, marketing copy, social media + - **Brand Strategist** -- Messaging, positioning, competitive analysis + - **Growth Marketer** -- Campaigns, analytics, conversion optimization + +--- + +## Dynamic Roles + +Users can define custom roles via config: + +```yaml +custom_roles: + - name: "Blockchain Developer" + department: "Engineering" + skills: ["solidity", "web3", "smart-contracts"] + system_prompt_template: "blockchain_dev.md" + authority_level: "senior" + suggested_model: "large" +``` + +--- + +## Hiring Process + +The HR system manages the agent workforce dynamically: + +1. HR agent (or human) identifies a skill gap or workload issue +2. HR generates **candidate cards** based on team needs: + - What skills are underrepresented? + - What seniority level is needed? + - What personality would complement the team? + - What model/provider fits the budget? +3. Candidate cards are presented for approval (to CEO or human) +4. Approved candidates are instantiated and onboarded +5. Onboarding includes: company context, project briefing, team introductions + +!!! info "Design decisions ([Decision Log](../architecture/decisions.md) D8)" + + - **D8.1 -- Source:** Templates + LLM customization. Templates for common roles + (reuses existing [template system](organization.md#template-system)). LLM generates + config for novel roles not covered by templates. Approval gate catches invalid/bad + configs before instantiation. + - **D8.2 -- Persistence:** Operational store via `PersistenceBackend`. YAML stays as + bootstrap seed -- operational store wins for runtime state. Enables rehiring and + auditable history. + - **D8.3 -- Hot-plug:** Agents are hot-pluggable at runtime via a dedicated + company/registry service (not `AgentEngine`, which remains the per-agent task runner). + Thread-safe registry, wired into message bus + tools + budget. + +--- + +## Firing / Offboarding + +Offboarding is triggered by: budget cuts, poor performance metrics, project completion, or +human decision. + +1. Agent's memory is archived (not deleted) +2. Active tasks are reassigned +3. Team is notified + +!!! info "Design decisions ([Decision Log](../architecture/decisions.md) D9, D10)" + + - **D9 -- Task Reassignment:** Pluggable `TaskReassignmentStrategy` protocol. Initial + strategy: queue-return -- tasks return to unassigned queue, existing `TaskRoutingService` + re-routes with priority boost for reassigned tasks. Future strategies: + same-department/lowest-load, manager-decides (LLM), HR agent decides. + - **D10 -- Memory Archival:** Pluggable `MemoryArchivalStrategy` protocol. Initial + strategy: full snapshot, read-only. Pipeline: retrieve all memories, archive to + `ArchivalStore`, selectively promote semantic+procedural memories to + `OrgMemoryBackend` (rule-based), clean hot store, mark agent TERMINATED. Rehiring + restores archived memories into a new `AgentIdentity`. Future strategies: selective + discard, full-accessible. + +--- + +## Performance Tracking + +The framework tracks detailed per-agent metrics: + +```yaml +agent_metrics: + tasks_completed: 42 + tasks_failed: 2 + average_quality_score: 8.5 # from code reviews, peer feedback + average_cost_per_task: 0.45 + average_completion_time: "2h" + collaboration_score: 7.8 # peer ratings + last_review_date: "2026-02-20" +``` + +???+ note "Design decisions ([Decision Log](../architecture/decisions.md) D2, D3, D11, D12)" + + **D2 -- Quality Scoring:** Pluggable `QualityScoringStrategy` protocol. Initial + strategy: layered combination -- + + 1. **FREE:** Objective CI signals (test pass/fail, lint, coverage delta) + 2. **~$1/day:** Small-model LLM judge (different family than agent) evaluates output + vs acceptance criteria + 3. **On-demand:** Human override via API, highest weight + + Start with Layer 1 only; add layers incrementally. Future strategies: CI-only, + LLM-only, human-only. + + --- + + **D3 -- Collaboration Scoring:** Pluggable `CollaborationScoringStrategy` protocol. + Initial strategy: automated behavioral telemetry -- + + ``` + collaboration_score = weighted_average( + delegation_success_rate, + delegation_response_latency, + conflict_resolution_constructiveness, + meeting_contribution_rate, + loop_prevention_score, + handoff_completeness + ) + ``` + + Weights are configurable per-role. Optional: periodic LLM sampling (1%) for + calibration + human override via API. Future strategies: LLM evaluation, peer + ratings, human-provided. + + --- + + **D11 -- Rolling Windows:** Pluggable `MetricsWindowStrategy` protocol. Initial + strategy: multiple simultaneous windows -- + + - **7d** for acute regressions + - **30d** for sustained patterns + - **90d** for baseline/drift + + Minimum 5 data points per window; below that, the system reports "insufficient data." + Future strategies: fixed single window, per-metric configurable. + + --- + + **D12 -- Trend Detection:** Pluggable `TrendDetectionStrategy` protocol. Initial + strategy: Theil-Sen regression slope per window + configurable thresholds classify + trends as improving/stable/declining. Theil-Sen has 29.3% outlier breakdown (tolerates + ~1 in 3 bad data points). Minimum 5 data points. Future strategies: + period-over-period, OLS regression, threshold-only. + +--- + +## Promotions & Demotions + +Agents can move between seniority levels based on performance: + +- **Promotion criteria:** Sustained high quality scores, task complexity handled, peer feedback +- **Demotion criteria:** Repeated failures, quality drops, cost inefficiency +- Promotions can unlock higher [tool access levels](operations.md#tool-access-levels) +- Model upgrades/downgrades may accompany level changes (configurable, see [auto-downgrade](operations.md#cost-controls)) + +!!! info "Design decisions ([Decision Log](../architecture/decisions.md) D13, D14, D15)" + + - **D13 -- Promotion Criteria:** Pluggable `PromotionCriteriaStrategy` protocol. Initial + strategy: configurable threshold gates. `ThresholdEvaluator` with + `min_criteria_met: int` (N of M) + `required_criteria: list[str]`. Setting `min=total` + gives AND; `min=1` gives OR. Default: junior-to-mid = 2 of 3 criteria, + mid-to-senior = all. + - **D14 -- Promotion Approval:** Pluggable `PromotionApprovalStrategy` protocol. Initial + strategy: senior+ requires human approval. Junior-to-mid auto-promotes (low cost + impact: small-to-medium ~4x). Demotions: auto-apply for cost-saving (model downgrade), + human approval for authority-reducing demotions. + - **D15 -- Model Mapping:** Pluggable `ModelMappingStrategy` protocol. Initial strategy: + default ON (`hr.promotions.model_follows_seniority: true`). Model changes at task + boundaries only (never mid-execution, consistent with + [auto-downgrade](operations.md)). Per-agent `preferred_model` overrides seniority + default. Smart routing still uses cheap models for simple tasks regardless of seniority. diff --git a/docs/design/communication.md b/docs/design/communication.md new file mode 100644 index 0000000000..fee1fe7a9f --- /dev/null +++ b/docs/design/communication.md @@ -0,0 +1,372 @@ +--- +title: Communication +description: Message bus architecture, delegation, conflict resolution strategies, and meeting protocols for inter-agent communication. +--- + +# Communication + +The communication architecture defines how agents exchange information, resolve +disagreements, and coordinate through structured meetings. All communication +patterns, conflict resolution strategies, and meeting protocols are pluggable +and configurable per company, per department, or per interaction type. + +--- + +## Communication Patterns + +The framework supports multiple communication patterns, configurable per company: + +=== "Pattern 1: Event-Driven Message Bus" + + **Recommended Default** + + ```text + ┌──────────┐ ┌─────────────────┐ ┌──────────┐ + │ Agent A │────>│ Message Bus │<────│ Agent B │ + └──────────┘ │ (Topics/Queues) │ └──────────┘ + └────────┬────────┘ + │ + ┌───────────┼───────────┐ + v v v + #engineering #product #all-hands + #code-review #design #incidents + ``` + + - Agents publish to topics, subscribe to relevant channels + - Async by default, enables parallelism + - Decoupled -- agents do not need to know about each other + - Natural audit trail of all communications + + Best for + : Most scenarios; scales well, production-ready pattern. + +=== "Pattern 2: Hierarchical Delegation" + + ```text + CEO --> CTO --> Eng Lead --> Sr Dev --> Jr Dev + | + └--> QA Lead --> QA Eng + ``` + + - Tasks flow down the hierarchy, results flow up + - Each level can decompose and refine tasks before delegating + - Authority enforcement built into the flow + + Best for + : Structured organizations with clear chains of command. + +=== "Pattern 3: Meeting-Based" + + ```text + ┌─────────────────────────────────┐ + │ Sprint Planning │ + │ PM + CTO + Devs + QA + Design │ + │ Output: Sprint backlog │ + └─────────────────────────────────┘ + │ + ┌────────┴────────┐ + │ Daily Standup │ + │ Devs + QA │ + │ Output: Status │ + └─────────────────┘ + ``` + + - Structured multi-agent conversations at defined intervals + - Standup, sprint planning, retrospective, design review, code review + + Best for + : Agile workflows, decision-making, alignment. + +=== "Pattern 4: Hybrid" + + **Recommended for Full Company** + + Combines all three patterns: + + - **Message bus** for async daily work and notifications + - **Hierarchical delegation** for task assignment and approvals + - **Meetings** for cross-team decisions and planning ceremonies + +--- + +## Communication Standards + +The framework aligns with emerging industry standards: + +A2A Protocol (Agent-to-Agent, Linux Foundation) +: Inter-agent task delegation, capability discovery via Agent Cards, and + structured task lifecycle management. + +MCP (Model Context Protocol, Agentic AI Foundation / Linux Foundation) +: Agent-to-tool integration, providing standardized tool discovery and + invocation. + +--- + +## Message Format + +```json +{ + "id": "msg-uuid", + "timestamp": "2026-02-27T10:30:00Z", + "from": "sarah_chen", + "to": "engineering", + "type": "task_update", + "priority": "normal", + "channel": "#backend", + "content": "Completed API endpoint for user authentication. PR ready for review.", + "attachments": [ + {"type": "artifact", "ref": "pr-42"} + ], + "metadata": { + "task_id": "task-123", + "project_id": "proj-456", + "tokens_used": 1200, + "cost_usd": 0.018 + } +} +``` + +--- + +## Communication Config + +???+ example "Full communication configuration" + + ```yaml + communication: + default_pattern: "hybrid" + message_bus: + backend: "internal" # internal, redis, rabbitmq, kafka + channels: + - "#all-hands" + - "#engineering" + - "#product" + - "#design" + - "#incidents" + - "#code-review" + - "#watercooler" + meetings: + enabled: true + types: + - name: "daily_standup" + frequency: "per_sprint_day" + participants: ["engineering", "qa"] + duration_tokens: 2000 + - name: "sprint_planning" + frequency: "bi_weekly" + participants: ["all"] + duration_tokens: 5000 + - name: "code_review" + trigger: "on_pr" + participants: ["author", "reviewers"] + hierarchy: + enforce_chain_of_command: true + allow_skip_level: false # can a junior message the CEO directly? + ``` + +--- + +## Loop Prevention + +Agent communication loops (A delegates to B who delegates back to A) are a +critical risk. The framework enforces multiple safeguards: + +| Mechanism | Description | Default | +|-----------|-------------|---------| +| **Max delegation depth** | Hard limit on chain length (A->B->C->D stops at depth N) | 5 | +| **Message rate limit** | Max messages per agent pair within a time window | 10 per minute | +| **Identical request dedup** | Detects and rejects duplicate task delegations within a window | 60s window | +| **Circuit breaker** | If an agent pair exceeds error/bounce threshold, block further messages until manual reset or cooldown | 3 bounces, 5min cooldown | +| **Task ancestry tracking** | Every delegated task carries its full delegation chain; agents cannot delegate back to any ancestor in the chain | Always on | + +???+ example "Loop prevention configuration" + + ```yaml + loop_prevention: + max_delegation_depth: 5 + rate_limit: + max_per_pair_per_minute: 10 + burst_allowance: 3 + dedup_window_seconds: 60 + circuit_breaker: + bounce_threshold: 3 + cooldown_seconds: 300 + ``` + + Ancestry tracking is always enabled and is not user-configurable. + +When a loop is detected, the framework: + +1. Blocks the looping message +2. Notifies the sending agent with the detected loop chain +3. Escalates to the sender's manager (or human if at top of hierarchy) +4. Logs the loop for analytics and process improvement + +--- + +## Conflict Resolution Protocol + +When two or more agents disagree on an approach (architecture, implementation, +priority), the framework provides multiple configurable resolution strategies +behind a `ConflictResolver` protocol. New strategies can be added without +modifying existing ones. The strategy is configurable per company, per +department, or per conflict type. + +=== "Strategy 1: Authority + Dissent Log" + + **Default Strategy** + + The agent with higher authority level decides. Cross-department conflicts + (incomparable authority) escalate to the lowest common manager in the + hierarchy. The losing agent's reasoning is preserved as a **dissent record** + -- a structured log entry containing the conflict context, both positions, + and the resolution. Dissent records feed into organizational learning and + can be reviewed during retrospectives. + + ```yaml + conflict_resolution: + strategy: "authority" # authority, debate, human, hybrid + ``` + + - Deterministic, zero extra tokens, fast resolution + - Dissent records create institutional memory of alternative approaches + +=== "Strategy 2: Structured Debate + Judge" + + Both agents present arguments (1 round each). A judge -- their shared + manager, the CEO, or a configurable arbitrator agent -- evaluates both + positions and decides. The judge's reasoning and both arguments are logged + as a dissent record. + + ```yaml + conflict_resolution: + strategy: "debate" + debate: + judge: "shared_manager" # shared_manager, ceo, designated_agent + ``` + + - Better decisions -- forces agents to articulate reasoning + - Higher token cost, adds latency proportional to argument length + +=== "Strategy 3: Human Escalation" + + All genuine conflicts go to the human approval queue with both positions + summarized. The agent(s) park the conflicting task and work on other tasks + while waiting (see [Approval Timeout](operations.md#approval-timeout-policy)). + + ```yaml + conflict_resolution: + strategy: "human" + ``` + + - Safest -- human always makes the call + - Bottleneck at scale, depends on human availability + +=== "Strategy 4: Hybrid" + + **Recommended for Production** + + Combines strategies with an intelligent review layer: + + 1. Both agents present arguments (1 round) -- preserving dissent + 2. A **conflict review agent** evaluates the result: + - If the resolution is **clear** (one position is objectively better, + or authority applies cleanly) -- resolve automatically, log dissent + record + - If the resolution is **ambiguous** (genuine trade-offs, no clear + winner) -- escalate to human queue with both positions + the review + agent's analysis + + ```yaml + conflict_resolution: + strategy: "hybrid" + hybrid: + review_agent: "conflict_reviewer" # dedicated agent or role + escalate_on_ambiguity: true + ``` + + - Best balance: most conflicts resolve fast, humans only see genuinely + hard calls + - Most complex to implement; review agent itself needs careful prompt + design + +--- + +## Meeting Protocol + +Meetings (Pattern 3 above) follow configurable protocols that determine how +agents interact during structured multi-agent conversations. Different meeting +types naturally suit different protocols. All protocols implement a +`MeetingProtocol` protocol, making the system extensible -- new protocols can be +registered and selected per meeting type. Cost bounds are enforced by +`duration_tokens` in the [communication config](#communication-config). + +=== "Protocol 1: Round-Robin Transcript" + + The meeting leader calls each participant in turn. A shared transcript + grows as each agent responds, seeing all prior contributions. The leader + summarizes and extracts action items at the end. + + ```yaml + meeting_protocol: "round_robin" + round_robin: + max_turns_per_agent: 2 + max_total_turns: 16 + leader_summarizes: true + ``` + + - Simple, natural conversation feel, each agent sees full context + - Token cost grows quadratically; last speaker has more context (ordering + bias) + + Best for + : Daily standups, status updates, small groups (3--5 agents). + +=== "Protocol 2: Async Position Papers + Synthesizer" + + Each agent independently writes a short position paper (parallel execution, + no shared context). A synthesizer agent reads all positions, identifies + agreements and conflicts, and produces decisions + action items. + + ```yaml + meeting_protocol: "position_papers" + position_papers: + max_tokens_per_position: 300 + synthesizer: "meeting_leader" # who synthesizes + ``` + + - Cheapest -- parallel calls, no quadratic growth, no ordering bias, no + groupthink + - Loses back-and-forth dialogue; agents cannot challenge each other's ideas + + Best for + : Brainstorming, architecture proposals, large groups, cost-sensitive + meetings. + +=== "Protocol 3: Structured Phases" + + Meeting split into phases with targeted participation: + + 1. **Agenda broadcast** -- leader shares agenda and context to all + participants + 2. **Input gathering** -- each agent submits input independently (parallel) + 3. **Discussion round** -- only triggered if conflicts are detected between + inputs; relevant agents debate (1 round, capped tokens) + 4. **Decision + action items** -- leader synthesizes, creates tasks from + action items + + ```yaml + meeting_protocol: "structured_phases" + auto_create_tasks: true # action items become tasks (top-level, applies to any protocol) + structured_phases: + skip_discussion_if_no_conflicts: true + max_discussion_tokens: 1000 + ``` + + - Cost-efficient -- parallel input, discussion only when needed + - More complex orchestration; conflict detection between inputs adds + implementation complexity + + Best for + : Sprint planning, design reviews, architecture decisions. diff --git a/docs/design/engine.md b/docs/design/engine.md new file mode 100644 index 0000000000..62eafca175 --- /dev/null +++ b/docs/design/engine.md @@ -0,0 +1,693 @@ +--- +title: Task & Workflow Engine +description: Task lifecycle, execution loops, routing, orchestration, crash recovery, graceful shutdown, and workspace isolation. +--- + +# Task & Workflow Engine + +The task and workflow engine orchestrates how work flows through a synthetic +organization -- from task creation and assignment through agent execution, +crash recovery, and graceful shutdown. Every major subsystem (execution loops, +recovery strategies, shutdown strategies, workspace isolation) is implemented +behind a pluggable protocol interface. + +--- + +## Task Lifecycle + +```mermaid +stateDiagram-v2 + [*] --> CREATED + CREATED --> ASSIGNED : assignment + + ASSIGNED --> IN_PROGRESS : starts + ASSIGNED --> FAILED : early setup failure + ASSIGNED --> BLOCKED : blocked + ASSIGNED --> CANCELLED : cancelled + ASSIGNED --> INTERRUPTED : shutdown signal + + IN_PROGRESS --> IN_REVIEW : agent done + IN_PROGRESS --> FAILED : runtime crash + IN_PROGRESS --> CANCELLED : cancelled + IN_PROGRESS --> INTERRUPTED : shutdown signal + + IN_REVIEW --> COMPLETED : approved + IN_REVIEW --> IN_PROGRESS : rework + + BLOCKED --> ASSIGNED : unblocked + + FAILED --> ASSIGNED : reassign (retry_count < max_retries) + + INTERRUPTED --> ASSIGNED : reassign on restart + + COMPLETED --> [*] + CANCELLED --> [*] +``` + +!!! info "Non-terminal states" + `BLOCKED`, `FAILED`, and `INTERRUPTED` are non-terminal: + + - **BLOCKED** returns to `ASSIGNED` when unblocked. + - **FAILED** returns to `ASSIGNED` for retry when `retry_count < max_retries` + (see [Crash Recovery](#agent-crash-recovery)). + - **INTERRUPTED** returns to `ASSIGNED` on restart + (see [Graceful Shutdown](#graceful-shutdown-protocol)). + - **COMPLETED** and **CANCELLED** are the only terminal states with no + outgoing transitions. + +!!! info "Runtime wrapper" + During execution, `Task` is wrapped by `TaskExecution` (a frozen Pydantic + model) that tracks status transitions via `model_copy(update=...)`, + accumulates `TokenUsage` cost, and records a `StatusTransition` audit trail. + The original `Task` is preserved unchanged; `to_task_snapshot()` produces a + `Task` copy with the current execution status for persistence. + +--- + +## Task Definition + +```yaml +task: + id: "task-123" + title: "Implement user authentication API" + description: "Create REST endpoints for login, register, logout with JWT tokens" + type: "development" # development, design, research, review, meeting, admin + priority: "high" # critical, high, medium, low + project: "proj-456" + created_by: "product_manager_1" + assigned_to: "sarah_chen" + reviewers: ["engineering_lead", "security_engineer"] + dependencies: ["task-120", "task-121"] + artifacts_expected: + - type: "code" + path: "src/auth/" + - type: "tests" + path: "tests/auth/" + - type: "documentation" + path: "docs/api/auth.md" + acceptance_criteria: + - "JWT-based auth with refresh tokens" + - "Rate limiting on login endpoint" + - "Unit and integration tests with >80% coverage" + - "API documentation" + estimated_complexity: "medium" # simple, medium, complex, epic + task_structure: "parallel" # sequential, parallel, mixed + coordination_topology: "auto" # auto, sas, centralized, decentralized, context_dependent + budget_limit: 2.00 # max USD for this task + deadline: null + max_retries: 1 # max reassignment attempts after failure (0 = no retry) + status: "assigned" + parent_task_id: null # parent task ID when created via delegation + delegation_chain: [] # ordered agent IDs of delegators (root first) +``` + +`task_structure` and `coordination_topology` are described in +[Task Decomposability & Coordination Topology](#task-decomposability-coordination-topology). + +--- + +## Workflow Types + +The framework supports four workflow types for organizing task execution: + +### Sequential Pipeline + +```text +Requirements --> Design --> Implementation --> Review --> Testing --> Deploy +``` + +### Parallel Execution + +```text + ┌--> Frontend Dev --┐ +Task ---| |---> Integration --> QA + └--> Backend Dev --┘ +``` + +The `ParallelExecutor` implements concurrent agent execution with +`asyncio.TaskGroup`, configurable concurrency limits, resource locking for +exclusive file access, error isolation, and progress tracking. + +### Kanban Board + +```text +Backlog | Ready | In Progress | Review | Done + o | o | * | o | *** + o | o | * | | ** + o | | | | * +``` + +### Agile Sprints + +```text +Sprint Backlog --> Sprint Execution --> Review --> Retrospective --> Next Sprint +``` + +--- + +## Task Routing & Assignment + +Tasks can be assigned through multiple strategies: + +| Strategy | Description | +|----------|-------------| +| **Manual** | Human or manager explicitly assigns | +| **Role-based** | Auto-assign to agents with matching role/skills | +| **Load-balanced** | Distribute evenly across available agents | +| **Auction** | Agents "bid" on tasks based on confidence/capability | +| **Hierarchical** | Flow down through management chain | +| **Cost-optimized** | Assign to cheapest capable agent | + +All six strategies are implemented behind the `TaskAssignmentStrategy` protocol. +Scoring-based strategies filter out agents at capacity via +`AssignmentRequest.max_concurrent_tasks`. `ManualAssignmentStrategy` raises +exceptions on failure; scoring-based strategies return +`AssignmentResult(selected=None)`. + +--- + +## Agent Execution Loop + +The agent execution loop defines how an agent processes a task from start to +finish. The framework provides multiple configurable loop architectures behind +an `ExecutionLoop` protocol, making the system extensible. The default can vary +by task complexity and is configurable per agent or role. + +### ExecutionLoop Protocol + +All loop implementations satisfy the `ExecutionLoop` runtime-checkable protocol: + +`get_loop_type() -> str` +: Returns a unique identifier (e.g., `"react"`). + +`execute(...) -> ExecutionResult` +: Runs the loop to completion, accepting `AgentContext`, + `CompletionProvider`, optional `ToolInvoker`, optional `BudgetChecker`, + optional `ShutdownChecker`, and optional `CompletionConfig`. + +**Supporting models:** + +`TerminationReason` +: Enum: `COMPLETED`, `MAX_TURNS`, `BUDGET_EXHAUSTED`, `SHUTDOWN`, `ERROR`. + `max_turns` defaults to 20. + +`TurnRecord` +: Frozen per-turn stats (tokens, cost, tool calls, finish reason). + +`ExecutionResult` +: Frozen outcome with final context, termination reason, turn records, and + optional error message (required when reason is `ERROR`). + +`BudgetChecker` +: Callback type `Callable[[AgentContext], bool]` invoked before each LLM call. + +`ShutdownChecker` +: Callback type `Callable[[], bool]` checked at turn boundaries to initiate + cooperative shutdown. + +### Loop Implementations + +=== "Loop 1: ReAct" + + **Default for Simple Tasks** + + A single interleaved loop: the agent reasons about the current state, + selects an action (tool call or response), observes the result, and repeats + until done or `max_turns` is reached. + + ```mermaid + graph LR + A[Think] --> B[Act] + B --> C[Observe] + C --> A + C --> D{Terminate?} + D -->|task complete, max turns,
budget exhausted, or error| E[Done] + ``` + + ```yaml + execution_loop: "react" # react, plan_execute, hybrid, auto + ``` + + | | | + |---|---| + | **Strengths** | Simple, proven, flexible. Easy to implement. Works well for short tasks. | + | **Weaknesses** | Token-heavy on long tasks (re-reads full context every turn). No long-term planning -- greedy step-by-step. | + | **Best for** | Simple tasks, quick fixes, single-file changes. | + +=== "Loop 2: Plan-and-Execute" + + A two-phase approach: the agent first generates a step-by-step plan, then + executes each step sequentially. On failure, the agent can replan. Different + models can be used for planning vs execution (e.g., large model for + planning, small model for execution steps). + + ```mermaid + graph LR + A[Plan
1 call] --> B[Execute Steps
N calls] + B --> C{Step failed?} + C -->|yes| A + C -->|no| D[Done] + ``` + + ```yaml + execution_loop: "plan_execute" + plan_execute: + planner_model: null # null = use agent's model; override for cost optimization + executor_model: null + max_replans: 3 + ``` + + | | | + |---|---| + | **Strengths** | Token-efficient for long tasks. Auditable plan artifact. Supports model tiering. | + | **Weaknesses** | Rigid -- plan may be wrong, replanning is expensive. Over-plans simple tasks. | + | **Best for** | Complex multi-step tasks, epic-level work, tasks spanning multiple files. | + +=== "Loop 3: Hybrid Plan + ReAct Steps" + + **Recommended for Complex Tasks** + + The agent creates a high-level plan (3--7 steps). Each step is executed as a + mini-ReAct loop with its own turn limit. After each step, the agent + checkpoints -- summarizing progress and optionally replanning remaining + steps. Checkpoints are natural points for human inspection or task + suspension. + + ```mermaid + graph TD + A[Plan] --> B[Step 1: mini-ReAct] + B --> C[Checkpoint: summarize progress] + C --> D[Step 2: mini-ReAct] + D --> E[Checkpoint: replan if needed] + E --> F[Step N: mini-ReAct] + F --> G[Done] + ``` + + ```yaml + execution_loop: "hybrid" + hybrid: + max_plan_steps: 7 + max_turns_per_step: 5 + checkpoint_after_each_step: true + allow_replan: true + ``` + + | | | + |---|---| + | **Strengths** | Strategic planning + tactical flexibility. Natural checkpoints for suspension/inspection. | + | **Weaknesses** | Most complex to implement. Plan granularity needs tuning per task type. | + | **Best for** | Complex tasks, multi-file refactoring, tasks requiring both planning and adaptivity. | + +!!! tip "Auto-selection" + When `execution_loop: "auto"`, the framework selects the loop based on + `estimated_complexity`: simple -> ReAct, medium -> Plan-and-Execute, + complex/epic -> Hybrid. Configurable via `auto_loop_rules` -- a mapping + of complexity thresholds to loop implementations. + +### AgentEngine Orchestrator + +`AgentEngine` is the top-level entry point for running an agent on a task. It +composes the execution loop with prompt construction, context management, tool +invocation, and cost tracking into a single `run()` call. + +**Signature:** + +```python +async run( + identity, task, completion_config?, max_turns?, + memory_messages?, timeout_seconds? +) -> AgentRunResult +``` + +**Pipeline steps:** + +1. **Validate inputs** -- agent must be `ACTIVE`, task must be `ASSIGNED` or + `IN_PROGRESS`. Raises `ExecutionStateError` on violation. +2. **Pre-flight budget enforcement** -- if `BudgetEnforcer` is provided, check + monthly hard stop and daily limit via `check_can_execute()`, then apply + auto-downgrade via `resolve_model()`. Raises `BudgetExhaustedError` or + `DailyLimitExceededError` on violation. +3. **Build system prompt** -- calls `build_system_prompt()` with agent identity + and task. Tool definitions are NOT included in the prompt; they are supplied + via the API's `tools` parameter + ([Decision Log](../architecture/decisions.md) D22). + Follows the **non-inferable-only principle**: system prompts include only + information the agent cannot discover by reading the codebase or environment + (role constraints, custom conventions, organizational policies). +4. **Create context** -- `AgentContext.from_identity()` with the configured + `max_turns`. +5. **Seed conversation** -- injects system prompt, optional memory messages, and + formatted task instruction as initial messages. +6. **Transition task** -- `ASSIGNED` -> `IN_PROGRESS` (pass-through if already + `IN_PROGRESS`). +7. **Prepare tools and budget** -- creates `ToolInvoker` from registry and + `BudgetChecker` from `BudgetEnforcer` (task + monthly + daily limits with + pre-computed baselines and alert deduplication) or from task budget limit + alone when no enforcer is configured. +8. **Delegate to loop** -- calls `ExecutionLoop.execute()` with context, + provider, tool invoker, budget checker, and completion config. If + `timeout_seconds` is set, wraps the call in `asyncio.wait_for`; on expiry + the run returns with `TerminationReason.ERROR` but cost recording and + post-execution processing still occur. +9. **Record costs** -- records accumulated `TokenUsage` to `CostTracker` (if + available). Cost recording failures are logged but do not affect the result. +10. **Apply post-execution transitions:** + - `COMPLETED` termination: IN_PROGRESS -> IN_REVIEW -> COMPLETED (two-hop + auto-complete; reviewers planned). + - `SHUTDOWN` termination: current status -> INTERRUPTED + (see [Graceful Shutdown](#graceful-shutdown-protocol)). + - `ERROR` termination: recovery strategy is applied (default + `FailAndReassignStrategy` transitions to FAILED; + see [Crash Recovery](#agent-crash-recovery)). + - All other termination reasons (`MAX_TURNS`, `BUDGET_EXHAUSTED`) leave the + task in its current state. + - Transition failures are logged but do not discard the successful execution + result. +11. **Return result** -- wraps `ExecutionResult` in `AgentRunResult` with + engine-level metadata. + +**Error handling:** `MemoryError` and `RecursionError` propagate +unconditionally. `BudgetExhaustedError` (including `DailyLimitExceededError`) +returns `TerminationReason.BUDGET_EXHAUSTED` without recovery -- budget +exhaustion is a controlled stop, not a crash. All other exceptions are caught +and wrapped in an `AgentRunResult` with `TerminationReason.ERROR`. + +???+ note "AgentRunResult model" + `AgentRunResult` is a frozen Pydantic model wrapping `ExecutionResult` + with engine metadata: + + - `execution_result` -- outcome from the execution loop + - `system_prompt` -- the `SystemPrompt` used for this run + - `duration_seconds` -- wall-clock run time + - `agent_id`, `task_id` -- identifiers + - Computed fields: `termination_reason`, `total_turns`, `total_cost_usd`, + `is_success`, `completion_summary` + +--- + +## Agent Crash Recovery + +When an agent execution fails unexpectedly (unhandled exception, OOM, process +kill), the framework applies a recovery mechanism. Recovery strategies are +implemented behind a `RecoveryStrategy` protocol, making the system pluggable. + +### RecoveryStrategy Protocol + +| Method | Signature | Description | +|--------|-----------|-------------| +| `recover` | `async def recover(*, task_execution, error_message, context) -> RecoveryResult` | Apply recovery to a failed task execution | +| `get_strategy_type` | `def get_strategy_type() -> str` | Return strategy type identifier (must not be empty) | + +### RecoveryResult Model + +| Field | Type | Description | +|-------|------|-------------| +| `task_execution` | `TaskExecution` | Updated execution after recovery (typically `FAILED`) | +| `strategy_type` | `NotBlankStr` | Strategy identifier | +| `context_snapshot` | `AgentContextSnapshot` | Redacted snapshot (turn count, accumulated cost, message count, max turns -- no message contents) | +| `error_message` | `NotBlankStr` | Error that triggered recovery | +| `can_reassign` | `bool` (computed) | `retry_count < task.max_retries` | + +### Recovery Strategies + +=== "Strategy 1: Fail-and-Reassign" + + **Default / MVP** + + The engine catches the failure at its outermost boundary, logs a redacted + `AgentContext` snapshot (turn count, accumulated cost -- excluding message + contents to avoid leaking sensitive prompts/tool outputs), transitions the + task to `FAILED`, and makes it available for reassignment (manual or + automatic via the task router). + + ```yaml + crash_recovery: + strategy: "fail_reassign" # fail_reassign, checkpoint + ``` + + - Simple, no persistence dependency + - All progress is lost on crash -- acceptable for short single-agent tasks + + On crash: + + 1. Catch exception at the `AgentEngine` boundary (outermost `try/except` + in `AgentEngine.run()`) + 2. Log at ERROR with redacted `AgentContextSnapshot` (turn count, + accumulated cost, message count, max turns -- message contents excluded) + 3. Transition `TaskExecution` -> `FAILED` with the exception as the failure + reason + 4. `RecoveryResult.can_reassign` reports whether `retry_count < max_retries` + + !!! info + The `can_reassign` flag is computed and returned in `RecoveryResult`. + The caller (task router) is responsible for incrementing `retry_count` + when creating the next `TaskExecution`. + +=== "Strategy 2: Checkpoint Recovery" + + !!! warning "Planned" + Checkpoint recovery is planned for a future release. + + The engine persists an `AgentContext` snapshot after each completed turn. On + crash, the framework detects the failure (via heartbeat timeout or + exception), loads the last checkpoint, and resumes execution from the exact + turn where it left off. The immutable `model_copy(update=...)` pattern makes + checkpointing trivial -- each `AgentContext` is a complete, self-contained + frozen state that serializes cleanly via `model_dump_json()`. + + ```yaml + crash_recovery: + strategy: "checkpoint" + checkpoint: + persist_every_n_turns: 1 # checkpoint frequency + storage: "sqlite" # sqlite, filesystem + heartbeat_interval_seconds: 30 # detect unresponsive agents + max_resume_attempts: 2 # retry limit before falling back to fail_reassign + ``` + + - Preserves progress -- critical for long tasks (multi-step plans, + epic-level work) + - Requires persistence layer and environment state reconciliation on resume + - Natural fit with the existing immutable state model + + When resuming from a checkpoint, the agent's tools and workspace may have + changed (other agents modified files, external state drifted). The + checkpoint strategy includes a reconciliation step: the resumed agent + receives a summary of changes since the checkpoint timestamp and can adapt + its plan accordingly. + +--- + +## Graceful Shutdown Protocol + +When the process receives SIGTERM/SIGINT (user Ctrl+C, Docker stop, systemd +shutdown), the framework stops cleanly without losing work or leaking costs. +Shutdown strategies are implemented behind a `ShutdownStrategy` protocol. + +### Strategy 1: Cooperative with Timeout (Default / MVP) + +The engine sets a shutdown event, stops accepting new tasks, and gives in-flight +agents a grace period to finish their current turn. Agents check the shutdown +event at turn boundaries (between LLM calls, before tool invocations) and exit +cooperatively. After the grace period, remaining agents are force-cancelled. +**All tasks terminated by shutdown -- whether they exited cooperatively or were +force-cancelled -- are marked `INTERRUPTED`** by the engine layer. + +```yaml +graceful_shutdown: + strategy: "cooperative_timeout" # cooperative_timeout, immediate, finish_tool, checkpoint + cooperative_timeout: + grace_seconds: 30 # time for agents to finish cooperatively + cleanup_seconds: 5 # time for final cleanup (persist cost records, close connections) +``` + +On shutdown signal: + +1. Set `shutdown_event` (`asyncio.Event`) -- agents check this at turn + boundaries +2. Stop accepting new tasks (drain gate closes) +3. Wait up to `grace_seconds` for agents to exit cooperatively +4. Force-cancel remaining agents (`task.cancel()`) -- tasks transition to + `INTERRUPTED` +5. Cleanup phase (`cleanup_seconds`): persist cost records, close provider + connections, flush logs + +!!! info "INTERRUPTED status" + `INTERRUPTED` indicates the task was stopped due to process shutdown -- + regardless of whether the agent exited cooperatively or was force-cancelled + -- and is eligible for manual or automatic reassignment on restart. Valid + transitions: `ASSIGNED -> INTERRUPTED`, `IN_PROGRESS -> INTERRUPTED`, + `INTERRUPTED -> ASSIGNED`. + +!!! tip "Cross-platform compatibility" + `loop.add_signal_handler()` is not supported on Windows. The implementation + uses `signal.signal()` as a fallback. SIGINT (Ctrl+C) works cross-platform; + SIGTERM on Windows requires `os.kill()`. + +!!! warning "In-flight LLM cost leakage" + Non-streaming API calls that are interrupted result in tokens billed but no + response received (silent cost leak). The engine logs request start (with + input token count) before each provider call, so interrupted calls have at + minimum an input-cost audit record. Streaming calls are charged only for + tokens sent before disconnect. + +### Future Strategies + +Strategy 2: Immediate Cancel +: All agent tasks are cancelled immediately via `task.cancel()`. Fastest + shutdown but highest data loss -- partial tool side effects, billed-but-lost + LLM responses. + +Strategy 3: Finish Current Tool +: Like cooperative timeout, but waits for the current tool invocation to + complete even if it exceeds the grace period. Needs per-tool timeout as a + backstop for long-running sandboxed execution. + +Strategy 4: Checkpoint and Stop +: On shutdown signal, each agent persists its full `AgentContext` snapshot and + transitions to `INTERRUPTED`. On restart, the engine loads checkpoints and + resumes execution. This naturally extends + [Checkpoint Recovery](#agent-crash-recovery) -- the only difference is + whether the checkpoint was written proactively (graceful shutdown) or loaded + from the last turn (crash recovery). + +--- + +## Concurrent Workspace Isolation + +When multiple agents work on the same codebase concurrently, they may need to +edit overlapping files. The framework provides a pluggable +`WorkspaceIsolationStrategy` protocol for managing concurrent file access. + +### Strategy 1: Planner + Git Worktrees (Default) + +The task planner decomposes work to minimize file overlap across agents. Each +agent operates in its own git worktree (shared `.git` object database, +independent working tree). On completion, branches are merged sequentially. + +This is the dominant industry pattern (used by OpenAI Codex, Cursor, Claude +Code, VS Code background agents). + +```text +Planner decomposes task: +|- Agent A: src/auth/ (worktree-A) +|- Agent B: src/api/ (worktree-B) +└- Agent C: tests/ (worktree-C) + +Each in isolated git worktree + | +On completion: sequential merge +|- Merge A -> main +|- Rebase B on main, merge +└- Rebase C on main, merge + | +Textual conflicts: git detects, escalate to human or review agent +Semantic conflicts: review agent evaluates merged result +``` + +???+ example "Workspace isolation configuration" + + ```yaml + workspace_isolation: + strategy: "planner_worktrees" # planner_worktrees, sequential, file_locking + planner_worktrees: + max_concurrent_worktrees: 8 + merge_order: "completion" # completion (first done merges first), priority, manual + conflict_escalation: "human" # human, review_agent + ``` + +- True filesystem isolation -- agents cannot overwrite each other's work +- Maximum parallelism during execution; conflicts deferred to merge time +- Leverages mature git infrastructure for merge, diff, and history + +### Future Strategies + +Strategy 2: Sequential Dependencies +: Tasks with overlapping file scopes are ordered sequentially via a dependency + graph. Prevents conflicts by construction but limits parallelism. Requires + upfront knowledge of which files a task will touch. + +Strategy 3: File-Level Locking +: Files are locked at task assignment time. Eliminates conflicts at the source + but requires predicting file access -- difficult for LLM agents that + discover what to edit as they go. Risk of deadlock if multiple agents need + overlapping file sets. + +### State Coordination vs Workspace Isolation + +These are complementary systems handling different types of shared state: + +| State Type | Coordination | Mechanism | +|-----------|-------------|-----------| +| Framework state (tasks, assignments, budget) | Centralized single-writer (`TaskEngine`) | `model_copy(update=...)` via async queue | +| Code and files (agent work output) | Workspace isolation (`WorkspaceIsolationStrategy`) | Git worktrees / branches | +| Agent memory (personal) | Per-agent ownership | Each agent owns its memory exclusively | +| Org memory (shared knowledge) | Single-writer (`OrgMemoryBackend`) | `OrgMemoryBackend` protocol with role-based write access control | + +--- + +## Task Decomposability & Coordination Topology + +Empirical research on agent scaling +([Kim et al., 2025](https://arxiv.org/abs/2512.08296) -- 180 controlled +experiments across 3 LLM families and 4 benchmarks) demonstrates that **task +decomposability is the strongest predictor of multi-agent effectiveness** -- +stronger than team size, model capability, or coordination architecture. + +### Task Structure Classification + +Each task carries a `task_structure` field classifying its decomposability: + +| Structure | Description | Multi-Agent Effect | Example | +|-----------|-------------|------------|---------| +| `sequential` | Steps must execute in strict order; each depends on prior state | **Negative** (-39% to -70%) | Multi-step build processes, ordered migrations, chained API calls | +| `parallel` | Sub-problems can be investigated independently, then synthesized | **Positive** (+57% to +81%) | Financial analysis (revenue + cost + market), multi-file review, research across sources | +| `mixed` | Some sub-tasks are parallel, but a sequential backbone connects phases | **Variable** (depends on ratio) | Feature implementation (design // research -> implement -> test) | + +Classification can be: + +- **Explicit** -- set in task config by the task creator or manager agent +- **Inferred** -- derived from task properties (tool count, dependency graph, + acceptance criteria structure) by the task router + +### Per-Task Coordination Topology + +The [communication pattern](communication.md#communication-patterns) is +configured at the company level, but **coordination topology can be selected +per-task** based on task structure and properties. This allows the engine to use +the most efficient coordination approach for each task rather than applying a +single company-wide pattern. + +| Task Properties | Recommended Topology | Rationale | +|----------------|---------------------|-----------| +| `sequential` + few artifacts (<=4) | **Single-agent (SAS)** | Coordination overhead fragments reasoning capacity on sequential tasks | +| `parallel` + structured domain | **Centralized** | Orchestrator decomposes, sub-agents execute in parallel, orchestrator synthesizes. Lowest error amplification (4.4x) | +| `parallel` + exploratory/open-ended | **Decentralized** | Peer debate enables diverse exploration of high-entropy search spaces | +| `mixed` | **Context-dependent** | Sequential backbone handled by single agent; parallel sub-tasks delegated to sub-agents | + +### Auto Topology Selector + +When topology is set to `"auto"`, the engine selects coordination topology based +on measurable task properties: + +```yaml +coordination: + topology: "auto" # auto, sas, centralized, decentralized, context_dependent + auto_topology_rules: + # sequential tasks -> always single-agent + sequential_override: "sas" + # parallel tasks -> select based on domain structure + parallel_default: "centralized" + # mixed tasks -> SAS backbone for sequential phases, delegates parallel sub-tasks + mixed_default: "context_dependent" # hybrid: not a single topology -- engine selects per-phase +``` + +The auto-selector uses task structure, artifact count, and (when available from +the memory subsystem) historical single-agent success rate as inputs. + +!!! info "Research basis" + These heuristics are derived from Kim et al. (2025), which achieved 87% + accuracy predicting optimal architecture from task properties across + held-out configurations. The SynthOrg context differs (role-differentiated + agents vs. identical agents), so thresholds should be validated empirically + once multi-agent execution is implemented. diff --git a/docs/design/index.md b/docs/design/index.md new file mode 100644 index 0000000000..8d6a74d98e --- /dev/null +++ b/docs/design/index.md @@ -0,0 +1,209 @@ +--- +title: Design Overview +description: Core vision, design principles, and foundational concepts of the SynthOrg framework for building synthetic organizations. +--- + +# Design Overview + +## Core Vision + +SynthOrg is a **configurable AI company framework** where AI agents operate within a virtual +organization. Each agent has a defined role, personality, skills, memory, and model backend. +The company can be configured from a 2-person startup to a 50+ enterprise, handling software +development, business operations, creative work, or any domain. + +## Design Principles + +
+ +- **Configuration over Code** + + --- + + Company structures, roles, and workflows are defined via config, not hardcoded. + +- **Provider Agnostic** + + --- + + Any LLM backend: cloud APIs, OpenRouter, Ollama, custom endpoints. + +- **Composable** + + --- + + Mix and match roles, teams, and workflows. Build any type of company. + +- **Observable** + + --- + + Every agent action, communication, and decision is logged and visible. + +- **Autonomy Spectrum** + + --- + + From full human oversight to fully autonomous operation. + +- **Cost Aware** + + --- + + Built-in budget tracking, model routing optimization, and spending controls. + +- **Extensible** + + --- + + Plugin architecture for new roles, tools, providers, and workflows. + +- **Local First** + + --- + + Runs locally with the option to expose on network or host remotely later. + +
+ +## What This Is NOT + +- Not a chatbot or conversational AI product +- Not locked to software development only (though that is a primary use case) +- Not a wrapper around a single model or provider +- Not a toy/demo -- designed for real, production-quality output + +## MVP Definition + +The MVP validates the core hypothesis: **a single agent can complete a real task end-to-end** +within the framework's architecture. + +**MVP scope:** + +- Single agent executing tasks via the **ReAct** [execution loop](engine.md#agent-execution-loop) +- **Subprocess sandbox** for file system and git tools (Docker optional for code execution) +- **Fail-and-reassign** crash recovery +- **Cooperative graceful shutdown** with configurable timeout +- **Proxy metrics**: turns/tokens/cost per task +- System prompt builder with agent personality injection + +!!! info "How to read the design specification" + + Sections describe the full vision. The full design is documented upfront to inform + architecture decisions -- protocol interfaces are designed even for features that are + not yet implemented. For current implementation status, see the + [Roadmap](../roadmap/index.md). + +## Configuration Philosophy + +The framework follows **progressive disclosure** -- users only configure what they need: + +1. **Templates** handle 90% of users -- pick a template, override 2-3 values, go +2. **Minimal config** for custom setups -- everything has sensible defaults +3. **Full config** for power users -- every knob exposed but none required + +**Minimal custom company** (all other settings use defaults): + +```yaml +company: + name: "Acme Corp" + template: "startup" + budget_monthly: 50.00 +``` + +All configuration systems in the framework are **pluggable** -- strategies, backends, and +policies are swappable via protocol interfaces without modifying existing code. Sensible +defaults are chosen for each, documented in the relevant section alongside the full +configuration reference. + +--- + +## Glossary + +| Term | Definition | +|------|-----------| +| **Agent** | An AI entity with a role, personality, model backend, memory, and tool access. The primary entity in the framework. Within a company context, agents serve as the company's employees. | +| **Company** | A configured organization of agents with structure, hierarchy, and workflows | +| **Department** | A grouping of related roles (Engineering, Product, Design, Operations, etc.) | +| **Role** | A job definition with required skills, responsibilities, authority level, and tool access | +| **Skill** | A capability an agent possesses (coding, writing, analysis, design, etc.) | +| **Task** | A unit of work assigned to one or more agents | +| **Project** | A collection of related tasks with a goal, deadline, and assigned team | +| **Meeting** | A structured multi-agent interaction for decisions, reviews, or planning | +| **Artifact** | Any output produced by agents: code, documents, designs, reports, etc. | + +## Entity Relationships + +The following diagram illustrates how the core entities in SynthOrg relate to each other: + +```mermaid +graph TD + Company --> Departments + Company --> Projects + Company --> Config + Company --> HR["HR Registry"] + + Departments --> DeptHead["Department Head (Agent)"] + Departments --> Members["Members (Agent[])"] + + Projects --> Tasks + Projects --> Team["Team (Agent[])"] + + Tasks --> Assigned["Assigned Agent(s)"] + Tasks --> Artifacts + Tasks --> Status["Status / History"] + + Config --> Autonomy["Autonomy Level"] + Config --> Budget + Config --> CommSettings["Communication Settings"] + Config --> ToolPerms["Tool Permissions"] + + HR --> Active["Active Agents[]"] + HR --> Roles["Available Roles[]"] + HR --> Queue["Hiring Queue"] +``` + +--- + +
+ +- [**Agents & HR**](agents.md) + + --- + + Agent identity, seniority levels, role catalog, hiring, firing, performance tracking, + and promotions. + +- [**Organization & Templates**](organization.md) + + --- + + Company types, organizational hierarchy, department configuration, template system, + and dynamic scaling. + +- [**Communication**](communication.md) + + --- + + Message bus, delegation, conflict resolution, and meeting protocols. + +- [**Engine**](engine.md) + + --- + + Execution loops, task decomposition, routing, orchestration, and recovery. + +- [**Memory**](memory.md) + + --- + + Agent memory, retrieval pipeline, shared organizational memory, and consolidation. + +- [**Operations**](operations.md) + + --- + + Budget enforcement, security, progressive trust, autonomy levels, and approval + workflows. + +
diff --git a/docs/design/memory.md b/docs/design/memory.md new file mode 100644 index 0000000000..4e676aea98 --- /dev/null +++ b/docs/design/memory.md @@ -0,0 +1,552 @@ +--- +title: Memory & Persistence +description: Agent memory architecture, shared organizational memory, backend protocols, operational data persistence, and memory injection strategies. +--- + +# Memory & Persistence + +The SynthOrg framework separates two distinct storage concerns: + +- **Agent memory** -- what agents know, remember, and learn (working, episodic, semantic, procedural, social) +- **Operational data** -- tasks, cost records, messages, and audit logs generated during execution + +Both are implemented behind pluggable protocol interfaces, making storage backends swappable via +configuration without modifying application code. + +--- + +## Memory Architecture + +```text ++-------------------------------------------------+ +| Agent Memory System | ++----------+----------+-----------+---------------+ +| Working | Episodic | Semantic | Procedural | +| Memory | Memory | Memory | Memory | +| | | | | +| Current | Past | Knowledge | Skills & | +| task | events & | & facts | how-to | +| context | decisions| learned | | ++----------+----------+-----------+---------------+ +| Storage Backend | +| SQLite / PostgreSQL / File-based | +| + Mem0 (initial) / Custom Stack (future) | +| See Decision Log | ++-------------------------------------------------+ +``` + +Each agent maintains its own memory store. The storage backend is selected via configuration +and all access flows through the [`MemoryBackend`](#memorybackend-protocol) protocol. + +--- + +## Memory Types + +| Type | Scope | Persistence | Example | +|------|-------|-------------|---------| +| **Working** | Current task | None (in-context) | "I'm implementing the auth endpoint" | +| **Episodic** | Past events | Configurable | "Last sprint the team chose JWT over sessions" | +| **Semantic** | Knowledge | Long-term | "This project uses Litestar with aiosqlite" | +| **Procedural** | Skills/patterns | Long-term | "Code reviews require 2 approvals here" | +| **Social** | Relationships | Long-term | "The QA lead prefers detailed test plans" | + +--- + +## Memory Levels + +Memory persistence is configurable per agent, from no persistence to fully persistent storage. + +???+ note "Memory Level Configuration" + + ```yaml + memory: + level: "persistent" # none, session, project, persistent (default: session) + backend: "mem0" # mem0, custom, cognee, graphiti (future) -- see Decision Log + storage: + data_dir: "/data/memory" # mounted Docker volume path + vector_store: "qdrant" # qdrant (embedded), qdrant-external, etc. + history_store: "sqlite" # sqlite, postgresql + options: + retention_days: null # null = forever + max_memories_per_agent: 10000 + consolidation_interval: "daily" # compress old memories + shared_knowledge_base: true # agents can access shared facts + ``` + +--- + +## Shared Organizational Memory + +Beyond individual agent memory, the framework provides **organizational memory** -- company-wide +knowledge that all agents can access: policies, conventions, architecture decision records (ADRs), +coding standards, and operational procedures. This is not personal episodic memory ("what I did +last Tuesday") but institutional knowledge ("the team always uses Litestar, not Flask"). + +Shared organizational memory is implemented behind an `OrgMemoryBackend` protocol, making the +system highly modular and extensible. New backends can be added without modifying existing ones. + +### Backend 1: Hybrid Prompt + Retrieval (Default) + +Critical rules (5--10 items, e.g., "no commits to main," "all PRs need 2 approvals") are injected +into every agent's system prompt. Extended knowledge (ADRs, detailed procedures, style guides) is +stored in a queryable store and retrieved on demand at task start. + +```yaml +org_memory: + backend: "hybrid_prompt_retrieval" # hybrid_prompt_retrieval, graph_rag, temporal_kg + core_policies: # always in system prompt + - "All code must have 80%+ test coverage" + - "Use Litestar, not Flask" + - "PRs require 2 approvals" + extended_store: + backend: "sqlite" # sqlite, postgresql + max_retrieved_per_query: 5 + write_access: + policies: ["human"] # only humans write core policies + adrs: ["human", "senior", "lead", "c_suite"] + procedures: ["human", "senior", "lead", "c_suite"] +``` + +**Strengths:** Simple to implement. Core rules are always present. Extended knowledge scales +with the organization. + +**Limitations:** Basic retrieval may miss relational connections between policies. + +### Research Directions + +The following backends illustrate why `OrgMemoryBackend` is a protocol -- the architecture +supports future upgrades without modifying existing code. These are research directions that +may inform future work if organizational memory needs outgrow the Hybrid Prompt + Retrieval +approach. + +!!! info "Research Direction: GraphRAG Knowledge Graph" + + Organizational knowledge stored as entities + relationships in a knowledge graph. Agents + query via graph traversal, enabling multi-hop reasoning: "Litestar is the standard" is + linked to "don't use Flask," which is linked to "exception: data team uses Django for admin." + + ```yaml + org_memory: + backend: "graph_rag" + graph: + store: "sqlite" # graph stored in relational DB, or dedicated graph DB + entity_extraction: "auto" # auto-extract entities from ADRs and policies + ``` + + **Strengths:** Significant accuracy improvement over vector-only retrieval (some benchmarks + report 3--4x gains). Multi-hop reasoning captures policy relationships. + + **Limitations:** More complex infrastructure. Entity extraction can be noisy. Heavier setup. + +!!! info "Research Direction: Temporal Knowledge Graph" + + Like GraphRAG but tracks how facts change over time. "The team used Flask until March 2026, + then switched to Litestar." Agents see current truth but can query history for context. + + ```yaml + org_memory: + backend: "temporal_kg" + temporal: + track_changes: true + history_retention_days: null # null = forever + ``` + + **Strengths:** Handles policy evolution naturally. Agents understand when and why things changed. + + **Limitations:** Most complex. Potentially overkill for small organizations or local-first use. + +### OrgMemoryBackend Protocol + +All backends implement the `OrgMemoryBackend` protocol: + +- `query(OrgMemoryQuery) -> tuple[OrgFact, ...]` +- `write(OrgFactWriteRequest, *, author: OrgFactAuthor) -> NotBlankStr` +- `list_policies() -> tuple[OrgFact, ...]` +- Lifecycle methods: `connect`, `disconnect`, `health_check`, `is_connected`, `backend_name` + +The MVP ships with Backend 1 (Hybrid Prompt + Retrieval). The selected memory layer backend +Mem0 ([Decision Log](../architecture/decisions.md)) provides optional graph memory via Neo4j/FalkorDB, which could reduce +implementation effort for the research direction backends. + +!!! tip "Write Access Control" + + Core policies are human-only. ADRs and procedures can be written by senior+ agents. All + writes are versioned and auditable. This prevents agents from corrupting shared organizational + knowledge while allowing senior agents to document decisions. + +--- + +## Memory Backend Protocol + +Agent memory is implemented behind a pluggable `MemoryBackend` protocol (Mem0 initial, custom +stack future -- see [Decision Log](../architecture/decisions.md)). Application code depends only on the protocol; the storage engine is an +implementation detail swappable via config. + +### Enums + +| Enum | Values | Purpose | +|------|--------|---------| +| `MemoryCategory` | WORKING, EPISODIC, SEMANTIC, PROCEDURAL, SOCIAL | Memory type categories | +| `MemoryLevel` | PERSISTENT, PROJECT, SESSION, NONE | Persistence level per agent | +| `ConsolidationInterval` | HOURLY, DAILY, WEEKLY, NEVER | How often old memories are compressed | + +### MemoryBackend Protocol + +```python +@runtime_checkable +class MemoryBackend(Protocol): + """Lifecycle + CRUD for agent memory storage.""" + + async def connect(self) -> None: ... + async def disconnect(self) -> None: ... + async def health_check(self) -> bool: ... + + @property + def is_connected(self) -> bool: ... + @property + def backend_name(self) -> NotBlankStr: ... + + async def store(self, agent_id: NotBlankStr, request: MemoryStoreRequest) -> NotBlankStr: ... + async def retrieve(self, agent_id: NotBlankStr, query: MemoryQuery) -> tuple[MemoryEntry, ...]: ... + async def get(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> MemoryEntry | None: ... + async def delete(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> bool: ... + async def count(self, agent_id: NotBlankStr, *, category: MemoryCategory | None = None) -> int: ... +``` + +### MemoryCapabilities Protocol + +Backends that implement `MemoryCapabilities` expose what features they support, enabling +runtime capability checks before attempting operations. + +```python +@runtime_checkable +class MemoryCapabilities(Protocol): + """Capability discovery for memory backends.""" + + @property + def supported_categories(self) -> frozenset[MemoryCategory]: ... + @property + def supports_graph(self) -> bool: ... + @property + def supports_temporal(self) -> bool: ... + @property + def supports_vector_search(self) -> bool: ... + @property + def supports_shared_access(self) -> bool: ... + @property + def max_memories_per_agent(self) -> int | None: ... +``` + +### SharedKnowledgeStore Protocol + +Backends that support cross-agent shared knowledge implement this protocol alongside +`MemoryBackend`. Not all backends require cross-agent queries -- this keeps the base protocol +clean. + +```python +@runtime_checkable +class SharedKnowledgeStore(Protocol): + """Cross-agent shared knowledge operations.""" + + async def publish(self, agent_id: NotBlankStr, request: MemoryStoreRequest) -> NotBlankStr: ... + async def search_shared(self, query: MemoryQuery, *, exclude_agent: NotBlankStr | None = None) -> tuple[MemoryEntry, ...]: ... + async def retract(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> bool: ... +``` + +### Error Hierarchy + +All memory errors inherit from `MemoryError` so callers can catch the entire family with a +single except clause. + +| Error | When Raised | +|-------|------------| +| `MemoryError` | Base exception for all memory operations | +| `MemoryConnectionError` | Backend connection cannot be established or is lost | +| `MemoryStoreError` | A store or delete operation fails | +| `MemoryRetrievalError` | A retrieve, search, or count operation fails | +| `MemoryNotFoundError` | A specific memory ID is not found | +| `MemoryConfigError` | Memory configuration is invalid | +| `MemoryCapabilityError` | An unsupported operation is attempted for a backend | + +### Configuration + +```yaml +memory: + backend: "mem0" + level: "persistent" # none, session, project, persistent (default: session) + storage: + data_dir: "/data/memory" + vector_store: "qdrant" + history_store: "sqlite" + options: + retention_days: null # null = forever + max_memories_per_agent: 10000 + consolidation_interval: "daily" + shared_knowledge_base: true +``` + +Configuration is modeled by `CompanyMemoryConfig` (top-level), `MemoryStorageConfig` +(storage paths/backends), and `MemoryOptionsConfig` (behaviour tuning). All are frozen +Pydantic models. The `create_memory_backend(config)` factory returns an isolated +`MemoryBackend` instance per company. + +### Consolidation and Retention + +Memory consolidation, retention enforcement, and archival are configured via frozen Pydantic +models in `memory/consolidation/config.py`: + +| Config | Purpose | +|--------|---------| +| `ConsolidationConfig` | Top-level: `max_memories_per_agent` limit, nested `retention` and `archival` sub-configs | +| `RetentionConfig` | Per-category `RetentionRule` tuples (category + retention_days), optional `default_retention_days` fallback | +| `ArchivalConfig` | Enables/disables archival of consolidated entries to `ArchivalStore` | + +!!! abstract "Scope Note" + + Retention is currently per-category, not per-agent. Per-agent retention overrides are a + scope gap to be addressed in a future iteration. + +--- + +## Operational Data Persistence + +Agent memory is handled by the `MemoryBackend` protocol (Mem0 initial, custom stack future -- +see [Decision Log](../architecture/decisions.md)). **Operational data** -- tasks, cost records, messages, audit logs -- is a separate +concern managed by a pluggable `PersistenceBackend` protocol. Application code depends only on +repository protocols; the storage engine is an implementation detail swappable via config. + +### Architecture + +```text ++------------------------------------------------------------------+ +| Application Code | +| engine/ budget/ communication/ security/ | +| | | | | | +| v v v v | +| +------+ +------+ +----------+ +----------+ | +| | Task | | Cost | | Message | | Audit | <-- Repository | +| | Repo | | Repo | | Repo | | Repo | Protocols | +| +--+---+ +--+---+ +----+-----+ +----+-----+ | +| +--------+----------+------------+ | +| | | +| +-------------------+-------------------------------------------+ +| | PersistenceBackend (protocol) | +| | connect() . disconnect() . health_check() . migrate() | +| +-------------------+-------------------------------------------+ +| | | +| +-------------------+-------------------------------------------+ +| | SQLitePersistenceBackend (initial) | +| | PostgresPersistenceBackend (future) | +| | MariaDBPersistenceBackend (future) | +| +---------------------------------------------------------------+ ++------------------------------------------------------------------+ +``` + +### Protocol Design + +```python +@runtime_checkable +class PersistenceBackend(Protocol): + """Lifecycle management for operational data storage.""" + + async def connect(self) -> None: ... + async def disconnect(self) -> None: ... + async def health_check(self) -> bool: ... + async def migrate(self) -> None: ... + + @property + def is_connected(self) -> bool: ... + @property + def backend_name(self) -> NotBlankStr: ... + + @property + def tasks(self) -> TaskRepository: ... + @property + def cost_records(self) -> CostRecordRepository: ... + @property + def messages(self) -> MessageRepository: ... + # ... plus lifecycle_events, task_metrics, collaboration_metrics, + # parked_contexts, audit_entries +``` + +Each entity type has its own repository protocol: + +```python +@runtime_checkable +class TaskRepository(Protocol): + """CRUD + query interface for Task persistence.""" + + async def save(self, task: Task) -> None: ... + async def get(self, task_id: str) -> Task | None: ... + async def list_tasks(self, *, status: TaskStatus | None = None, assigned_to: str | None = None, project: str | None = None) -> tuple[Task, ...]: ... + async def delete(self, task_id: str) -> bool: ... + +@runtime_checkable +class CostRecordRepository(Protocol): + """CRUD + aggregation interface for CostRecord persistence.""" + + async def save(self, record: CostRecord) -> None: ... + async def query(self, *, agent_id: str | None = None, task_id: str | None = None) -> tuple[CostRecord, ...]: ... + async def aggregate(self, *, agent_id: str | None = None) -> float: ... + +@runtime_checkable +class MessageRepository(Protocol): + """CRUD + query interface for Message persistence.""" + + async def save(self, message: Message) -> None: ... + async def get_history(self, channel: str, *, limit: int | None = None) -> tuple[Message, ...]: ... +``` + +### Configuration + +```yaml +persistence: + backend: "sqlite" # sqlite, postgresql, mariadb (future) + sqlite: + path: "/data/synthorg.db" # database file path (mounted volume in Docker) + wal_mode: true # WAL for concurrent read performance + journal_size_limit: 67108864 # 64 MB WAL journal limit + # postgresql: # future + # url: "postgresql://user:pass@host:5432/synthorg" + # pool_size: 10 + # mariadb: # future + # url: "mariadb://user:pass@host:3306/synthorg" + # pool_size: 10 +``` + +### Entities Persisted + +| Entity | Source Module | Repository | Key Queries | +|--------|-------------|------------|-------------| +| `Task` | `core/task.py` | `TaskRepository` | by status, by assignee, by project | +| `CostRecord` | `budget/cost_record.py` | `CostRecordRepository` | by agent, by task, aggregations | +| `Message` | `communication/message.py` | `MessageRepository` | by channel | +| `AuditEntry` | `security/models.py` | `AuditRepository` | by agent, by action type, by verdict, by risk level, time range | +| `ParkedContext` | `security/timeout/parked_context.py` | `ParkedContextRepository` | by execution_id, by agent_id, by task_id | +| Agent runtime state (planned) | `engine/` | `AgentStateRepository` (planned) | by agent_id, active agents | + +### Migration Strategy + +- Migrations run programmatically at startup via `PersistenceBackend.migrate()` +- Initial migration creates all tables +- Versioned migrations implemented per-backend (e.g., `persistence/sqlite/migrations.py` for SQLite) +- SQLite uses `user_version` pragma for version tracking; PostgreSQL/MariaDB use a migrations table + +### Key Principles + +Application code never imports a concrete backend +: Only repository protocols are used. This ensures complete decoupling from the storage engine. + +Adding a new backend requires no changes to consumers +: Implement `PersistenceBackend` + all repository protocols. Existing application code works unchanged. + +Same entity models everywhere +: Repositories accept and return the existing frozen Pydantic models (`Task`, `CostRecord`, `Message`). No ORM models or data transfer objects. + +Async throughout +: All repository methods are async, matching the framework's concurrency model. + +### Multi-Tenancy + +Each company gets its own database. The `PersistenceConfig` embedded in a company's `RootConfig` +specifies the backend type and connection details (e.g., a unique SQLite file path or PostgreSQL +database URL). The `create_backend(config)` factory returns an isolated `PersistenceBackend` +instance per company -- no shared state, no cross-company data leakage. + +```python +# One database per company -- configured in each company's YAML +company_a_backend = create_backend(company_a_config.persistence) +company_b_backend = create_backend(company_b_config.persistence) +# Each backend has independent lifecycle: connect -> migrate -> use -> disconnect +``` + +!!! warning "Planned" + + Runtime backend switching (e.g., migrating a company from SQLite to PostgreSQL during + operation) is a planned future capability. The protocol-based design already supports this -- + the engine would disconnect the current backend, connect a new one with different config, + and migrate. Implementation details (data migration tooling, zero-downtime switchover, + connection draining) are deferred to the PostgreSQL backend implementation. + +--- + +## Memory Injection Strategies + +Agent memory reaches agents through pluggable injection strategies behind the +`MemoryInjectionStrategy` protocol. The strategy determines *how* memories are surfaced to +the agent during execution. + +=== "Context Injection (Default)" + + Pre-retrieves relevant memories before execution, ranks by relevance and recency, enforces + a token budget, and formats memories as `ChatMessage`(s) injected between the system prompt + and task instruction. The agent passively receives memories. + + **Pipeline:** + + 1. `MemoryBackend.retrieve()` -- fetch candidate memories + 2. Rank by relevance + recency (algorithm below) + 3. Filter by `min_relevance` threshold + 4. Apply `MemoryFilterStrategy` ([Decision Log](../architecture/decisions.md) D23, optional) -- exclude inferable content + 5. Greedy token-budget packing + 6. Format as `ChatMessage` (configured role: SYSTEM or USER) with delimiters + + Shared memories (from `SharedKnowledgeStore`) are fetched in parallel, merged with personal + memories (no `personal_boost` for shared), and ranked together. + + **Ranking Algorithm:** + + 1. `relevance = entry.relevance_score ?? config.default_relevance` + 2. Personal entries: `relevance = min(relevance + personal_boost, 1.0)` + 3. `recency = exp(-decay_rate * age_hours)` + 4. `combined = relevance_weight * relevance + recency_weight * recency` + 5. Filter: `combined >= min_relevance` + 6. Sort descending by `combined_score` + + !!! tip "Non-Inferable Filter" + + Retrieved memories are filtered before injection to exclude content the agent can + discover by reading the codebase or environment. Only non-inferable information is + injected: prior decisions, learned conventions, interpersonal context, historical + outcomes. [Research](https://arxiv.org/abs/2602.11988) shows generic context increases + cost 20%+ with minimal success improvement; LLM-generated context can actually reduce + success rates. + + **Filter strategy ([Decision Log](../architecture/decisions.md) D23):** Pluggable `MemoryFilterStrategy` protocol. Initial + implementation uses tag-based filtering at write time. A `non-inferable` tag convention + with advisory validation at the `MemoryBackend.store()` boundary warns on missing tags + but never blocks. The system prompt instructs agents what qualifies as non-inferable: + design rationale, team decisions, "why not X," cross-repo knowledge. Uses existing + `MemoryMetadata.tags` and `MemoryQuery.tags` -- zero new models needed. + +=== "Tool-Based Retrieval (Future)" + + The agent has `recall_memory` / `search_memory` tools it calls on-demand during execution. + The agent actively decides when and what to remember. More token-efficient (only retrieves + when needed) but consumes tool-call turns and requires agent discipline to invoke. + +=== "Self-Editing Memory (Future)" + + The agent has structured memory blocks (core, archival, recall) it reads AND writes during + execution via dedicated tools. Core memory is always in context; archival and recall are + searched via tools. Most sophisticated (Letta/MemGPT-inspired) but highest complexity and + LLM overhead. + +### MemoryInjectionStrategy Protocol + +All strategies implement `MemoryInjectionStrategy`: + +```python +class MemoryInjectionStrategy(Protocol): + + async def prepare_messages( + self, agent_id: NotBlankStr, query_text: str, token_budget: int + ) -> tuple[ChatMessage, ...]: ... + + def get_tool_definitions(self) -> tuple[ToolDefinition, ...]: ... + + @property + def strategy_name(self) -> str: ... +``` + +Strategy selection via config: `memory.retrieval.strategy: context | tool_based | self_editing` diff --git a/docs/design/operations.md b/docs/design/operations.md new file mode 100644 index 0000000000..7c383f41da --- /dev/null +++ b/docs/design/operations.md @@ -0,0 +1,993 @@ +--- +title: Operations +description: LLM providers, budget management, tools, security, and human interaction. +--- + +# Operations + +This section covers the operational infrastructure of the SynthOrg framework: how agents +access LLM providers, how costs are tracked and controlled, how tools are sandboxed and +permissioned, how security policies are enforced, and how humans interact with the system. + +--- + +## Providers + +### Provider Abstraction + +The framework provides a unified interface for all LLM interactions. The provider layer +abstracts away vendor differences, exposing a single `completion()` method regardless of +whether the backend is a cloud API, OpenRouter, Ollama, or a custom endpoint. + +```text ++-------------------------------------------------+ +| Unified Model Interface | +| completion(messages, tools, config) -> resp | ++-----------+-----------+-----------+--------------+ +| Cloud API | OpenRouter| Ollama | Custom | +| Adapter | Adapter | Adapter | Adapter | ++-----------+-----------+-----------+--------------+ +| Direct | 400+ LLMs | Local LLMs| Any API | +| API call | via OR | Self-host | | ++-----------+-----------+-----------+--------------+ +``` + +### Provider Configuration + +???+ note "Provider Configuration (YAML)" + + Model IDs, pricing, and provider examples below are **illustrative**. Actual models, costs, + and provider availability are determined during implementation and loaded dynamically from + provider APIs where possible. + + ```yaml + providers: + example-provider: + api_key: "${PROVIDER_API_KEY}" + models: # example entries -- real list loaded from provider + - id: "example-large-001" + alias: "large" + cost_per_1k_input: 0.015 # illustrative, verify at implementation time + cost_per_1k_output: 0.075 + max_context: 200000 + estimated_latency_ms: 1500 # optional, used by fastest strategy + - id: "example-medium-001" + alias: "medium" + cost_per_1k_input: 0.003 + cost_per_1k_output: 0.015 + max_context: 200000 + estimated_latency_ms: 500 + - id: "example-small-001" + alias: "small" + cost_per_1k_input: 0.0008 + cost_per_1k_output: 0.004 + max_context: 200000 + estimated_latency_ms: 200 + + openrouter: + api_key: "${OPENROUTER_API_KEY}" + base_url: "https://openrouter.ai/api/v1" + models: # example entries + - id: "vendor-a/model-medium" + alias: "or-medium" + - id: "vendor-b/model-pro" + alias: "or-pro" + - id: "vendor-c/model-reasoning" + alias: "or-reasoning" + + ollama: + base_url: "http://localhost:11434" + models: # example entries + - id: "llama3.3:70b" + alias: "local-llama" + cost_per_1k_input: 0.0 # free, local + cost_per_1k_output: 0.0 + - id: "qwen2.5-coder:32b" + alias: "local-coder" + cost_per_1k_input: 0.0 + cost_per_1k_output: 0.0 + ``` + +### LiteLLM Integration + +The framework uses **LiteLLM** as the provider abstraction layer: + +- Unified API across 100+ providers +- Built-in cost tracking +- Automatic retries and fallbacks +- Load balancing across providers +- OpenAI-compatible interface (all providers normalized) + +### Model Routing Strategy + +Model routing determines which LLM handles a given request. Six strategies are available, +selectable via configuration: + +| Strategy | Behavior | +|----------|----------| +| `manual` | Resolve an explicit model override; fails if not set | +| `role_based` | Match agent seniority level to routing rules, then catalog default | +| `cost_aware` | Match task-type rules, then pick cheapest model within budget | +| `cheapest` | Alias for `cost_aware` | +| `fastest` | Match task-type rules, then pick fastest model (by `estimated_latency_ms`) within budget; falls back to cheapest when no latency data is available | +| `smart` | Priority cascade: override > task-type > role > seniority > cheapest > fallback chain | + +```yaml +routing: + strategy: "smart" # smart, cheapest, fastest, role_based, cost_aware, manual + rules: + - role_level: "C-Suite" + preferred_model: "large" + fallback: "medium" + - role_level: "Senior" + preferred_model: "medium" + fallback: "small" + - role_level: "Junior" + preferred_model: "small" + fallback: "local-coder" + - task_type: "code_review" + preferred_model: "medium" + - task_type: "documentation" + preferred_model: "small" + - task_type: "architecture" + preferred_model: "large" + fallback_chain: + - "example-provider" + - "openrouter" + - "ollama" +``` + +--- + +## Budget and Cost Management + +### Budget Hierarchy + +The framework enforces a hierarchical budget structure. Allocations cascade from the company +level through departments to individual teams. + +```mermaid +graph TD + Company["Company Budget ($100/month)"] + Company --> Eng["Engineering (50%) -- $50"] + Company --> QA["Quality/QA (10%) -- $10"] + Company --> Product["Product (15%) -- $15"] + Company --> Ops["Operations (10%) -- $10"] + Company --> Reserve["Reserve (15%) -- $15"] + + Eng --> Backend["Backend Team (40%) -- $20"] + Eng --> Frontend["Frontend Team (30%) -- $15"] + Eng --> DevOps["DevOps Team (30%) -- $15"] +``` + +!!! abstract "Note" + + Percentages are illustrative defaults. All allocations are configurable per company. + +### Cost Tracking + +Every API call is tracked with full context: + +```json +{ + "agent_id": "sarah_chen", + "task_id": "task-123", + "provider": "example-provider", + "model": "example-medium-001", + "input_tokens": 4500, + "output_tokens": 1200, + "cost_usd": 0.0315, + "timestamp": "2026-02-27T10:30:00Z" +} +``` + +`CostRecord` stores `input_tokens` and `output_tokens`; `total_tokens` is a `@computed_field` +property on `TokenUsage` (the model embedded in `CompletionResponse`). Spending aggregation +models (`AgentSpending`, `DepartmentSpending`, `PeriodSpending`) extend a shared +`_SpendingTotals` base class. + +### CFO Agent Responsibilities + +The CFO agent (when enabled) acts as a cost management system. Budget tracking, per-task cost +recording, and cost controls are enforced by `BudgetEnforcer` (a service the engine composes). +CFO cost optimization is implemented via `CostOptimizer`. + +- Monitor real-time spending across all agents +- Alert when departments approach budget limits +- Suggest model downgrades when budget is tight +- Report daily/weekly spending summaries +- Recommend hiring/firing based on cost efficiency +- Block tasks that would exceed remaining budget +- Optimize model routing for cost/quality balance + +`CostOptimizer` implements anomaly detection (sigma + spike factor), per-agent efficiency +analysis, model downgrade recommendations (via `ModelResolver`), routing optimization +suggestions, and operation approval evaluation. `ReportGenerator` produces multi-dimensional +spending reports with task/provider/model breakdowns and period-over-period comparison. + +### Cost Controls + +The budget system enforces three layers: pre-flight checks, in-flight monitoring, and +task-boundary auto-downgrade. + +```yaml +budget: + total_monthly: 100.00 + reset_day: 1 + alerts: + warn_at: 75 # percent + critical_at: 90 + hard_stop_at: 100 + per_task_limit: 5.00 + per_agent_daily_limit: 10.00 + auto_downgrade: + enabled: true + threshold: 85 # percent of budget used + boundary: "task_assignment" # task_assignment only -- NEVER mid-execution + downgrade_map: # ordered pairs -- aliases reference configured models + - ["large", "medium"] + - ["medium", "small"] + - ["small", "local-small"] +``` + +!!! tip "Auto-Downgrade Boundary" + + Model downgrades apply only at **task assignment time**, never mid-execution. An agent + halfway through an architecture review cannot be switched to a cheaper model -- the task + completes on its assigned model. The next task assignment respects the downgrade threshold. + This prevents quality degradation from mid-thought model switches. + +!!! info "Minimal Configuration" + + The only required field is `total_monthly`. All other fields have sensible defaults: + + ```yaml + budget: + total_monthly: 100.00 + ``` + +### LLM Call Analytics + +Every LLM provider call is tracked with comprehensive metadata for financial reporting, +debugging, and orchestration overhead analysis. + +#### Per-Call Tracking and Proxy Overhead Metrics + +Every completion call produces a `CompletionResponse` with `TokenUsage` (token counts and +cost). The engine layer creates a `CostRecord` (with agent/task context) and records it +into `CostTracker`. The engine additionally logs **proxy overhead metrics** at task +completion: + +- `turns_per_task` -- number of LLM turns to complete the task +- `tokens_per_task` -- total tokens consumed +- `cost_per_task` -- total USD cost +- `duration_seconds` -- wall-clock execution time +- `prompt_tokens` -- estimated system prompt tokens +- `prompt_token_ratio` -- ratio of prompt tokens to total tokens (overhead indicator; warns when >0.3) + +These are natural overhead indicators -- a task consuming 15 turns and 50k tokens for a +one-line fix signals a problem. Metrics are captured in `TaskCompletionMetrics`, a frozen +Pydantic model with a `from_run_result()` factory method. + +#### Call Categorization and Orchestration Ratio + +When multi-agent coordination exists, each `CostRecord` is tagged with a **call category**: + +| Category | Description | Examples | +|----------|-------------|---------| +| `productive` | Direct task work -- tool calls, code generation, task output | Agent writing code, running tests | +| `coordination` | Inter-agent communication -- delegation, reviews, meetings | Manager reviewing work, agent presenting in meeting | +| `system` | Framework overhead -- system prompt injection, context loading | Initial prompt, [memory retrieval injection](memory.md#memory-injection-strategies) | + +The **orchestration ratio** (`coordination / total`) is surfaced in metrics and alerts. If +coordination tokens consistently exceed productive tokens, the company configuration needs +tuning (fewer approval layers, simpler [meeting protocols](communication.md#meeting-protocol), +etc.). + +???+ note "Coordination Metrics Suite" + + A comprehensive suite of coordination metrics derived from empirical agent scaling research + ([Kim et al., 2025](https://arxiv.org/abs/2512.08296)). These metrics explain coordination + dynamics and enable data-driven tuning of multi-agent configurations. + + | Metric | Symbol | Definition | What It Signals | + |--------|--------|------------|-----------------| + | **Coordination efficiency** | `Ec` | `success_rate / (turns / turns_sas)` -- success normalized by relative turn count vs single-agent baseline | Overall coordination ROI. Low Ec = coordination costs exceed benefits | + | **Coordination overhead** | `O%` | `(turns_mas - turns_sas) / turns_sas * 100%` -- relative turn increase | Communication cost. Optimal band: 200--300%. Above 400% = over-coordination | + | **Error amplification** | `Ae` | `error_rate_mas / error_rate_sas` -- relative failure probability | Whether MAS corrects or propagates errors. Centralized ~4.4x, Independent ~17.2x | + | **Message density** | `c` | Inter-agent messages per reasoning turn | Communication intensity. Performance saturates at ~0.39 messages/turn | + | **Redundancy rate** | `R` | Mean cosine similarity of agent output embeddings | Agent agreement. Optimal at ~0.41 (balances fusion with independence) | + + All 5 metrics are opt-in via `coordination_metrics.enabled` in analytics config. `Ec` and + `O%` are cheap (turn counting). `Ae` requires baseline comparison data. `c` and `R` require + semantic analysis of agent outputs. + + ```yaml + coordination_metrics: + enabled: false # opt-in -- enable for data gathering + collect: + - efficiency # cheap -- turn counting + - overhead # cheap -- turn counting + - error_amplification # requires SAS baseline data + - message_density # requires message counting infrastructure + - redundancy # requires embedding computation on outputs + baseline_window: 50 # number of SAS runs to establish baseline for Ae + error_taxonomy: + enabled: false # opt-in -- enable for targeted diagnosis + categories: + - logical_contradiction + - numerical_drift + - context_omission + - coordination_failure + ``` + +???+ note "Full Analytics Layer Configuration" + + Expanded per-call metadata for comprehensive financial and operational reporting: + + ```yaml + call_analytics: + track: + - call_category # productive, coordination, system + - success # true/false + - retry_count # 0 = first attempt succeeded + - retry_reason # rate_limit, timeout, internal_error + - latency_ms # wall-clock time for the call + - finish_reason # stop, tool_use, max_tokens, error + - cache_hit # prompt caching hit/miss (provider-dependent) + aggregation: + - per_agent_daily # agent spending over time + - per_task # total cost per task + - per_department # department-level rollups + - per_provider # provider reliability and cost comparison + - orchestration_ratio # coordination vs productive tokens + alerts: + orchestration_ratio: + info: 0.30 # info if coordination > 30% of total + warn: 0.50 # warn if coordination > 50% of total + critical: 0.70 # critical if coordination > 70% of total + retry_rate_warn: 0.1 # warn if > 10% of calls need retries + ``` + + Analytics metadata is append-only and never blocks execution. Failed analytics writes are + logged and skipped -- the agent's task is never delayed by telemetry. + +#### Coordination Error Taxonomy + +When coordination metrics collection is enabled, the system can optionally classify +coordination errors into structured categories for targeted diagnosis. + +| Error Category | Description | Detection Method | +|---------------|-------------|-----------------| +| **Logical contradiction** | Agent asserts both "X is true" and "X is false," or derives conclusions violating its stated premises | Semantic contradiction detection on agent outputs | +| **Numerical drift** | Accumulated computational errors from cascading rounding or unit conversion (>5% deviation) | Numerical comparison against ground truth or cross-agent verification | +| **Context omission** | Failure to reference previously established entities, relationships, or state required for current reasoning | Missing-reference detection across agent conversation history | +| **Coordination failure** | Message misinterpretation, task allocation conflicts, state synchronization errors between agents | Protocol-level error detection in orchestration layer | + +Error taxonomy classification requires semantic analysis of agent outputs and is expensive. +Enable via `coordination_metrics.error_taxonomy.enabled: true` only when actively gathering +data for system tuning. The classification pipeline runs post-execution (never blocks agent +work) and logs structured events to the observability layer. + +Error categories derived from [Kim et al., 2025](https://arxiv.org/abs/2512.08296) and the +Multi-Agent System Failure Taxonomy (MAST) by Cemri et al. (2025). + +--- + +## Tool and Capability System + +### Tool Categories + +| Category | Tools | Typical Roles | +|----------|-------|---------------| +| **File System** | Read, write, edit, list, delete files | All developers, writers | +| **Code Execution** | Run code in sandboxed environments | Developers, QA | +| **Version Control** | Git operations, PR management | Developers, DevOps | +| **Web** | HTTP requests, web scraping, search | Researchers, analysts | +| **Database** | Query, migrate, admin | Backend devs, DBAs | +| **Terminal** | Shell commands (sandboxed) | DevOps, senior devs | +| **Design** | Image generation, mockup tools | Designers | +| **Communication** | Email, Slack, notifications | PMs, executives | +| **Analytics** | Metrics, dashboards, reporting | Data analysts, CFO | +| **Deployment** | CI/CD, container management | DevOps, SRE | +| **MCP Servers** | Any MCP-compatible tool | Configurable per agent | + +### Tool Execution Model + +When the LLM requests multiple tool calls in a single turn, `ToolInvoker.invoke_all` executes +them **concurrently** using `asyncio.TaskGroup`. An optional `max_concurrency` parameter +(default unbounded) limits parallelism via `asyncio.Semaphore`. Recoverable errors are captured +as `ToolResult(is_error=True)` without aborting sibling invocations. Non-recoverable errors +(`MemoryError`, `RecursionError`) are collected and re-raised after all tasks complete (bare +exception for one, `ExceptionGroup` for multiple). + +**Permission checking** follows a priority-based system: + +1. `get_permitted_definitions()` filters tool definitions sent to the LLM -- the agent only + sees tools it is permitted to use +2. At invocation time, denied tools return `ToolResult(is_error=True)` with a descriptive + denial reason (defense-in-depth against LLM hallucinating unpresented tools) + +Resolution order: denied list (highest) > allowed list > access-level categories > deny (default). + +### Tool Sandboxing + +Tool execution uses a **layered sandboxing strategy** with a pluggable `SandboxBackend` +protocol. The default configuration uses lighter isolation for low-risk tools and stronger +isolation for high-risk tools. + +#### Sandbox Backends + +| Backend | Isolation | Latency | Dependencies | Status | +|---------|-----------|---------|--------------|--------| +| `SubprocessSandbox` | Process-level: env filtering (allowlist + denylist), restricted PATH (configurable via `extra_safe_path_prefixes`), workspace-scoped cwd, timeout + process-group kill, library injection var blocking, explicit transport cleanup on Windows | ~ms | None | Implemented | +| `DockerSandbox` | Container-level: ephemeral container, mounted workspace, no network, resource limits (CPU/memory/time) | ~1-2s cold start | Docker | Implemented | +| `K8sSandbox` | Pod-level: per-agent containers, namespace isolation, resource quotas, network policies | ~2-5s | Kubernetes | Future | + +???+ note "Default Layered Sandbox Configuration" + + ```yaml + sandboxing: + default_backend: "subprocess" # subprocess, docker, k8s + overrides: # per-category backend overrides + file_system: "subprocess" # low risk -- fast, no deps + git: "subprocess" # low risk -- workspace-scoped + web: "docker" # medium risk -- needs network isolation + code_execution: "docker" # high risk -- strong isolation required + terminal: "docker" # high risk -- arbitrary commands + database: "docker" # high risk -- data mutation + subprocess: + timeout_seconds: 30 + workspace_only: true # restrict filesystem access to project dir + restricted_path: true # strip dangerous binaries from PATH + docker: + image: "synthorg-sandbox:latest" # pre-built image with common runtimes + network: "none" # no network by default + network_overrides: # category-specific network policies + database: "bridge" # database tools need TCP access to DB host + web: "egress-only" # web tools need outbound HTTP; no inbound + allowed_hosts: [] # allowlist of host:port pairs + memory_limit: "512m" + cpu_limit: "1.0" + timeout_seconds: 120 + mount_mode: "ro" # read-only by default + auto_remove: true # ephemeral -- container removed after execution + k8s: # future -- per-agent pod isolation + namespace: "synthorg-agents" + resource_requests: + cpu: "250m" + memory: "256Mi" + resource_limits: + cpu: "1" + memory: "1Gi" + network_policy: "deny-all" # default deny, allowlist per tool + ``` + +Docker is optional -- only required when code execution, terminal, web, or database tools are +enabled. File system and git tools work out of the box with subprocess isolation. This keeps +the local-first experience lightweight while providing strong isolation where it matters. + +Docker MVP uses `aiodocker` (async-native) with a pre-built image +(Python 3.14 + Node.js LTS + basic utils, <500MB). If Docker is unavailable, the framework +fails with a clear error -- no unsafe subprocess fallback for code execution +([Decision Log](../architecture/decisions.md) D16). + +!!! info "Scaling Path" + + In a future Kubernetes deployment (Phase 3-4), each agent can run in its own pod via + `K8sSandbox`. At that point, the layered configuration becomes less relevant -- all tools + execute within the agent's isolated pod. The `SandboxBackend` protocol makes this + transition seamless. + +### MCP Integration + +External tools are integrated via the **Model Context Protocol** (MCP). + +- **SDK:** Official `mcp` Python SDK, pinned version. A thin `MCPBridgeTool` adapter layer + isolates the rest of the codebase from SDK API changes + ([Decision Log](../architecture/decisions.md) D17) +- **Transports:** stdio (local/dev) and Streamable HTTP (remote/production). Deprecated SSE + is skipped. +- **Result mapping:** Text blocks concatenate to `content: str`; image/audio use placeholders + with base64 in metadata; `structuredContent` maps to `metadata["structured_content"]`; + `isError` maps 1:1 to `is_error` + ([Decision Log](../architecture/decisions.md) D18) + +### Action Type System + +Action types classify agent actions for use by [autonomy presets](#autonomy-levels), +[SecOps validation](#security-operations-agent), +[tiered timeout policies](#approval-timeout-policy), and +[progressive trust](#progressive-trust) +([Decision Log](../architecture/decisions.md) D1). + +**Registry:** `StrEnum` for ~25 built-in action types (type safety, autocomplete, typos caught +at compile time) + `ActionTypeRegistry` for custom types via explicit registration. Unknown +strings are rejected at config load time -- a typo in `human_approval` list silently meaning +"skip approval" is a critical safety concern. + +**Granularity:** Two-level `category:action` hierarchy. Category shortcuts expand to all +actions in that category (e.g., `auto_approve: ["code"]` expands to all `code:*` actions). +Fine-grained overrides are supported (e.g., `human_approval: ["code:create"]`). + +**Taxonomy (~25 leaf types):** + +```text +code:read, code:write, code:create, code:delete, code:refactor +test:write, test:run +docs:write +vcs:read, vcs:commit, vcs:push, vcs:branch +deploy:staging, deploy:production +comms:internal, comms:external +budget:spend, budget:exceed +org:hire, org:fire, org:promote +db:query, db:mutate, db:admin +arch:decide +``` + +**Classification:** Static tool metadata. Each `BaseTool` declares its `action_type`. Default +mapping from `ToolCategory` to action type. Non-tool actions (`org:hire`, `budget:spend`) are +triggered by engine-level operations. No LLM in the security classification path. + +### Tool Access Levels + +???+ note "Tool Access Level Configuration" + + ```yaml + tool_access: + levels: + sandboxed: + description: "No external access. Isolated workspace." + file_system: "workspace_only" + code_execution: "containerized" + network: "none" + git: "local_only" + + restricted: + description: "Limited external access with approval." + file_system: "project_directory" + code_execution: "containerized" + network: "allowlist_only" + git: "read_and_branch" + requires_approval: ["deployment", "database_write"] + + standard: + description: "Normal development access." + file_system: "project_directory" + code_execution: "containerized" + network: "open" + git: "full" + terminal: "restricted_commands" + + elevated: + description: "Full access for senior/trusted agents." + file_system: "full" + code_execution: "containerized" + network: "open" + git: "full" + terminal: "full" + deployment: true + + custom: + description: "Per-agent custom configuration." + ``` + +The current `ToolPermissionChecker` implements **category-level gating only** -- each access +level maps to a set of permitted `ToolCategory` values. The granular sub-constraints shown +above (network mode, containerization) are planned for Docker/K8s sandbox backends. + +### Progressive Trust + +Agents can earn higher tool access over time through configurable trust strategies. The trust +system implements a `TrustStrategy` protocol, making it extensible. All four strategies are +implemented. + +!!! warning "Security Invariant" + + The `standard_to_elevated` promotion **always** requires human approval. No agent can + auto-gain production access regardless of trust strategy. + +=== "Disabled (Default)" + + Trust is disabled. Agents receive their configured access level at hire time and it never + changes. Simplest option -- useful when the human manages permissions manually. + + ```yaml + trust: + strategy: "disabled" # disabled, weighted, per_category, milestone + initial_level: "standard" # fixed access level for all agents + ``` + +=== "Weighted Score" + + A single trust score computed from weighted factors: task difficulty completed, error rate, + time active, and human feedback. One global trust level per agent, applied to all tool + categories. + + ```yaml + trust: + strategy: "weighted" + initial_level: "sandboxed" + weights: + task_difficulty: 0.3 # harder tasks completed = more trust + completion_rate: 0.25 + error_rate: 0.25 # inverse -- fewer errors = more trust + human_feedback: 0.2 + promotion_thresholds: + sandboxed_to_restricted: 0.4 + restricted_to_standard: 0.6 + standard_to_elevated: + score: 0.8 + requires_human_approval: true # always human-gated + ``` + + Simple model, easy to understand. One number to track. However, too coarse -- an agent + trusted for file edits should not auto-gain deployment access. + +=== "Per-Category" + + Separate trust tracks per tool category (filesystem, git, deployment, database, network). + An agent can be "standard" for files but "sandboxed" for deployment. Promotion criteria + differ per category. + + ```yaml + trust: + strategy: "per_category" + initial_levels: + file_system: "restricted" + git: "restricted" + code_execution: "sandboxed" + deployment: "sandboxed" + database: "sandboxed" + terminal: "sandboxed" + promotion_criteria: + file_system: + restricted_to_standard: + tasks_completed: 10 + quality_score_min: 7.0 + deployment: + sandboxed_to_restricted: + tasks_completed: 20 + quality_score_min: 8.5 + requires_human_approval: true # always human-gated for deployment + ``` + + Granular. Matches real security models (IAM roles). Prevents gaming via easy tasks. Trust + state is a matrix per agent, not a scalar. + +=== "Milestone Gates" + + Explicit capability milestones aligned with the Cloud Security Alliance Agentic Trust + Framework. Automated promotion for low-risk levels. Human approval gates for elevated + access. Trust is time-bound and subject to periodic re-verification. + + ```yaml + trust: + strategy: "milestone" + initial_level: "sandboxed" + milestones: + sandboxed_to_restricted: + tasks_completed: 5 + quality_score_min: 7.0 + auto_promote: true # no human needed + restricted_to_standard: + tasks_completed: 20 + quality_score_min: 8.0 + time_active_days: 7 + auto_promote: true + standard_to_elevated: + requires_human_approval: true # always human-gated + clean_history_days: 14 # no errors in last 14 days + re_verification: + enabled: true + interval_days: 90 # re-verify every 90 days + decay_on_idle_days: 30 # demote one level if idle 30+ days + decay_on_error_rate: 0.15 # demote if error rate exceeds 15% + ``` + + Industry-aligned. Re-verification prevents stale trust. Trust decay may need tuning + to avoid frustrating users. + +--- + +## Security and Approval System + +### Approval Workflow + +```text + +---------------+ + | Task/Action | + +-------+-------+ + | + +-------v-------+ + | Security Ops | + | Agent | + +-------+-------+ + / \ + +-----v-+ +---v----+ + |APPROVE | | DENY | + |(auto) | |+ reason| + +----+---+ +---+----+ + | | + Execute +---v---------+ + | Human Queue | + | (Dashboard) | + +---+---------+ + / \ + +-----v-+ +---v----------+ + |Override| |Alternative | + |Approve | |Suggested | + +--------+ +--------------+ +``` + +### Autonomy Levels + +The framework provides four built-in autonomy presets that control which actions agents can +perform independently versus which require human approval. Most users only set the level. + +```yaml +autonomy: + level: "semi" # full, semi, supervised, locked + presets: + full: + description: "Agents work independently. Human notified of results only." + auto_approve: ["all"] + human_approval: [] + + semi: + description: "Most work is autonomous. Major decisions need approval." + auto_approve: ["code", "test", "docs", "comms:internal"] + human_approval: ["deploy", "comms:external", "budget:exceed", "org:hire"] + security_agent: true + + supervised: + description: "Human approves major steps. Agents handle details." + auto_approve: ["code:write", "comms:internal"] + human_approval: ["arch", "code:create", "deploy", "vcs:push"] + security_agent: true + + locked: + description: "Human must approve every action." + auto_approve: [] + human_approval: ["all"] + security_agent: true # still runs for audit logging +``` + +**Autonomy scope** ([Decision Log](../architecture/decisions.md) D6): Three-level +resolution chain: per-agent > per-department > company default. Seniority validation prevents +Juniors/Interns from being set to `full`. + +**Runtime changes** ([Decision Log](../architecture/decisions.md) D7): Human-only +promotion via REST API (no agent, including CEO, can escalate privileges). Automatic downgrade +on: high error rate (one level down), budget exhausted (supervised), security incident (locked). +Recovery from auto-downgrade is human-only. + +### Security Operations Agent + +A special meta-agent that reviews all actions before execution: + +- Evaluates safety of proposed actions +- Checks for data leaks, credential exposure, destructive operations +- Validates actions against company policies +- Maintains an audit log of all approvals/denials +- Escalates uncertain cases to human queue with explanation +- **Cannot be overridden by other agents** (only human can override) + +**Rule engine** ([Decision Log](../architecture/decisions.md) D4): Hybrid +approach. Rule engine for known patterns (credentials, path traversal, destructive ops) -- +sub-ms, covers ~95% of cases. LLM fallback only for uncertain cases (~5%). Full autonomy mode: +rules + audit logging only, no LLM path. Hard safety rules (credential exposure, data +destruction) **never bypass** regardless of autonomy level. + +**Integration point** ([Decision Log](../architecture/decisions.md) D5): +Pluggable `SecurityInterceptionStrategy` protocol. Initial strategy intercepts before every +tool invocation -- slots into existing `ToolInvoker` between permission check and tool +execution. Post-tool-call scanning detects sensitive data in outputs. + +### Output Scan Response Policies + +After the output scanner detects sensitive data, a pluggable `OutputScanResponsePolicy` +protocol decides how to handle the findings: + +| Policy | Behavior | Default for | +|--------|----------|-------------| +| **Redact** (default) | Return scanner's redacted content as-is | `SEMI`, `SUPERVISED` autonomy | +| **Withhold** | Clear redacted content -- fail-closed, no partial data returned | `LOCKED` autonomy | +| **Log-only** | Discard findings (logs at WARNING), pass original output through | `FULL` autonomy | +| **Autonomy-tiered** | Delegate to a sub-policy based on effective autonomy level | Composite policy | + +Policy selection is declarative via `SecurityConfig.output_scan_policy_type` +(`OutputScanPolicyType` enum). A factory function (`build_output_scan_policy`) resolves the +enum to a concrete policy instance. The policy is applied *after* audit recording, preserving +audit fidelity regardless of policy outcome. + +### Approval Timeout Policy + +When an action requires human approval (per autonomy level), the agent must wait. The +framework provides configurable timeout policies that determine what happens when a human +does not respond. All policies implement a `TimeoutPolicy` protocol, configurable per autonomy +level and per action risk tier. + +During any wait -- regardless of policy -- the agent **parks** the blocked task (saving its +full serialized `AgentContext` state: conversation, progress, accumulated cost, turn count) +and picks up other available tasks from its queue. When approval arrives, the agent **resumes** +the original context exactly where it left off. This mirrors real company behavior: a developer +starts another task while waiting for a code review, then returns to the original work when +feedback arrives. + +=== "Wait Forever" + + The action stays in the human queue indefinitely. No timeout, no auto-resolution. The agent + works on other tasks in the meantime. + + ```yaml + approval_timeout: + policy: "wait" # wait, deny, tiered, escalation + ``` + + Safest -- no risk of unauthorized actions. Can stall tasks indefinitely if human is + unavailable. + +=== "Deny on Timeout" + + All unapproved actions auto-deny after a configurable timeout. The agent receives a denial + reason and can retry with a different approach or escalate explicitly. + + ```yaml + approval_timeout: + policy: "deny" + timeout_minutes: 240 # 4 hours + ``` + + Industry consensus default ("fail closed"). May stall legitimate work if human is + consistently slow. + +=== "Tiered Timeout" + + Different timeout behavior based on action risk level. Low-risk actions auto-approve after + a short wait. Medium-risk actions auto-deny. High-risk/security-critical actions wait + forever. + + ```yaml + approval_timeout: + policy: "tiered" + tiers: + low_risk: + timeout_minutes: 60 + on_timeout: "approve" # auto-approve low-risk after 1 hour + actions: ["code:write", "comms:internal", "test"] + medium_risk: + timeout_minutes: 240 + on_timeout: "deny" # auto-deny medium-risk after 4 hours + actions: ["code:create", "vcs:push", "arch:decide"] + high_risk: + timeout_minutes: null # wait forever + on_timeout: "wait" + actions: ["deploy", "db:admin", "comms:external", "org:hire"] + ``` + + Pragmatic -- low-risk tasks do not stall, critical actions stay safe. Auto-approve on + timeout carries risk. Tuning tier boundaries requires operational experience. + +=== "Escalation Chain" + + On timeout, the approval request escalates to the next human in a configured chain. If the + entire chain times out, the action is denied. + + ```yaml + approval_timeout: + policy: "escalation" + chain: + - role: "direct_manager" + timeout_minutes: 120 + - role: "department_head" + timeout_minutes: 240 + - role: "ceo_or_board" + timeout_minutes: 480 + on_chain_exhausted: "deny" # deny if entire chain times out + ``` + + Mirrors real organizations -- if one approver is unavailable, the next in line covers. + Requires configuring an escalation chain. + +!!! abstract "Park/Resume Mechanism" + + The park/resume mechanism relies on `AgentContext` snapshots (frozen Pydantic models). When + a task is parked, the full context is persisted to the + [`PersistenceBackend`](memory.md#operational-data-persistence). When approval arrives, the + framework loads the snapshot, restores the agent's conversation and state, and resumes + execution from the exact point of suspension. This works naturally with the + `model_copy(update=...)` immutability pattern. + + **Design decisions** ([Decision Log](../architecture/decisions.md)): + + - **D19 -- Risk Tier Classification:** Pluggable `RiskTierClassifier` protocol. Configurable + YAML mapping with sensible defaults. Unknown action types default to HIGH (fail-safe). + - **D20 -- Context Serialization:** Pydantic JSON via persistence backend. `ParkedContext` + model with metadata columns + `context_json` blob. Conversation stored verbatim -- + summarization is a context window management concern at resume time, not a persistence + concern. + - **D21 -- Resume Injection:** Tool result injection. Approval requests modeled as tool + calls (`request_human_approval`). Approval decision returned as `ToolResult` -- + semantically correct (approval IS the tool's return value). + +--- + +## Human Interaction Layer + +### API-First Architecture + +The REST/WebSocket API is the **primary interface** for all consumers. The Web UI and any +future CLI tool are thin clients that call the API -- they contain no business logic. + +```text ++-------------------------------------------------+ +| SynthOrg Engine | +| (Core Logic, Agent Orchestration, Tasks) | ++--------------------+----------------------------+ + | + +--------v--------+ + | REST/WS API | <-- primary interface + | (Litestar) | + +---+----------+--+ + | | + +-------v--+ +---v--------+ + | Web UI | | CLI Tool | + | (Future) | | (Future) | + +----------+ +-----------+ +``` + +!!! note "CLI Tool (Future)" + + If needed, a thin CLI utility wrapping the REST API with terminal formatting (Typer + Rich + or similar). Not a priority -- the API is fully self-sufficient. To be determined whether a + dedicated CLI is warranted or whether `curl`/`httpie` and the interactive Scalar docs at + `/docs/api` suffice. + +### API Surface + +| Endpoint | Purpose | +|----------|---------| +| `/api/v1/health` | Health check, readiness | +| `/api/v1/auth` | Authentication: setup, login, password change | +| `/api/v1/company` | CRUD company config | +| `/api/v1/agents` | List, hire, fire, modify agents | +| `/api/v1/departments` | Department management | +| `/api/v1/projects` | Project CRUD | +| `/api/v1/tasks` | Task management | +| `/api/v1/messages` | Communication log | +| `/api/v1/meetings` | Schedule, view meeting outputs | +| `/api/v1/artifacts` | Browse produced artifacts (code, docs, etc.) | +| `/api/v1/budget` | Spending, limits, projections | +| `/api/v1/approvals` | Pending human approvals queue | +| `/api/v1/analytics` | Performance metrics, dashboards | +| `/api/v1/providers` | Model provider status, config | +| `/api/v1/ws` | WebSocket for real-time updates | + +### Web UI Features + +!!! warning "Planned" + + The Web UI is a planned future component (Vue 3). The API is fully self-sufficient for + all operations. + +- **Dashboard**: Real-time company overview, active tasks, spending +- **Org Chart**: Visual hierarchy, click to inspect any agent +- **Task Board**: Kanban/list view of all tasks across projects +- **Message Feed**: Real-time feed of agent communications +- **Approval Queue**: Pending approvals with context and recommendations +- **Agent Profiles**: Detailed view of each agent's identity, history, metrics +- **Budget Panel**: Spending charts, projections, alerts +- **Meeting Logs**: Transcripts and outcomes of all agent meetings +- **Artifact Browser**: Browse and inspect all produced work +- **Settings**: Company config, autonomy levels, provider settings + +### Human Roles + +| Role | Access | Description | +|------|--------|-------------| +| **Board Member** | Observe + major approvals only | Minimal involvement, strategic oversight | +| **CEO** | Full authority, replaces CEO agent | Human IS the CEO, agents are the team | +| **Manager** | Department-level authority | Manages one team/department directly | +| **Observer** | Read-only | Watch the company operate, no intervention | +| **Pair Programmer** | Direct collaboration with one agent | Work alongside a specific agent in real-time | diff --git a/docs/design/organization.md b/docs/design/organization.md new file mode 100644 index 0000000000..074a7e6200 --- /dev/null +++ b/docs/design/organization.md @@ -0,0 +1,253 @@ +--- +title: Organization & Templates +description: Company types, organizational hierarchy, department configuration, template system, and dynamic scaling in the SynthOrg framework. +--- + +# Organization & Templates + +## Company Types + +SynthOrg provides pre-built company templates for common organizational patterns: + +| Template | Size | Roles | Use Case | +|----------|------|-------|----------| +| **Solo Founder** | 1-2 | CEO + Full-Stack Dev | Quick prototypes, solo projects | +| **Startup** | 3-5 | CEO, CTO, 2 Devs, PM | Small projects, MVPs | +| **Dev Shop** | 5-10 | Lead, Sr Dev, Jr Devs, QA, DevOps | Software development focus | +| **Product Team** | 8-15 | PM, Designer, Devs, QA, Data Analyst | Product-focused development | +| **Agency** | 10-20 | Multiple PMs, Designers, Devs, Content | Client work, multiple projects | +| **Full Company** | 20-50+ | All departments, full hierarchy | Enterprise simulation | +| **Research Lab** | 5-10 | Lead Researcher, Analysts, Engineers | Research and analysis | +| **Custom** | Any | User-defined | Anything | + +See the [Template System](#template-system) section for details on how templates are defined, +inherited, and customized. + +--- + +## Organizational Hierarchy + +The framework supports a full organizational hierarchy with reporting lines and +delegation authority: + +```mermaid +graph TD + CEO["CEO"] + + CEO --> CTO["CTO"] + CEO --> CPO["CPO"] + CEO --> CFO["CFO"] + + CTO --> EngLead["Eng Lead"] + CTO --> QALead["QA Lead"] + CTO --> DevOpsLead["DevOps Lead"] + + CPO --> PM["Product Managers"] + CPO --> Design["UX/UI Designers"] + CPO --> TechWriter["Tech Writers"] + + CFO --> BudgetMgmt["Budget Mgmt"] + + EngLead --> SrDevs["Sr Devs"] + EngLead --> JrDevs["Jr Devs"] + + QALead --> QAEng["QA Engineers"] + QALead --> AutoEng["Automation Engineers"] + + DevOpsLead --> SRE["SRE"] +``` + +Each node in the hierarchy corresponds to an [agent](agents.md) with a defined +[seniority level](agents.md#seniority-authority-levels) that determines their authority, +delegation rights, and typical model tier. + +--- + +## Department Configuration + +???+ example "Full department configuration YAML" + + ```yaml + departments: + - name: "engineering" + head: "cto" + budget_percent: 60 + teams: + - name: "backend" + lead: "backend_lead" + members: ["sr_backend_1", "mid_backend_1", "jr_backend_1"] + - name: "frontend" + lead: "frontend_lead" + members: ["sr_frontend_1", "mid_frontend_1"] + - name: "product" + head: "cpo" + budget_percent: 20 + teams: + - name: "core" + lead: "pm_lead" + members: ["pm_1", "ux_designer_1", "ui_designer_1"] + - name: "operations" + head: "coo" + budget_percent: 10 + teams: + - name: "devops" + lead: "devops_lead" + members: ["sre_1"] + - name: "quality" + head: "qa_lead" + budget_percent: 10 + teams: + - name: "qa" + lead: "qa_lead" + members: ["qa_engineer_1", "automation_engineer_1"] + ``` + +Each department defines: + +- **head** -- The agent who leads the department (typically a C-suite or Lead role) +- **budget_percent** -- The share of the company's total budget allocated to this department +- **teams** -- Named sub-groups within the department, each with a lead and members + +--- + +## Dynamic Scaling + +The company can dynamically grow or shrink through several mechanisms: + +- **Auto-scale** -- The HR agent detects workload increases and proposes new + [hires](agents.md#hiring-process) +- **Manual scale** -- A human adds or removes agents via config or UI +- **Budget-driven** -- The CFO agent caps headcount based on budget constraints +- **Skill-gap** -- HR analyzes team capabilities, identifies missing skills, and proposes + targeted hires + +--- + +## Template System + +Templates are YAML/JSON files defining a complete company setup. The framework uses templates as +the primary mechanism for bootstrapping organizations. + +### Template Structure + +```yaml +# templates/startup.yaml (simplified — real templates also declare +# variables, departments, min_agents/max_agents, and tags) +template: + name: "Tech Startup" + description: "Small team for building MVPs and prototypes" + version: "1.0" + + company: + type: "startup" + budget_monthly: "{{ budget | default(50.00) }}" + autonomy: 0.5 + + agents: + - role: "CEO" + name: "{{ ceo_name | auto }}" + model: "large" + personality_preset: "visionary_leader" + + - role: "Full-Stack Developer" + merge_id: "fullstack-senior" + name: "{{ dev1_name | auto }}" + level: "senior" + model: "medium" + personality_preset: "pragmatic_builder" + + - role: "Full-Stack Developer" + merge_id: "fullstack-mid" + name: "{{ dev2_name | auto }}" + level: "mid" + model: "small" + personality_preset: "eager_learner" + + - role: "Product Manager" + name: "{{ pm_name | auto }}" + model: "medium" + personality_preset: "strategic_planner" + + workflow: "agile_kanban" + communication: "hybrid" + + workflow_handoffs: + - from_department: "engineering" + to_department: "qa" + trigger: "pr_ready" + + escalation_paths: + - from_department: "engineering" + to_department: "security" + condition: "vulnerability_found" +``` + +Templates support **Jinja2-style variables** (`{{ variable | default(value) }}`) for +user-customizable values, and **personality presets** for reusable agent personality +configurations. + +### Template Inheritance + +Templates can extend other templates using `extends`: + +```yaml +template: + name: "Extended Startup" + extends: "startup" # inherits all agents, departments, config + agents: + - role: "QA Engineer" # appended to parent agents + level: "mid" + - role: "Full-Stack Developer" + merge_id: "fullstack-mid" + department: "engineering" + _remove: true # removes matching parent agent by key +``` + +Inheritance resolves parent-to-child chains up to **10 levels deep**. Circular inheritance +is detected via chain tracking and raises `TemplateInheritanceError`. + +### Merge Semantics + +The merge behavior during template inheritance follows these rules: + +Scalars (`company_name`, `company_type`) +: Child wins if present. + +`config` dict +: Deep-merged (child keys override parent). + +`agents` list +: Merged by `(role, department, merge_id)` composite key. When `merge_id` is omitted, it + defaults to an empty string, making the key `(role, department, "")`. The child template + can override, append, or remove (`_remove: true`) parent agents. + +`departments` list +: Merged by department `name` (case-insensitive). A child department with the same `name` + replaces the parent entry entirely; departments with new names are appended. + +`workflow_handoffs` and `escalation_paths` +: Child replaces entirely if present. + +--- + +## Company Builder + +!!! warning "Planned" + + The template system already supports creating companies from YAML configs. An interactive + wizard is planned as a future addition after the REST API is complete -- it could be a + thin CLI utility or a web form that posts to `/api/v1/company`. + +--- + +## Community Marketplace + +!!! warning "Planned" + + A future community marketplace would enable sharing and discovery of: + + - Company templates + - Custom role definitions + - Workflow configurations + - Rating and review system + - Import/export in standard format diff --git a/docs/design_spec.md b/docs/design_spec.md deleted file mode 100644 index 059cb97709..0000000000 --- a/docs/design_spec.md +++ /dev/null @@ -1,3527 +0,0 @@ -# SynthOrg - High-Level Design Specification - -> A framework for building synthetic organizations — autonomous AI agents orchestrated as a virtual company, with configurable roles, hierarchies, communication patterns, and tool access. - ---- - -## Table of Contents - -1. [Vision & Philosophy](#1-vision--philosophy) — 1.4 MVP Definition, 1.5 Configuration Philosophy -2. [Core Concepts](#2-core-concepts) -3. [Agent System](#3-agent-system) -4. [Company Structure](#4-company-structure) -5. [Communication Architecture](#5-communication-architecture) — 5.6 Conflict Resolution, 5.7 Meeting Protocol -6. [Task & Workflow Engine](#6-task--workflow-engine) — 6.5 Execution Loop, 6.6 Crash Recovery, **6.7 Graceful Shutdown**, **6.8 Workspace Isolation**, **6.9 Task Decomposability & Coordination Topology** -7. [Memory & Persistence](#7-memory--persistence) — 7.4 Shared Org Memory (Research Directions), **7.5 Memory Backend Protocol**, **7.6 Operational Data Persistence** -8. [HR & Workforce Management](#8-hr--workforce-management) -9. [Model Provider Layer](#9-model-provider-layer) -10. [Cost & Budget Management](#10-cost--budget-management) -11. [Tool & Capability System](#11-tool--capability-system) — **11.1.3 MCP Integration**, **11.1.4 Action Type System**, 11.3 Progressive Trust -12. [Security & Approval System](#12-security--approval-system) — 12.4 Approval Timeout -13. [Human Interaction Layer](#13-human-interaction-layer) -14. [Templates & Builder](#14-templates--builder) -15. [Technical Architecture](#15-technical-architecture) — 15.5 Engineering Conventions -16. [Research & Prior Art](#16-research--prior-art) — **16.3 Agent Scaling Research**, 16.4 Build vs Fork Decision -17. [Open Questions & Risks](#17-open-questions--risks) -18. [Backlog & Future Vision](#18-backlog--future-vision) - ---- - -## 1. Vision & Philosophy - -### 1.1 Core Vision - -Build a **configurable AI company framework** where AI agents operate within a virtual organization. Each agent has a defined role, personality, skills, memory, and model backend. The company can be configured from a 2-person startup to a 50+ enterprise, handling software development, business operations, creative work, or any domain. - -### 1.2 Design Principles - -| Principle | Description | -|-----------|-------------| -| **Configuration over Code** | Company structures, roles, and workflows defined via config, not hardcoded | -| **Provider Agnostic** | Any LLM backend: cloud APIs, OpenRouter, Ollama, custom endpoints | -| **Composable** | Mix and match roles, teams, workflows. Build any type of company | -| **Observable** | Every agent action, communication, and decision is logged and visible | -| **Autonomy Spectrum** | From full human oversight to fully autonomous operation | -| **Cost Aware** | Built-in budget tracking, model routing optimization, spending controls | -| **Extensible** | Plugin architecture for new roles, tools, providers, and workflows | -| **Local First** | Runs locally with option to expose on network or host remotely later | - -### 1.3 What This Is NOT - -- Not a chatbot or conversational AI product -- Not locked to software development only (though that is a primary use case) -- Not a wrapper around a single model or provider -- Not a toy/demo - designed for real, production-quality output - -### 1.4 MVP Definition - -The MVP validates the core hypothesis: **a single agent can complete a real task end-to-end** within the framework's architecture. - -**MVP scope:** - -- Single agent executing tasks via the **ReAct** execution loop -- **Subprocess sandbox** for file system and git tools (Docker optional for code execution) -- **Fail-and-reassign** crash recovery -- **Cooperative graceful shutdown** with configurable timeout -- **Proxy metrics**: turns/tokens/cost per task -- System prompt builder with agent personality injection - -> **How to read this spec:** Sections describe the full vision. The full design is documented upfront to inform architecture decisions — protocol interfaces are designed even for features that are not yet implemented. - -> **Implementation snapshot (2026-03-10):** -> All major subsystems are implemented: config/core models, provider layer, single-agent engine, multi-agent orchestration (message bus, delegation, loop prevention, conflict resolution, meeting protocols), API surface (REST + WebSocket), Docker sandbox, MCP bridge, code runner, HR engine (hiring/firing/onboarding/offboarding/registry, performance tracking, promotion/demotion), memory layer (retrieval pipeline, shared org memory, consolidation/archival — backend selected per [ADR-001](decisions/ADR-001-memory-layer.md)), persistence (SQLite backend, audit entry persistence), budget enforcement (BudgetEnforcer, cost tiers, quota/subscription tracking, CFO cost optimization), SecOps agent (rule engine, audit log, output scanner, output scan response policies, risk classifier, ToolInvoker integration), progressive trust (4 strategies behind TrustStrategy protocol), autonomy levels (presets, resolver, change strategy), and approval timeout policies (4 policies, park/resume service, risk tier classifier). -> - **Remaining:** Mem0 adapter backend, approval workflow gates. - -### 1.5 Configuration Philosophy - -The framework follows **progressive disclosure** — users only configure what they need: - -1. **Templates** handle 90% of users — pick a template, override 2–3 values, go -2. **Minimal config** for custom setups — everything has sensible defaults -3. **Full config** for power users — every knob exposed but none required - -**Minimal custom company** (all other settings use defaults): - -```yaml -company: - name: "Acme Corp" - template: "startup" - budget_monthly: 50.00 -``` - -All configuration systems in the framework are **pluggable** — strategies, backends, and policies are swappable via protocol interfaces without modifying existing code. Sensible defaults are chosen for each, documented in the relevant section alongside the full configuration reference. - ---- - -## 2. Core Concepts - -### 2.1 Glossary - -| Term | Definition | -|------|-----------| -| **Agent** | An AI entity with a role, personality, model backend, memory, and tool access. The primary entity in the framework. Within a company context, agents serve as the company's employees. | -| **Company** | A configured organization of agents with structure, hierarchy, and workflows | -| **Department** | A grouping of related roles (Engineering, Product, Design, Operations, etc.) | -| **Role** | A job definition with required skills, responsibilities, authority level, and tool access | -| **Skill** | A capability an agent possesses (coding, writing, analysis, design, etc.) | -| **Task** | A unit of work assigned to one or more agents | -| **Project** | A collection of related tasks with a goal, deadline, and assigned team | -| **Meeting** | A structured multi-agent interaction for decisions, reviews, or planning | -| **Artifact** | Any output produced by agents: code, documents, designs, reports, etc. | - -### 2.2 Entity Relationships - -```text -Company - ├── Departments[] - │ ├── Department Head (Agent) - │ └── Members (Agent[]) - ├── Projects[] - │ ├── Tasks[] - │ │ ├── Assigned Agent(s) - │ │ ├── Artifacts[] - │ │ └── Status / History - │ └── Team (Agent[]) - ├── Config - │ ├── Autonomy Level - │ ├── Budget - │ ├── Communication Settings - │ └── Tool Permissions - └── HR Registry - ├── Active Agents[] - ├── Available Roles[] - └── Hiring Queue -``` - ---- - -## 3. Agent System - -### 3.1 Agent Identity Card - -Every agent has a comprehensive identity. At the design level, agent data splits into two layers: - -- **Config (immutable)**: identity, personality, skills, model preferences, tool permissions, authority. Defined at hire time, changed only by explicit reconfiguration. Represented as frozen Pydantic models. -- **Runtime state (mutable-via-copy)**: current status, active task, conversation history, execution metrics. Evolves during agent operation. Represented as Pydantic models using `model_copy(update=...)` for state transitions — never mutated in place. - -> **Current state:** Both layers are implemented. Config layer: `AgentIdentity` (frozen, in `core/agent.py`). Runtime state layer: `TaskExecution`, `AgentContext`, `AgentContextSnapshot` (frozen + `model_copy`, in `engine/`). `AgentEngine` orchestrates execution via `run()`. All identifier/name fields use `NotBlankStr` (from `core.types`) for automatic whitespace rejection; optional identifier fields use `NotBlankStr | None`; tuple fields use `tuple[NotBlankStr, ...]` for per-element validation. - -**Personality dimensions** split into two tiers: - -- **Big Five (OCEAN-variant)** — floats (0.0–1.0) used for internal compatibility scoring only (not injected into prompts). `stress_response` replaces traditional neuroticism with inverted polarity (1.0 = very calm). Scored by `core/personality.py`. -- **Behavioral enums** — injected into system prompts as natural-language labels that LLMs respond to: - - `DecisionMakingStyle`: `analytical`, `intuitive`, `consultative`, `directive` - - `CollaborationPreference`: `independent`, `pair`, `team` - - `CommunicationVerbosity`: `terse`, `balanced`, `verbose` - - `ConflictApproach`: `avoid`, `accommodate`, `compete`, `compromise`, `collaborate` (Thomas-Kilmann model) - -```yaml -# --- Config layer — AgentIdentity (frozen) --- -agent: - id: "uuid" - name: "Sarah Chen" - role: "Senior Backend Developer" - department: "Engineering" - level: "Senior" # Junior, Mid, Senior, Lead, Principal, Director, VP, C-Suite - personality: - traits: - - analytical - - detail-oriented - - pragmatic - communication_style: "concise and technical" - risk_tolerance: "low" # low, medium, high - creativity: "medium" # low, medium, high - description: > - Sarah is a methodical backend developer who prioritizes clean architecture - and thorough testing. She pushes back on shortcuts and advocates for - proper error handling. Prefers Pythonic solutions. - # Big Five (OCEAN-variant) dimensions — internal scoring (0.0-1.0) - openness: 0.4 # curiosity, creativity - conscientiousness: 0.9 # thoroughness, reliability - extraversion: 0.3 # assertiveness, sociability - agreeableness: 0.5 # cooperation, empathy - stress_response: 0.75 # emotional stability (1.0 = very calm) - # Behavioral enums — injected into system prompts - decision_making: "analytical" # analytical, intuitive, consultative, directive - collaboration: "independent" # independent, pair, team - verbosity: "balanced" # terse, balanced, verbose - conflict_approach: "compromise" # avoid, accommodate, compete, compromise, collaborate - skills: - primary: - - python - - litestar - - postgresql - - system-design - secondary: - - docker - - redis - - testing - model: - provider: "example-provider" # example provider - model_id: "example-medium-001" # example model — actual models TBD per agent/role - temperature: 0.3 - max_tokens: 8192 - fallback_model: "openrouter/example-medium-001" # example fallback - memory: - type: "persistent" # persistent, project, session, none - retention_days: null # null = forever - tools: - access_level: "standard" # sandboxed | restricted | standard | elevated | custom - allowed: - - file_system - - git - - code_execution - - web_search - - terminal - denied: - - deployment - - database_admin - authority: - can_approve: ["junior_dev_tasks", "code_reviews"] - reports_to: "engineering_lead" - can_delegate_to: ["junior_developers"] - budget_limit: 5.00 # max USD per task - autonomy_level: null # optional: full, semi, supervised, locked (overrides department/company default, §12.2) - hiring_date: "2026-02-27" - status: "active" # active, on_leave, terminated (on config model today) - -# --- Runtime state — engine/ (frozen + model_copy) --- -# TaskExecution wraps Task with evolving execution state: -# status: TaskStatus # evolves via with_transition() -# transition_log: tuple[StatusTransition, ...] -# accumulated_cost: TokenUsage # running totals -# turn_count: int # LLM turns completed -# started_at / completed_at: AwareDatetime | None -# -# AgentContext wraps AgentIdentity + TaskExecution with: -# execution_id: str # uuid4, unique per run -# conversation: tuple[ChatMessage, ...] -# accumulated_cost: TokenUsage # running totals -# turn_count: int # LLM turns completed -# max_turns: int # hard limit (default 20) -# started_at: AwareDatetime -``` - -### 3.2 Seniority & Authority Levels - -| Level | Authority | Typical Model | Cost Tier | -|-------|----------|---------------|-----------| -| Intern/Junior | Execute assigned tasks only | small / local | $ | -| Mid | Execute + suggest improvements | medium / local | $$ | -| Senior | Execute + design + review others | medium / large | $$$ | -| Lead | All above + approve + delegate | large / medium | $$$ | -| Principal/Staff | All above + architectural decisions | large | $$$$ | -| Director | Strategic decisions + budget authority | large | $$$$ | -| VP | Department-wide authority | large | $$$$ | -| C-Suite (CEO/CTO/CFO) | Company-wide authority + final approvals | large | $$$$ | - -### 3.3 Role Catalog (Extensible) - -#### C-Suite / Executive - -- **CEO** - Overall strategy, final decision authority, cross-department coordination -- **CTO** - Technical vision, architecture decisions, technology choices -- **CFO** - Budget management, cost optimization, resource allocation -- **COO** - Operations, process optimization, workflow management -- **CPO** - Product strategy, roadmap, feature prioritization - -#### Product & Design - -- **Product Manager** - Requirements, user stories, prioritization, stakeholder communication -- **UX Designer** - User research, wireframes, user flows, usability -- **UI Designer** - Visual design, component design, design systems -- **UX Researcher** - User interviews, analytics, A/B test design -- **Technical Writer** - Documentation, API docs, user guides - -#### Engineering - -- **Software Architect** - System design, technology decisions, patterns -- **Frontend Developer** (Junior/Mid/Senior) - UI implementation, components, state management -- **Backend Developer** (Junior/Mid/Senior) - APIs, business logic, databases -- **Full-Stack Developer** (Junior/Mid/Senior) - End-to-end implementation -- **DevOps/SRE Engineer** - Infrastructure, CI/CD, monitoring, deployment -- **Database Engineer** - Schema design, query optimization, migrations -- **Security Engineer** - Security audits, vulnerability assessment, secure coding - -#### Quality Assurance - -- **QA Lead** - Test strategy, quality gates, release readiness -- **QA Engineer** - Test plans, manual testing, bug reporting -- **Automation Engineer** - Test frameworks, CI integration, E2E tests -- **Performance Engineer** - Load testing, profiling, optimization - -#### Data & Analytics - -- **Data Analyst** - Metrics, dashboards, business intelligence -- **Data Engineer** - Pipelines, ETL, data infrastructure -- **ML Engineer** - Model training, inference, MLOps - -#### Operations & Support - -- **Project Manager** - Timelines, dependencies, risk management, status tracking -- **Scrum Master** - Agile ceremonies, impediment removal, team health -- **HR Manager** - Hiring recommendations, team composition, performance tracking -- **Security Operations** - Request validation, safety checks, approval workflows - -#### Creative & Marketing - -- **Content Writer** - Blog posts, marketing copy, social media -- **Brand Strategist** - Messaging, positioning, competitive analysis -- **Growth Marketer** - Campaigns, analytics, conversion optimization - -### 3.4 Dynamic Roles - -Users can define custom roles via config: - -```yaml -custom_roles: - - name: "Blockchain Developer" - department: "Engineering" - skills: ["solidity", "web3", "smart-contracts"] - system_prompt_template: "blockchain_dev.md" - authority_level: "senior" - suggested_model: "large" -``` - ---- - -## 4. Company Structure - -### 4.1 Company Types (Templates) - -| Template | Size | Roles | Use Case | -|----------|------|-------|----------| -| **Solo Founder** | 1-2 | CEO + Full-Stack Dev | Quick prototypes, solo projects | -| **Startup** | 3-5 | CEO, CTO, 2 Devs, PM | Small projects, MVPs | -| **Dev Shop** | 5-10 | Lead, Sr Dev, Jr Devs, QA, DevOps | Software development focus | -| **Product Team** | 8-15 | PM, Designer, Devs, QA, Data Analyst | Product-focused development | -| **Agency** | 10-20 | Multiple PMs, Designers, Devs, Content | Client work, multiple projects | -| **Full Company** | 20-50+ | All departments, full hierarchy | Enterprise simulation | -| **Research Lab** | 5-10 | Lead Researcher, Analysts, Engineers | Research and analysis | -| **Custom** | Any | User-defined | Anything | - -### 4.2 Organizational Hierarchy - -```text - ┌─────────┐ - │ CEO │ - └────┬────┘ - ┌──────────────┼──────────────┐ - ┌────┴────┐ ┌────┴────┐ ┌─────┴────┐ - │ CTO │ │ CPO │ │ CFO │ - └────┬────┘ └────┬────┘ └────┬─────┘ - │ │ │ - ┌─────────┼────────┐ │ Budget Mgmt - │ │ │ │ -┌───┴───┐ ┌──┴──┐ ┌───┴──┐ ├── Product Managers -│ Eng │ │ QA │ │DevOps│ ├── UX/UI Designers -│ Lead │ │Lead │ │ Lead │ └── Tech Writers -└───┬───┘ └──┬──┘ └──┬───┘ - │ │ │ - Sr Devs QA Eng SRE - Jr Devs Auto Eng -``` - -### 4.3 Department Configuration - -```yaml -departments: - engineering: - head: "cto" - budget_percent: 60 - teams: - - name: "backend" - lead: "backend_lead" - members: ["sr_backend_1", "mid_backend_1", "jr_backend_1"] - - name: "frontend" - lead: "frontend_lead" - members: ["sr_frontend_1", "mid_frontend_1"] - product: - head: "cpo" - budget_percent: 20 - teams: - - name: "core" - lead: "pm_lead" - members: ["pm_1", "ux_designer_1", "ui_designer_1"] - operations: - head: "coo" - budget_percent: 10 - teams: - - name: "devops" - lead: "devops_lead" - members: ["sre_1"] - quality: - head: "qa_lead" - budget_percent: 10 - teams: - - name: "qa" - lead: "qa_lead" - members: ["qa_engineer_1", "automation_engineer_1"] -``` - -### 4.4 Dynamic Scaling - -The company can dynamically grow or shrink: - -- **Auto-scale**: HR agent detects workload increase, proposes new hires -- **Manual scale**: Human adds/removes agents via config or UI -- **Budget-driven**: CFO agent caps headcount based on budget constraints -- **Skill-gap**: HR analyzes team capabilities, identifies missing skills, proposes hires - ---- - -## 5. Communication Architecture - -### 5.1 Communication Patterns - -The system supports multiple communication patterns, configurable per company: - -#### Pattern 1: Event-Driven Message Bus (Recommended Default) - -```text -┌──────────┐ ┌─────────────────┐ ┌──────────┐ -│ Agent A │────▶│ Message Bus │◀────│ Agent B │ -└──────────┘ │ (Topics/Queues) │ └──────────┘ - └────────┬────────┘ - │ - ┌───────────┼───────────┐ - ▼ ▼ ▼ - #engineering #product #all-hands - #code-review #design #incidents -``` - -- Agents publish to topics, subscribe to relevant channels -- Async by default, enables parallelism -- Decoupled - agents don't need to know about each other -- Natural audit trail of all communications -- **Best for**: Most scenarios, scales well, production-ready pattern - -#### Pattern 2: Hierarchical Delegation - -```text -CEO ──▶ CTO ──▶ Eng Lead ──▶ Sr Dev ──▶ Jr Dev - │ - └──▶ QA Lead ──▶ QA Eng -``` - -- Tasks flow down the hierarchy, results flow up -- Each level can decompose/refine tasks before delegating -- Authority enforcement built into the flow -- **Best for**: Structured organizations, clear chains of command - -#### Pattern 3: Meeting-Based - -```text -┌─────────────────────────────────┐ -│ Sprint Planning │ -│ PM + CTO + Devs + QA + Design │ -│ Output: Sprint backlog │ -└─────────────────────────────────┘ - │ -┌────────┴────────┐ -│ Daily Standup │ -│ Devs + QA │ -│ Output: Status │ -└─────────────────┘ -``` - -- Structured multi-agent conversations at defined intervals -- Standup, sprint planning, retrospective, design review, code review -- **Best for**: Agile workflows, decision-making, alignment - -#### Pattern 4: Hybrid (Recommended for Full Company) - -Combines all three: -- **Message bus** for async daily work and notifications -- **Hierarchical delegation** for task assignment and approvals -- **Meetings** for cross-team decisions and planning ceremonies - -### 5.2 Communication Standards - -The framework should align with emerging industry standards: - -- **A2A Protocol** (Agent-to-Agent, Linux Foundation) - For inter-agent task delegation, capability discovery via Agent Cards, and structured task lifecycle management -- **MCP** (Model Context Protocol, Agentic AI Foundation / Linux Foundation) - For agent-to-tool integration, providing standardized tool discovery and invocation - -### 5.3 Message Format - -```json -{ - "id": "msg-uuid", - "timestamp": "2026-02-27T10:30:00Z", - "from": "sarah_chen", - "to": "engineering", - "type": "task_update", - "priority": "normal", - "channel": "#backend", - "content": "Completed API endpoint for user authentication. PR ready for review.", - "attachments": [ - {"type": "artifact", "ref": "pr-42"} - ], - "metadata": { - "task_id": "task-123", - "project_id": "proj-456", - "tokens_used": 1200, - "cost_usd": 0.018 - } -} -``` - -### 5.4 Communication Config - -```yaml -communication: - default_pattern: "hybrid" - message_bus: - backend: "internal" # internal, redis, rabbitmq, kafka - channels: - - "#all-hands" - - "#engineering" - - "#product" - - "#design" - - "#incidents" - - "#code-review" - - "#watercooler" - meetings: - enabled: true - types: - - name: "daily_standup" - frequency: "per_sprint_day" - participants: ["engineering", "qa"] - duration_tokens: 2000 - - name: "sprint_planning" - frequency: "bi_weekly" - participants: ["all"] - duration_tokens: 5000 - - name: "code_review" - trigger: "on_pr" - participants: ["author", "reviewers"] - hierarchy: - enforce_chain_of_command: true - allow_skip_level: false # can a junior message the CEO directly? -``` - -### 5.5 Loop Prevention - -Agent communication loops (A delegates to B who delegates back to A) are a critical risk. The framework enforces multiple safeguards: - -| Mechanism | Description | Default | -|-----------|-------------|---------| -| **Max delegation depth** | Hard limit on chain length (A→B→C→D stops at depth N) | 5 | -| **Message rate limit** | Max messages per agent pair within a time window | 10 per minute | -| **Identical request dedup** | Detects and rejects duplicate task delegations within a window | 60s window | -| **Circuit breaker** | If an agent pair exceeds error/bounce threshold, block further messages until manual reset or cooldown | 3 bounces → 5min cooldown | -| **Task ancestry tracking** | Every delegated task carries its full delegation chain; agents cannot delegate back to any ancestor in the chain | Always on | - -```yaml -loop_prevention: - max_delegation_depth: 5 - rate_limit: - max_per_pair_per_minute: 10 - burst_allowance: 3 - dedup_window_seconds: 60 - circuit_breaker: - bounce_threshold: 3 - cooldown_seconds: 300 - ancestry_tracking: true # always on, not configurable -``` - -When a loop is detected, the framework: -1. Blocks the looping message -2. Notifies the sending agent with the detected loop chain -3. Escalates to the sender's manager (or human if at top of hierarchy) -4. Logs the loop for analytics and process improvement - -> **Current state:** The communication foundation is implemented: `MessageBus` protocol with `InMemoryMessageBus` backend (asyncio queues, pull-model `receive()`), `MessageDispatcher` for concurrent handler routing via `asyncio.TaskGroup`, `AgentMessenger` per-agent facade (auto-fills sender/timestamp/ID, deterministic direct-channel naming `@{sorted_a}:{sorted_b}`), and `DeliveryEnvelope` for delivery tracking. Loop prevention (§5.5) is implemented: `DelegationGuard` orchestrates five mechanisms (ancestry, depth, dedup, rate limit, circuit breaker) with `LoopPreventionConfig`. Hierarchical delegation is implemented via `DelegationService` with `HierarchyResolver` and `AuthorityValidator`. Task model extended with `parent_task_id` and `delegation_chain` fields. Conflict resolution (§5.6) is implemented: `ConflictResolver` protocol with four strategies (Authority, Debate, HumanEscalation, Hybrid), `ConflictResolutionService` orchestrator, `DissentRecord` audit trail, and `HierarchyResolver.get_lowest_common_manager()` for cross-department conflict escalation. Meeting protocol (§5.7) is implemented with all 3 protocols (round-robin, position papers, structured phases) via `MeetingOrchestrator` in `communication/meeting/`. - -### 5.6 Conflict Resolution Protocol - -When two or more agents disagree on an approach (architecture, implementation, priority, etc.), the framework provides multiple configurable resolution strategies behind a `ConflictResolver` protocol. New strategies can be added without modifying existing ones. The strategy is configurable per company, per department, or per conflict type. - -> **Current state:** All four strategies implemented: `AuthorityResolver` (seniority + hierarchy proximity), `DebateResolver` (judge-based with `JudgeEvaluator` protocol), `HumanEscalationResolver` (stub pending approval queue #37), `HybridResolver` (automated review + escalation). `ConflictResolutionService` orchestrates strategy selection and audit trail (`DissentRecord`). Models: `Conflict`, `ConflictPosition`, `ConflictResolution` (frozen Pydantic). Config: `ConflictResolutionConfig`, `DebateConfig`, `HybridConfig`. `HierarchyResolver` extended with `get_lowest_common_manager()` and `get_delegation_depth()`. Event constants in `observability/events/conflict.py`. - -#### Strategy 1: Authority + Dissent Log (Default) - -The agent with higher authority level decides. Cross-department conflicts (incomparable authority) escalate to the lowest common manager in the hierarchy. The losing agent's reasoning is preserved as a **dissent record** — a structured log entry containing the conflict context, both positions, and the resolution. Dissent records feed into organizational learning and can be reviewed during retrospectives. - -```yaml -conflict_resolution: - strategy: "authority" # authority, debate, human, hybrid -``` - -- Deterministic, zero extra tokens, fast resolution -- Dissent records create institutional memory of alternative approaches - -#### Strategy 2: Structured Debate + Judge - -Both agents present arguments (1 round each). A judge — their shared manager, the CEO, or a configurable arbitrator agent — evaluates both positions and decides. The judge's reasoning and both arguments are logged as a dissent record. - -```yaml -conflict_resolution: - strategy: "debate" - debate: - judge: "shared_manager" # shared_manager, ceo, designated_agent -``` - -- Better decisions — forces agents to articulate reasoning -- Higher token cost, adds latency proportional to argument length - -#### Strategy 3: Human Escalation - -All genuine conflicts go to the human approval queue with both positions summarized. The agent(s) park the conflicting task and work on other tasks while waiting (see §12.4 Approval Timeout). - -```yaml -conflict_resolution: - strategy: "human" -``` - -- Safest — human always makes the call -- Bottleneck at scale, depends on human availability - -#### Strategy 4: Hybrid (Recommended for Production) - -Combines strategies with an intelligent review layer: - -1. Both agents present arguments (1 round) — preserving dissent -2. A **conflict review agent** evaluates the result: - - If the resolution is **clear** (one position is objectively better, or authority applies cleanly) → resolve automatically, log dissent record - - If the resolution is **ambiguous** (genuine trade-offs, no clear winner) → escalate to human queue with both positions + the review agent's analysis - -```yaml -conflict_resolution: - strategy: "hybrid" - hybrid: - review_agent: "conflict_reviewer" # dedicated agent or role - escalate_on_ambiguity: true -``` - -- Best balance: most conflicts resolve fast, humans only see genuinely hard calls -- Most complex to implement; review agent itself needs careful prompt design - -### 5.7 Meeting Protocol - -Meetings (§5.1 Pattern 3) follow configurable protocols that determine how agents interact during structured multi-agent conversations. Different meeting types naturally suit different protocols. All protocols implement a `MeetingProtocol` protocol, making the system extensible — new protocols can be registered and selected per meeting type. Cost bounds are enforced by `duration_tokens` in meeting config (§5.4). - -> **Current state:** All 3 meeting protocols are implemented in `communication/meeting/`: `RoundRobinProtocol`, `PositionPapersProtocol`, and `StructuredPhasesProtocol`. The `MeetingOrchestrator` runs meetings end-to-end with token budget enforcement via `TokenTracker`. Shared LLM response parsing for decisions and action items is in `_parsing.py`. All protocols implement the `MeetingProtocol` protocol interface. - -#### Protocol 1: Round-Robin Transcript - -The meeting leader calls each participant in turn. A shared transcript grows as each agent responds, seeing all prior contributions. The leader summarizes and extracts action items at the end. - -```yaml -meeting_protocol: "round_robin" -round_robin: - max_turns_per_agent: 2 - max_total_turns: 16 - leader_summarizes: true -``` - -- Simple, natural conversation feel, each agent sees full context -- Token cost grows quadratically; last speaker has more context (ordering bias) -- **Best for**: Daily standups, status updates, small groups (3-5 agents) - -#### Protocol 2: Async Position Papers + Synthesizer - -Each agent independently writes a short position paper (parallel execution, no shared context). A synthesizer agent reads all positions, identifies agreements and conflicts, and produces decisions + action items. - -```yaml -meeting_protocol: "position_papers" -position_papers: - max_tokens_per_position: 300 - synthesizer: "meeting_leader" # who synthesizes -``` - -- Cheapest — parallel calls, no quadratic growth, no ordering bias, no groupthink -- Loses back-and-forth dialogue; agents can't challenge each other's ideas -- **Best for**: Brainstorming, architecture proposals, large groups, cost-sensitive meetings - -#### Protocol 3: Structured Phases - -Meeting split into phases with targeted participation: - -1. **Agenda broadcast** — leader shares agenda and context to all participants -2. **Input gathering** — each agent submits input independently (parallel) -3. **Discussion round** — only triggered if conflicts are detected between inputs; relevant agents debate (1 round, capped tokens) -4. **Decision + action items** — leader synthesizes, creates tasks from action items - -```yaml -meeting_protocol: "structured_phases" -auto_create_tasks: true # action items become tasks (top-level, applies to any protocol) -structured_phases: - skip_discussion_if_no_conflicts: true - max_discussion_tokens: 1000 -``` - -- Cost-efficient — parallel input, discussion only when needed -- More complex orchestration; conflict detection between inputs needs design -- **Best for**: Sprint planning, design reviews, architecture decisions - ---- - -## 6. Task & Workflow Engine - -### 6.1 Task Lifecycle - -```text - ┌──────────┐ - │ CREATED │ - └─────┬─────┘ - │ assignment - ┌─────▼─────┐ ┌──────────┐ - ┌──────│ ASSIGNED │──────────▶│ FAILED │ - │ └─────┬─────┘◀───┐ └────┬─────┘ - │ │ starts │ reassign │ - │ ┌─────▼─────┐ │ ┌────▼─────┐ - │ │IN_PROGRESS │───┼─────▶│ (retry) │ - │ └─────┬─────┘ │ └──────────┘ - │ │ ◀── (rework) - │ │ agent done - │ ┌─────▼─────┐ - │ │ IN_REVIEW │ - │ └─────┬─────┘ - │ │ approved - │ ┌─────▼─────┐ - │ │ COMPLETED │ - │ └────────────┘ - │ - │ blocked cancelled (from ASSIGNED or IN_PROGRESS) - ┌─────▼─────┐ ┌────────────┐ - │ BLOCKED │ │ CANCELLED │ ◀── ASSIGNED / IN_PROGRESS - └─────┬─────┘ └────────────┘ - │ unblocked (terminal) - └──▶ ASSIGNED - - shutdown signal: - ┌─────────────┐ - │ INTERRUPTED │──── reassign on restart ──▶ ASSIGNED - └─────────────┘ -``` - -> **Non-terminal states:** BLOCKED, FAILED, and INTERRUPTED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6), INTERRUPTED returns to ASSIGNED on restart (see §6.7). COMPLETED and CANCELLED are terminal states with no outgoing transitions. -> -> **Transitions into FAILED:** Both `ASSIGNED → FAILED` (early setup failures) and `IN_PROGRESS → FAILED` (runtime crashes) are valid. `FAILED → ASSIGNED` enables reassignment when `retry_count < max_retries`. -> -> **Transitions into INTERRUPTED:** Both `ASSIGNED → INTERRUPTED` and `IN_PROGRESS → INTERRUPTED` are valid (graceful shutdown can occur at any active phase). `INTERRUPTED → ASSIGNED` enables reassignment on restart. - -> **Runtime wrapper:** During execution, `Task` is wrapped by `TaskExecution` (in `engine/task_execution.py`). `TaskExecution` is a frozen Pydantic model that tracks status transitions via `model_copy(update=...)`, accumulates `TokenUsage` cost, and records a `StatusTransition` audit trail. The original `Task` is preserved unchanged; `to_task_snapshot()` produces a `Task` copy with the current execution status for persistence. - -### 6.2 Task Definition - -```yaml -task: - id: "task-123" - title: "Implement user authentication API" - description: "Create REST endpoints for login, register, logout with JWT tokens" - type: "development" # development, design, research, review, meeting, admin - priority: "high" # critical, high, medium, low - project: "proj-456" - created_by: "product_manager_1" - assigned_to: "sarah_chen" - reviewers: ["engineering_lead", "security_engineer"] - dependencies: ["task-120", "task-121"] - artifacts_expected: - - type: "code" - path: "src/auth/" - - type: "tests" - path: "tests/auth/" - - type: "documentation" - path: "docs/api/auth.md" - acceptance_criteria: - - "JWT-based auth with refresh tokens" - - "Rate limiting on login endpoint" - - "Unit and integration tests with >80% coverage" - - "API documentation" - estimated_complexity: "medium" # simple, medium, complex, epic - task_structure: "parallel" # sequential, parallel, mixed (see §6.9) - coordination_topology: "auto" # auto, sas, centralized, decentralized, context_dependent (see §6.9) - budget_limit: 2.00 # max USD for this task - deadline: null - max_retries: 1 # max reassignment attempts after failure (0 = no retry) - status: "assigned" - parent_task_id: null # parent task ID when created via delegation - delegation_chain: [] # ordered agent IDs of delegators (root first) -``` - -### 6.3 Workflow Types - -#### Sequential Pipeline - -```text -Requirements ──▶ Design ──▶ Implementation ──▶ Review ──▶ Testing ──▶ Deploy -``` - -#### Parallel Execution - -```text - ┌──▶ Frontend Dev ──┐ -Task ───┤ ├──▶ Integration ──▶ QA - └──▶ Backend Dev ──┘ -``` - -> **Current state:** `ParallelExecutor` (in `engine/parallel.py`) implements concurrent agent execution with `asyncio.TaskGroup`, configurable concurrency limits, resource locking for exclusive file access, error isolation, and progress tracking. Models in `engine/parallel_models.py`: `AgentAssignment`, `ParallelExecutionGroup`, `AgentOutcome`, `ParallelExecutionResult`, `ParallelProgress`. - -#### Kanban Board - -```text -Backlog │ Ready │ In Progress │ Review │ Done - ○ │ ○ │ ● │ ○ │ ●●● - ○ │ ○ │ ● │ │ ●● - ○ │ │ │ │ ● -``` - -#### Agile Sprints - -```text -Sprint Backlog → Sprint Execution → Review → Retrospective → Next Sprint -``` - -### 6.4 Task Routing & Assignment - -Tasks can be assigned through multiple strategies: - -| Strategy | Description | -|----------|-------------| -| **Manual** | Human or manager explicitly assigns | -| **Role-based** | Auto-assign to agents with matching role/skills | -| **Load-balanced** | Distribute evenly across available agents | -| **Auction** | Agents "bid" on tasks based on confidence/capability | -| **Hierarchical** | Flow down through management chain | -| **Cost-optimized** | Assign to cheapest capable agent | - -> **Current state:** All six strategies are implemented behind the `TaskAssignmentStrategy` protocol. Manual, Role-based, Load-balanced, Cost-optimized, and Auction strategies are in the static `STRATEGY_MAP`. Hierarchical requires a `HierarchyResolver` at runtime via `build_strategy_map(hierarchy=...)`. Config-level `TaskAssignmentConfig` validates strategy names against the known set. Scoring-based strategies filter out agents at capacity via `AssignmentRequest.max_concurrent_tasks`. Error signaling contract: `ManualAssignmentStrategy` raises exceptions (`TaskAssignmentError`, `NoEligibleAgentError`); scoring-based strategies return `AssignmentResult(selected=None)`. `TaskAssignmentService` propagates both patterns. - -### 6.5 Agent Execution Loop - -The agent execution loop defines how an agent processes a task from start to finish. The framework provides multiple configurable loop architectures behind an `ExecutionLoop` protocol, making the system extensible. The default can vary by task complexity, and is configurable per agent or role. - -> **Current state:** ReAct (Loop 1) and Plan-and-Execute (Loop 2) are implemented. `ParallelExecutor` enables concurrent `AgentEngine.run()` calls with `TaskGroup` + Semaphore concurrency limits, resource locking, and error isolation (see §6.3). Hybrid loop and auto-selection are planned. - -#### ExecutionLoop Protocol - -All loop implementations satisfy the `ExecutionLoop` runtime-checkable protocol (defined in `engine/loop_protocol.py`): - -- **`get_loop_type() -> str`** — returns a unique identifier (e.g. `"react"`) -- **`execute(...) -> ExecutionResult`** — runs the loop to completion, accepting `AgentContext`, `CompletionProvider`, optional `ToolInvoker`, optional `BudgetChecker`, optional `ShutdownChecker`, and optional `CompletionConfig` - -Supporting models: - -- **`TerminationReason`** — enum: `COMPLETED`, `MAX_TURNS`, `BUDGET_EXHAUSTED`, `SHUTDOWN`, `ERROR` -- **`TurnRecord`** — frozen per-turn stats (tokens, cost, tool calls, finish reason) -- **`ExecutionResult`** — frozen outcome with final context, termination reason, turn records, and optional error message (required when reason is `ERROR`) -- **`BudgetChecker`** — callback type `Callable[[AgentContext], bool]` invoked before each LLM call -- **`ShutdownChecker`** — callback type `Callable[[], bool]` checked at turn boundaries to initiate cooperative shutdown - -#### Loop 1: ReAct (Default for Simple Tasks) - -A single interleaved loop: the agent reasons about the current state, selects an action (tool call or response), observes the result, and repeats until done or `max_turns` is reached. - -```text -┌──────────────────────────────────────────┐ -│ ReAct Loop │ -│ │ -│ ┌─────────┐ ┌──────┐ ┌──────────┐ │ -│ │ Think │──▶│ Act │──▶│ Observe │ │ -│ └─────────┘ └──────┘ └────┬─────┘ │ -│ ▲ │ │ -│ └─────────────────────────┘ │ -│ │ -│ Terminate when: task complete, max │ -│ turns, budget exhausted, or error │ -└──────────────────────────────────────────┘ -``` - -```yaml -execution_loop: "react" # react, plan_execute, hybrid, auto -``` - -- Simple, proven, flexible. Easy to implement. Works well for short tasks -- Token-heavy on long tasks (re-reads full context every turn). No long-term planning — greedy step-by-step -- **Best for**: Simple tasks, quick fixes, single-file changes - -#### Loop 2: Plan-and-Execute - -A two-phase approach: the agent first generates a step-by-step plan, then executes each step sequentially. On failure, the agent can replan. Different models can be used for planning vs execution (e.g., large for planning, small for execution steps). - -```text -┌──────────────────────────────────────────┐ -│ Plan-and-Execute │ -│ │ -│ ┌──────────┐ ┌───────────────────┐ │ -│ │ Plan │───▶│ Execute Steps │ │ -│ │ (1 call) │ │ (N calls) │ │ -│ └──────────┘ └────────┬──────────┘ │ -│ ▲ │ │ -│ └────── replan ──────┘ │ -│ (on step failure) │ -└──────────────────────────────────────────┘ -``` - -```yaml -execution_loop: "plan_execute" -plan_execute: - planner_model: null # null = use agent's model; override for cost optimization - executor_model: null - max_replans: 3 -``` - -- Token-efficient for long tasks. Auditable plan artifact. Supports model tiering -- Rigid — plan may be wrong, replanning is expensive. Over-plans simple tasks -- **Best for**: Complex multi-step tasks, epic-level work, tasks spanning multiple files - -#### Loop 3: Hybrid Plan + ReAct Steps (Recommended for Complex Tasks) - -The agent creates a high-level plan (3-7 steps). Each step is executed as a mini-ReAct loop with its own turn limit. After each step, the agent checkpoints — summarizing progress and optionally replanning remaining steps. Checkpoints are natural points for human inspection or task suspension. - -```text -┌──────────────────────────────────────────────┐ -│ Hybrid: Plan + ReAct Steps │ -│ │ -│ ┌──────────┐ │ -│ │ Plan │ │ -│ └────┬─────┘ │ -│ │ │ -│ ┌────▼────────────────────────────────────┐ │ -│ │ Step 1: mini-ReAct (think→act→observe) │ │ -│ └────┬────────────────────────────────────┘ │ -│ │ checkpoint: summarize progress │ -│ ┌────▼────────────────────────────────────┐ │ -│ │ Step 2: mini-ReAct │ │ -│ └────┬────────────────────────────────────┘ │ -│ │ checkpoint: replan if needed │ -│ ┌────▼────────────────────────────────────┐ │ -│ │ Step N: mini-ReAct │ │ -│ └─────────────────────────────────────────┘ │ -└──────────────────────────────────────────────┘ -``` - -```yaml -execution_loop: "hybrid" -hybrid: - max_plan_steps: 7 - max_turns_per_step: 5 - checkpoint_after_each_step: true - allow_replan: true -``` - -- Strategic planning + tactical flexibility. Natural checkpoints for suspension/inspection -- Most complex to implement. Plan granularity needs tuning per task type -- **Best for**: Complex tasks, multi-file refactoring, tasks requiring both planning and adaptivity - -> **Auto-selection (optional):** When `execution_loop: "auto"`, the framework selects the loop based on `estimated_complexity`: simple → ReAct, medium → Plan-and-Execute, complex/epic → Hybrid. Configurable via `auto_loop_rules` — a mapping of complexity thresholds to loop implementations (e.g., `{simple_max_tokens: 500, medium_max_tokens: 3000}` with corresponding loop assignments). - -#### AgentEngine Orchestrator - -`AgentEngine` (in `engine/agent_engine.py`) is the top-level entry point for running an agent on a task. It composes the execution loop with prompt construction, context management, tool invocation, and cost tracking into a single `run()` call. - -**`async run(identity, task, completion_config?, max_turns?, memory_messages?, timeout_seconds?) -> AgentRunResult`** - -Pipeline steps: - -1. **Validate inputs** — agent must be `ACTIVE`, task must be `ASSIGNED` or `IN_PROGRESS`. Raises `ExecutionStateError` on violation. -2. **Pre-flight budget enforcement** — if `BudgetEnforcer` is provided, check monthly hard stop and daily limit via `check_can_execute()`, then apply auto-downgrade via `resolve_model()`. Raises `BudgetExhaustedError` or `DailyLimitExceededError` on violation. -3. **Build system prompt** — calls `build_system_prompt()` with agent identity and task. Tool definitions are NOT included — they are supplied via the API's `tools` parameter (see D22 below). Follows the **non-inferable-only principle**: system prompts include only information the agent cannot discover by reading the codebase or environment (role constraints, custom conventions, organizational policies). Generic architecture overviews and file structure descriptions are excluded — [research](https://arxiv.org/abs/2602.11988) shows they reduce success rates while increasing costs 20%+. - -> **Decision ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D22):** Do NOT list available tools in the system prompt — the API's `tools` parameter already injects richer tool definitions including JSON schemas. The system prompt listing is strictly inferior (no schemas) and wastes 200-400+ tokens per call. Behavioral guidance ("when to use tool X vs Y") may be added later as non-redundant value. -4. **Create context** — `AgentContext.from_identity()` with the configured `max_turns`. -5. **Seed conversation** — injects system prompt, optional memory messages, and formatted task instruction as initial messages. -6. **Transition task** — `ASSIGNED` → `IN_PROGRESS` (pass-through if already `IN_PROGRESS`). -7. **Prepare tools and budget** — creates `ToolInvoker` from registry and `BudgetChecker` from `BudgetEnforcer` (task + monthly + daily limits with pre-computed baselines and alert deduplication) or from task budget limit alone when no enforcer is configured. -8. **Delegate to loop** — calls `ExecutionLoop.execute()` with context, provider, tool invoker, budget checker, and completion config. If `timeout_seconds` is set, wraps the call in `asyncio.wait_for`; on expiry the run returns with `TerminationReason.ERROR` but cost recording and post-execution processing still occur. -9. **Record costs** — records accumulated `TokenUsage` to `CostTracker` (if available). Cost recording failures are logged but do not affect the result. -10. **Apply post-execution transitions** — on `COMPLETED` termination: IN_PROGRESS → IN_REVIEW → COMPLETED (two-hop auto-complete; reviewers planned). On `SHUTDOWN` termination: current status → INTERRUPTED (see §6.7). On `ERROR` termination: recovery strategy is applied (default `FailAndReassignStrategy` transitions to FAILED; see §6.6). All other termination reasons (`MAX_TURNS`, `BUDGET_EXHAUSTED`) leave the task in its current state. Transition failures are logged but do not discard the successful execution result. -11. **Return result** — wraps `ExecutionResult` in `AgentRunResult` with engine-level metadata. - -Error handling: `MemoryError` and `RecursionError` propagate unconditionally. `BudgetExhaustedError` (including `DailyLimitExceededError`) returns `TerminationReason.BUDGET_EXHAUSTED` without recovery — budget exhaustion is a controlled stop, not a crash. All other exceptions are caught and wrapped in an `AgentRunResult` with `TerminationReason.ERROR`. - -Constructor accepts: `provider` (required), `execution_loop` (defaults to `ReactLoop`), `tool_registry`, `cost_tracker`, `recovery_strategy` (defaults to `FailAndReassignStrategy`), `shutdown_checker`, `budget_enforcer`. The `run()` method also accepts `memory_messages` — optional working memory to inject between the system prompt and task instruction. - -Logs structured events under the `execution.engine.*` namespace (13 constants in `events/execution.py`): creation, start, prompt built, completion, errors, budget stopped, invalid input, task transitions, cost recording outcomes, task metrics, and timeout. - -**`AgentRunResult`** — frozen Pydantic model wrapping `ExecutionResult` with engine metadata: - -- `execution_result` — outcome from the execution loop -- `system_prompt` — the `SystemPrompt` used for this run -- `duration_seconds` — wall-clock run time -- `agent_id`, `task_id` — identifiers -- Computed fields: `termination_reason`, `total_turns`, `total_cost_usd`, `is_success`, `completion_summary` - -### 6.6 Agent Crash Recovery - -When an agent execution fails unexpectedly (unhandled exception, OOM, process kill), the framework needs a recovery mechanism. Recovery strategies are implemented behind a `RecoveryStrategy` protocol, making the system pluggable — new strategies can be added without modifying existing ones. - -> **MVP: Fail-and-Reassign only (Strategy 1).** Checkpoint Recovery is planned. - -**`RecoveryStrategy` protocol:** - -| Method | Signature | Description | -|--------|-----------|-------------| -| `recover` | `async def recover(*, task_execution: TaskExecution, error_message: str, context: AgentContext) -> RecoveryResult` | Apply recovery to a failed task execution | -| `get_strategy_type` | `def get_strategy_type() -> str` | Return strategy type identifier (must not be empty) | - -**`RecoveryResult` model (frozen):** - -| Field | Type | Description | -|-------|------|-------------| -| `task_execution` | `TaskExecution` | Updated execution after recovery (typically `FAILED`) | -| `strategy_type` | `NotBlankStr` | Strategy identifier | -| `context_snapshot` | `AgentContextSnapshot` | Redacted snapshot (turn count, accumulated cost, message count, max turns — no message contents) | -| `error_message` | `NotBlankStr` | Error that triggered recovery | -| `can_reassign` | `bool` (computed) | `retry_count < task.max_retries` | - -#### Strategy 1: Fail-and-Reassign (Default / MVP) - -The engine catches the failure at its outermost boundary, logs a redacted `AgentContext` snapshot (turn count, accumulated cost — excluding message contents to avoid leaking sensitive prompts/tool outputs), transitions the task to `FAILED`, and makes it available for reassignment (manual or automatic via the task router). - -> **Non-terminal state:** `FAILED` is a `TaskStatus` variant alongside `CANCELLED`. `FAILED` differs from `CANCELLED` (which is terminal) in that failed tasks are eligible for automatic reassignment. Valid transitions: `IN_PROGRESS → FAILED`, `ASSIGNED → FAILED` (early setup failures), `FAILED → ASSIGNED` (reassignment). See the updated §6.1 lifecycle diagram. - -```yaml -crash_recovery: - strategy: "fail_reassign" # fail_reassign, checkpoint -``` - -- Simple, no persistence dependency -- All progress is lost on crash — acceptable for short single-agent tasks in the MVP - -On crash: -1. Catch exception at the `AgentEngine` boundary (outermost `try/except` in `AgentEngine.run()`) -2. Log at ERROR with redacted `AgentContextSnapshot` (turn count, accumulated cost, message count, max turns — message contents excluded) -3. Transition `TaskExecution` → `FAILED` with the exception as the failure reason -4. `RecoveryResult.can_reassign` reports whether `retry_count < max_retries` - -> **Current limitation:** The `can_reassign` flag is computed and returned in `RecoveryResult`, but automated reassignment is not yet implemented — the task router (§6.4) will consume this in a future release. The caller (task router) is responsible for incrementing `retry_count` when creating the next `TaskExecution`. - -#### Strategy 2: Checkpoint Recovery (Planned) - -The engine persists an `AgentContext` snapshot after each completed turn. On crash, the framework detects the failure (via heartbeat timeout or exception), loads the last checkpoint, and resumes execution from the exact turn where it left off. The immutable `model_copy(update=...)` pattern makes checkpointing trivial — each `AgentContext` is a complete, self-contained frozen state that serializes cleanly via `model_dump_json()`. - -```yaml -crash_recovery: - strategy: "checkpoint" - checkpoint: - persist_every_n_turns: 1 # checkpoint frequency - storage: "sqlite" # sqlite, filesystem - heartbeat_interval_seconds: 30 # detect unresponsive agents - max_resume_attempts: 2 # retry limit before falling back to fail_reassign -``` - -- Preserves progress — critical for long tasks (multi-step plans, epic-level work) -- Requires persistence layer and environment state reconciliation on resume -- Natural fit with the existing immutable state model - -> **Environment reconciliation:** When resuming from a checkpoint, the agent's tools and workspace may have changed (other agents modified files, external state drifted). The checkpoint strategy includes a reconciliation step: the resumed agent receives a summary of changes since the checkpoint timestamp and can adapt its plan accordingly. This is analogous to a developer returning to a branch after colleagues have pushed changes. - -### 6.7 Graceful Shutdown Protocol - -When the process receives SIGTERM/SIGINT (user Ctrl+C, Docker stop, systemd shutdown), the framework needs to stop cleanly without losing work or leaking costs. Shutdown strategies are implemented behind a `ShutdownStrategy` protocol, making the system pluggable — new strategies can be added without modifying existing ones. - -> **MVP: Cooperative with Timeout only (Strategy 1).** Other strategies are future options enabled by the protocol interface. - -#### Strategy 1: Cooperative with Timeout (Default / MVP) - -The engine sets a shutdown event, stops accepting new tasks, and gives in-flight agents a grace period to finish their current turn. Agents check the shutdown event at turn boundaries (between LLM calls, before tool invocations) and exit cooperatively. After the grace period, remaining agents are force-cancelled. **All tasks terminated by shutdown — whether they exited cooperatively or were force-cancelled — are marked `INTERRUPTED`** by the engine layer. - -```yaml -graceful_shutdown: - strategy: "cooperative_timeout" # cooperative_timeout, immediate, finish_tool, checkpoint - cooperative_timeout: - grace_seconds: 30 # time for agents to finish cooperatively - cleanup_seconds: 5 # time for final cleanup (persist cost records, close connections) -``` - -On shutdown signal: -1. Set `shutdown_event` (`asyncio.Event`) — agents check this at turn boundaries -2. Stop accepting new tasks (drain gate closes) -3. Wait up to `grace_seconds` for agents to exit cooperatively -4. Force-cancel remaining agents (`task.cancel()`) — tasks transition to `INTERRUPTED` -5. Cleanup phase (`cleanup_seconds`): persist cost records, close provider connections, flush logs - -> **Non-terminal status:** `INTERRUPTED` is a `TaskStatus` variant. Unlike `FAILED` (eligible for automatic reassignment) or `CANCELLED` (terminal), `INTERRUPTED` indicates the task was stopped due to process shutdown — regardless of whether the agent exited cooperatively or was force-cancelled — and is eligible for manual or automatic reassignment on restart. Valid transitions: `ASSIGNED → INTERRUPTED`, `IN_PROGRESS → INTERRUPTED`, `INTERRUPTED → ASSIGNED` (reassignment on restart). See the updated §6.1 lifecycle diagram. -> -> **Windows compatibility:** `loop.add_signal_handler()` is not supported on Windows. The implementation uses `signal.signal()` as a fallback. SIGINT (Ctrl+C) works cross-platform; SIGTERM on Windows requires `os.kill()`. -> -> **In-flight LLM calls:** Non-streaming API calls that are interrupted result in tokens billed but no response received (silent cost leak). The engine logs request start (with input token count) before each provider call, so interrupted calls have at minimum an input-cost audit record. Streaming calls are charged only for tokens sent before disconnect. - -#### Strategy 2: Immediate Cancel (Future Option) - -All agent tasks are cancelled immediately via `task.cancel()`. Fastest shutdown but highest data loss — partial tool side effects, billed-but-lost LLM responses. - -#### Strategy 3: Finish Current Tool (Future Option) - -Like cooperative timeout, but waits for the current tool invocation to complete even if it exceeds the grace period. Needs per-tool timeout as a backstop for long-running sandboxed execution. - -#### Strategy 4: Checkpoint and Stop (Planned) - -On shutdown signal, each agent persists its full `AgentContext` snapshot and transitions to `SUSPENDED`. On restart, the engine loads checkpoints and resumes execution. This naturally extends the `CheckpointStrategy` from §6.6 — the only difference is whether the checkpoint was written proactively (graceful shutdown) or loaded from the last turn (crash recovery). - -> **Planned non-terminal status:** `SUSPENDED` is a new `TaskStatus` variant for checkpoint-based shutdown, to be added alongside `INTERRUPTED`. - -### 6.8 Concurrent Workspace Isolation - -> **Current state:** The `WorkspaceIsolationStrategy` protocol, `PlannerWorktreeStrategy` (git worktree backend), `MergeOrchestrator` (sequential merge with configurable conflict escalation), and `WorkspaceIsolationService` (lifecycle orchestrator with rollback and best-effort teardown) are implemented in `engine/workspace/`. `_validate_git_ref` raises context-appropriate exception types (`WorkspaceMergeError` in merge, `WorkspaceCleanupError` in teardown) with matching log events. `_run_git` similarly accepts a `log_event` parameter for context-aware timeout logging. Runtime multi-agent coordination using these components is planned. - -When multiple agents work on the same codebase concurrently, they may need to edit overlapping files. The framework provides a pluggable `WorkspaceIsolationStrategy` protocol for managing concurrent file access. The default strategy combines intelligent task decomposition with git worktree isolation — the dominant industry pattern (used by OpenAI Codex, Cursor, Claude Code, VS Code background agents). - -#### Strategy 1: Planner + Git Worktrees (Default) - -The task planner decomposes work to minimize file overlap across agents. Each agent operates in its own git worktree (shared `.git` object database, independent working tree). On completion, branches are merged sequentially. - -```text -Planner decomposes task: -├─ Agent A: src/auth/ (worktree-A) -├─ Agent B: src/api/ (worktree-B) -└─ Agent C: tests/ (worktree-C) - -Each in isolated git worktree - │ -On completion: sequential merge -├─ Merge A → main -├─ Rebase B on main, merge -└─ Rebase C on main, merge - │ -Textual conflicts: git detects, escalate to human or review agent -Semantic conflicts: review agent evaluates merged result -``` - -```yaml -workspace_isolation: - strategy: "planner_worktrees" # planner_worktrees, sequential, file_locking - planner_worktrees: - max_concurrent_worktrees: 8 - merge_order: "completion" # completion (first done merges first), priority, manual - conflict_escalation: "human" # human, review_agent -``` - -- True filesystem isolation — agents cannot overwrite each other's work -- Maximum parallelism during execution; conflicts deferred to merge time -- Leverages mature git infrastructure for merge, diff, and history - -#### Strategy 2: Sequential Dependencies (Future Option) - -Tasks with overlapping file scopes are ordered sequentially via a dependency graph. Prevents conflicts by construction but limits parallelism. Requires upfront knowledge of which files a task will touch. - -#### Strategy 3: File-Level Locking (Future Option) - -Files are locked at task assignment time. Eliminates conflicts at the source but requires predicting file access — difficult for LLM agents that discover what to edit as they go. Risk of deadlock if multiple agents need overlapping file sets. - -#### State Coordination vs Workspace Isolation - -These are complementary systems handling different types of shared state: - -| State Type | Coordination | Mechanism | -|-----------|-------------|-----------| -| Framework state (tasks, assignments, budget) | Centralized single-writer (`TaskEngine`) | `model_copy(update=...)` via async queue | -| Code and files (agent work output) | Workspace isolation (`WorkspaceIsolationStrategy`) | Git worktrees / branches | -| Agent memory (personal) | Per-agent ownership | Each agent owns its memory exclusively | -| Org memory (shared knowledge) | Single-writer (`OrgMemoryBackend`) | `OrgMemoryBackend` protocol with role-based write access control | - -### 6.9 Task Decomposability & Coordination Topology - -> **Current state:** Task structure classification (`TaskStructureClassifier`), DAG-based decomposition (`DecompositionService`, `DependencyGraph`, `ManualDecompositionStrategy`), LLM-based decomposition (`LlmDecompositionStrategy` with tool calling and JSON content fallback), status rollup (`StatusRollup`), agent-task scoring (`AgentTaskScorer`), routing (`TaskRoutingService`), and auto topology selection (`TopologySelector`) are implemented in `engine/decomposition/` and `engine/routing/`. Workspace isolation (`PlannerWorktreeStrategy`, `MergeOrchestrator`, `WorkspaceIsolationService`) is implemented in `engine/workspace/`. Runtime multi-agent coordination is planned. - -Empirical research on agent scaling ([Kim et al., 2025](https://arxiv.org/abs/2512.08296) — 180 controlled experiments across 3 LLM families and 4 benchmarks) demonstrates that **task decomposability is the strongest predictor of multi-agent effectiveness** — stronger than team size, model capability, or coordination architecture. - -#### Task Structure Classification - -Each task carries a `task_structure` field (see §6.2 Task Definition) classifying its decomposability: - -| Structure | Description | MAS Effect | Example | -|-----------|-------------|------------|---------| -| `sequential` | Steps must execute in strict order; each depends on prior state | **Negative** (−39% to −70%) | Multi-step build processes, ordered migrations, chained API calls | -| `parallel` | Sub-problems can be investigated independently, then synthesized | **Positive** (+57% to +81%) | Financial analysis (revenue + cost + market), multi-file review, research across sources | -| `mixed` | Some sub-tasks are parallel, but a sequential backbone connects phases | **Variable** (depends on ratio) | Feature implementation (design ∥ research → implement → test) | - -Classification can be: -- **Explicit** — set in task config by the task creator or manager agent -- **Inferred** — derived from task properties (tool count, dependency graph, acceptance criteria structure) by the task router - -#### Per-Task Coordination Topology - -The communication pattern (§5.1) is configured at the company level, but **coordination topology can be selected per-task** based on task structure and properties. This allows the engine to use the most efficient coordination approach for each task rather than applying a single company-wide pattern. - -| Task Properties | Recommended Topology | Rationale | -|----------------|---------------------|-----------| -| `sequential` + few artifacts (≤4) | **Single-agent (SAS)** | Coordination overhead fragments reasoning capacity on sequential tasks | -| `parallel` + structured domain | **Centralized** | Orchestrator decomposes, sub-agents execute in parallel, orchestrator synthesizes. Lowest error amplification (4.4×) | -| `parallel` + exploratory/open-ended | **Decentralized** | Peer debate enables diverse exploration of high-entropy search spaces | -| `mixed` | **Context-dependent** | Sequential backbone handled by single agent; parallel sub-tasks delegated to sub-agents | - -#### Auto Topology Selector - -When topology is set to `"auto"`, the engine selects coordination topology based on measurable task properties: - -```yaml -coordination: - topology: "auto" # auto, sas, centralized, decentralized, context_dependent - auto_topology_rules: - # sequential tasks → always single-agent - sequential_override: "sas" - # parallel tasks → select based on domain structure - parallel_default: "centralized" - # mixed tasks → SAS backbone for sequential phases, delegates parallel sub-tasks - mixed_default: "context_dependent" # hybrid: not a single topology — engine selects per-phase -``` - -The auto-selector uses task structure, artifact count, and (when available from the memory subsystem) historical single-agent success rate as inputs. The exact selection logic is an implementation detail — the spec defines the interface and the empirically-grounded heuristics above. - -> **Reference:** These heuristics are derived from Kim et al. (2025), which achieved 87% accuracy predicting optimal architecture from task properties across held-out configurations. Our context differs (role-differentiated agents vs. identical agents), so thresholds should be validated empirically once multi-agent execution is implemented. - ---- - -## 7. Memory & Persistence - -### 7.1 Memory Architecture - -```text -┌─────────────────────────────────────────────┐ -│ Agent Memory System │ -├──────────┬──────────┬───────────┬───────────┤ -│ Working │ Episodic │ Semantic │Procedural │ -│ Memory │ Memory │ Memory │ Memory │ -│ │ │ │ │ -│ Current │ Past │ Knowledge │ Skills & │ -│ task │ events & │ & facts │ how-to │ -│ context │ decisions│ learned │ │ -├──────────┴──────────┴───────────┴───────────┤ -│ Storage Backend │ -│ SQLite / PostgreSQL / File-based │ -│ + Mem0 (initial) / Custom Stack (future) │ -│ See ADR-001 │ -└─────────────────────────────────────────────┘ -``` - -### 7.2 Memory Types - -| Type | Scope | Persistence | Example | -|------|-------|-------------|---------| -| **Working** | Current task | None (in-context) | "I'm implementing the auth endpoint" | -| **Episodic** | Past events | Configurable | "Last sprint we chose JWT over sessions" | -| **Semantic** | Knowledge | Long-term | "This project uses Litestar with aiosqlite" | -| **Procedural** | Skills/patterns | Long-term | "Code reviews require 2 approvals here" | -| **Social** | Relationships | Long-term | "The QA lead prefers detailed test plans" | - -### 7.3 Memory Levels (Configurable) - -```yaml -memory: - level: "persistent" # none, session, project, persistent (default: session) - backend: "mem0" # mem0, custom, cognee, graphiti (future) — see ADR-001 - storage: - data_dir: "/data/memory" # mounted Docker volume path - vector_store: "qdrant" # qdrant (embedded), qdrant-external, etc. - history_store: "sqlite" # sqlite, postgresql - options: - retention_days: null # null = forever - max_memories_per_agent: 10000 - consolidation_interval: "daily" # compress old memories - shared_knowledge_base: true # agents can access shared facts (see §7.4) -``` - -### 7.4 Shared Organizational Memory - -Beyond individual agent memory (§7.1–7.3), the framework needs **organizational memory** — company-wide knowledge that all agents can access: policies, conventions, architecture decision records (ADRs), coding standards, and operational procedures. This is not personal episodic memory ("what I did last Tuesday") but institutional knowledge ("we always use Litestar, not Flask"). - -Shared organizational memory is implemented behind an `OrgMemoryBackend` protocol, making the system highly modular and extensible. New backends can be added without modifying existing ones. - -#### Backend 1: Hybrid Prompt + Retrieval (Default / MVP) - -Critical rules (5-10 items, e.g., "no commits to main," "all PRs need 2 approvals") are injected into every agent's system prompt. Extended knowledge (ADRs, detailed procedures, style guides) is stored in a queryable store and retrieved on demand at task start. - -```yaml -org_memory: - backend: "hybrid_prompt_retrieval" # hybrid_prompt_retrieval, graph_rag, temporal_kg - core_policies: # always in system prompt - - "All code must have 80%+ test coverage" - - "Use Litestar, not Flask" - - "PRs require 2 approvals" - extended_store: - backend: "sqlite" # sqlite, postgresql - max_retrieved_per_query: 5 - write_access: - policies: ["human"] # only humans write core policies - adrs: ["human", "senior", "lead", "c_suite"] - procedures: ["human", "senior", "lead", "c_suite"] -``` - -- Simple to implement. Core rules always present. Extended knowledge scales -- Basic retrieval may miss relational connections between policies - -#### Research Directions - -The following backends illustrate why `OrgMemoryBackend` is a protocol — the architecture supports future upgrades without modifying existing code. These are **not planned implementations**; they are research directions that may inform future work if/when organizational memory needs outgrow the Hybrid Prompt + Retrieval approach. - -#### Backend 2: GraphRAG Knowledge Graph (Research) - -Organizational knowledge stored as entities + relationships in a knowledge graph. Agents query via graph traversal, enabling multi-hop reasoning: "Litestar is our standard" → linked to → "don't use Flask" → linked to → "exception: data team uses Django for admin." - -```yaml -org_memory: - backend: "graph_rag" - graph: - store: "sqlite" # graph stored in relational DB, or dedicated graph DB - entity_extraction: "auto" # auto-extract entities from ADRs and policies -``` - -- Significant accuracy improvement over vector-only retrieval (some benchmarks report 3–4x gains). Multi-hop reasoning captures policy relationships -- More complex infrastructure. Entity extraction can be noisy. Heavier setup - -#### Backend 3: Temporal Knowledge Graph (Research) - -Like GraphRAG but tracks how facts change over time. "We used Flask until March 2026, then switched to Litestar." Agents see current truth but can query history for context. - -```yaml -org_memory: - backend: "temporal_kg" - temporal: - track_changes: true - history_retention_days: null # null = forever -``` - -- Handles policy evolution naturally. Agents understand when and why things changed -- Most complex. Potentially overkill for small companies or local-first use - -> **Extensibility:** All backends implement the `OrgMemoryBackend` protocol (`query(OrgMemoryQuery) → tuple[OrgFact, ...]`, `write(OrgFactWriteRequest, *, author: OrgFactAuthor) → NotBlankStr`, `list_policies() → tuple[OrgFact, ...]`, plus `connect`/`disconnect`/`health_check`/`is_connected`/`backend_name` lifecycle). The MVP ships with Backend 1; Backends 2 and 3 are research directions that may be explored if the default approach proves insufficient. The selected memory layer backend Mem0 (ADR-001) provides optional graph memory via Neo4j/FalkorDB, which could reduce implementation effort for Backends 2-3. -> **Write access control:** Core policies are human-only. ADRs and procedures can be written by senior+ agents. All writes are versioned and auditable. This prevents agents from corrupting shared organizational knowledge while allowing senior agents to document decisions. - -### 7.5 Memory Backend Protocol - -Agent memory (§7.1–7.4) is implemented behind a pluggable `MemoryBackend` protocol (Mem0 initial, custom stack future — ADR-001). Application code depends only on the protocol; the storage engine is an implementation detail swappable via config. - -#### Enums - -| Enum | Values | Purpose | -|------|--------|---------| -| `MemoryCategory` | WORKING, EPISODIC, SEMANTIC, PROCEDURAL, SOCIAL | Memory type categories (§7.2) | -| `MemoryLevel` | PERSISTENT, PROJECT, SESSION, NONE | Persistence level per agent (§7.3) | -| `ConsolidationInterval` | HOURLY, DAILY, WEEKLY, NEVER | How often old memories are compressed | - -#### MemoryBackend Protocol - -```python -@runtime_checkable -class MemoryBackend(Protocol): - """Lifecycle + CRUD for agent memory storage.""" - - async def connect(self) -> None: ... - async def disconnect(self) -> None: ... - async def health_check(self) -> bool: ... - - @property - def is_connected(self) -> bool: ... - @property - def backend_name(self) -> NotBlankStr: ... - - async def store(self, agent_id: NotBlankStr, request: MemoryStoreRequest) -> NotBlankStr: ... - async def retrieve(self, agent_id: NotBlankStr, query: MemoryQuery) -> tuple[MemoryEntry, ...]: ... - async def get(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> MemoryEntry | None: ... - async def delete(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> bool: ... - async def count(self, agent_id: NotBlankStr, *, category: MemoryCategory | None = None) -> int: ... -``` - -#### MemoryCapabilities Protocol - -Backends that implement `MemoryCapabilities` expose what features they support, enabling runtime capability checks before attempting operations. - -```python -@runtime_checkable -class MemoryCapabilities(Protocol): - """Capability discovery for memory backends.""" - - @property - def supported_categories(self) -> frozenset[MemoryCategory]: ... - @property - def supports_graph(self) -> bool: ... - @property - def supports_temporal(self) -> bool: ... - @property - def supports_vector_search(self) -> bool: ... - @property - def supports_shared_access(self) -> bool: ... - @property - def max_memories_per_agent(self) -> int | None: ... -``` - -#### SharedKnowledgeStore Protocol - -Backends that support cross-agent shared knowledge implement this protocol alongside `MemoryBackend`. Not all backends need cross-agent queries — this keeps the base protocol clean. - -```python -@runtime_checkable -class SharedKnowledgeStore(Protocol): - """Cross-agent shared knowledge operations.""" - - async def publish(self, agent_id: NotBlankStr, request: MemoryStoreRequest) -> NotBlankStr: ... - async def search_shared(self, query: MemoryQuery, *, exclude_agent: NotBlankStr | None = None) -> tuple[MemoryEntry, ...]: ... - async def retract(self, agent_id: NotBlankStr, memory_id: NotBlankStr) -> bool: ... -``` - -#### Error Hierarchy - -All memory errors inherit from `MemoryError` so callers can catch the entire family with a single except clause. - -| Error | When Raised | -|-------|------------| -| `MemoryError` | Base exception for all memory operations | -| `MemoryConnectionError` | Backend connection cannot be established or is lost | -| `MemoryStoreError` | A store or delete operation fails | -| `MemoryRetrievalError` | A retrieve, search, or count operation fails | -| `MemoryNotFoundError` | A specific memory ID is not found | -| `MemoryConfigError` | Memory configuration is invalid | -| `MemoryCapabilityError` | An unsupported operation is attempted for a backend | - -#### Configuration - -```yaml -memory: - backend: "mem0" - level: "persistent" # none, session, project, persistent (default: session) - storage: - data_dir: "/data/memory" - vector_store: "qdrant" - history_store: "sqlite" - options: - retention_days: null # null = forever - max_memories_per_agent: 10000 - consolidation_interval: "daily" - shared_knowledge_base: true -``` - -Configuration is modeled by `CompanyMemoryConfig` (top-level), `MemoryStorageConfig` (storage paths/backends), and `MemoryOptionsConfig` (behaviour tuning). All are frozen Pydantic models. The `create_memory_backend(config)` factory returns an isolated `MemoryBackend` instance per company. - -#### Consolidation & Retention Configuration - -Memory consolidation, retention enforcement, and archival are configured via frozen Pydantic models in `memory/consolidation/config.py`: - -| Config | Purpose | -|--------|---------| -| `ConsolidationConfig` | Top-level: `max_memories_per_agent` limit, nested `retention` and `archival` sub-configs | -| `RetentionConfig` | Per-category `RetentionRule` tuples (category + retention_days), optional `default_retention_days` fallback | -| `ArchivalConfig` | Enables/disables archival of consolidated entries to `ArchivalStore` | - -Note: Retention is currently per-category, not per-agent. Per-agent retention overrides are a scope gap to be addressed in a future iteration. - -### 7.6 Operational Data Persistence - -Agent memory (§7.1–7.5) is handled by the `MemoryBackend` protocol (Mem0 initial, custom stack future — ADR-001). **Operational data** — tasks, cost records, messages, audit logs — is a separate concern managed by a pluggable `PersistenceBackend` protocol. Application code depends only on repository protocols; the storage engine is an implementation detail swappable via config. - -```text -┌──────────────────────────────────────────────────────────────────┐ -│ Application Code │ -│ engine/ budget/ communication/ security/ │ -│ │ │ │ │ │ -│ ▼ ▼ ▼ ▼ │ -│ ┌──────┐ ┌──────┐ ┌──────────┐ ┌──────────┐ │ -│ │ Task │ │ Cost │ │ Message │ │ Audit │ ← Repository │ -│ │ Repo │ │ Repo │ │ Repo │ │ Repo │ Protocols │ -│ └──┬───┘ └──┬───┘ └────┬─────┘ └────┬─────┘ │ -│ └────────┴──────────┴────────────┘ │ -│ │ │ -│ ┌───────────────────┴───────────────────────────────────────┐ │ -│ │ PersistenceBackend (protocol) │ │ -│ │ connect() · disconnect() · health_check() · migrate() │ │ -│ └───────────────────┬───────────────────────────────────────┘ │ -│ │ │ -│ ┌───────────────────┴───────────────────────────────────────┐ │ -│ │ SQLitePersistenceBackend (initial) │ │ -│ │ PostgresPersistenceBackend (future) │ │ -│ │ MariaDBPersistenceBackend (future) │ │ -│ └───────────────────────────────────────────────────────────┘ │ -└──────────────────────────────────────────────────────────────────┘ -``` - -#### Protocol Design - -```python -@runtime_checkable -class PersistenceBackend(Protocol): - """Lifecycle management for operational data storage.""" - - async def connect(self) -> None: ... - async def disconnect(self) -> None: ... - async def health_check(self) -> bool: ... - async def migrate(self) -> None: ... - - @property - def is_connected(self) -> bool: ... - @property - def backend_name(self) -> NotBlankStr: ... - - @property - def tasks(self) -> TaskRepository: ... - @property - def cost_records(self) -> CostRecordRepository: ... - @property - def messages(self) -> MessageRepository: ... - # ... plus lifecycle_events, task_metrics, collaboration_metrics, - # parked_contexts, audit_entries -``` - -Each entity type has its own repository protocol: - -```python -@runtime_checkable -class TaskRepository(Protocol): - """CRUD + query interface for Task persistence.""" - - async def save(self, task: Task) -> None: ... - async def get(self, task_id: str) -> Task | None: ... - async def list_tasks(self, *, status: TaskStatus | None = None, assigned_to: str | None = None, project: str | None = None) -> tuple[Task, ...]: ... - async def delete(self, task_id: str) -> bool: ... - -@runtime_checkable -class CostRecordRepository(Protocol): - """CRUD + aggregation interface for CostRecord persistence.""" - - async def save(self, record: CostRecord) -> None: ... - async def query(self, *, agent_id: str | None = None, task_id: str | None = None) -> tuple[CostRecord, ...]: ... - async def aggregate(self, *, agent_id: str | None = None) -> float: ... - -@runtime_checkable -class MessageRepository(Protocol): - """CRUD + query interface for Message persistence.""" - - async def save(self, message: Message) -> None: ... - async def get_history(self, channel: str, *, limit: int | None = None) -> tuple[Message, ...]: ... -``` - -#### Configuration - -```yaml -persistence: - backend: "sqlite" # sqlite, postgresql, mariadb (future) - sqlite: - path: "/data/synthorg.db" # database file path (mounted volume in Docker) - wal_mode: true # WAL for concurrent read performance - journal_size_limit: 67108864 # 64 MB WAL journal limit - # postgresql: # future - # url: "postgresql://user:pass@host:5432/synthorg" - # pool_size: 10 - # mariadb: # future - # url: "mariadb://user:pass@host:3306/synthorg" - # pool_size: 10 -``` - -#### Entities Persisted - -| Entity | Source Module | Repository | Key Queries | -|--------|-------------|------------|-------------| -| `Task` | `core/task.py` | `TaskRepository` | by status, by assignee, by project | -| `CostRecord` | `budget/cost_record.py` | `CostRecordRepository` | by agent, by task, aggregations | -| `Message` | `communication/message.py` | `MessageRepository` | by channel | -| `AuditEntry` | `security/models.py` | `AuditRepository` | by agent, by action type, by verdict, by risk level, time range | -| `ParkedContext` | `security/timeout/parked_context.py` | `ParkedContextRepository` | by execution_id, by agent_id, by task_id | -| Agent runtime state (planned) | `engine/` | `AgentStateRepository` (planned) | by agent_id, active agents | - -#### Migration Strategy - -- Migrations run programmatically at startup via `PersistenceBackend.migrate()` -- Initial migration creates all tables -- Versioned migrations implemented per-backend (e.g. `persistence/sqlite/migrations.py` for SQLite) -- SQLite uses `user_version` pragma for version tracking; PostgreSQL/MariaDB use a migrations table - -#### Key Principles - -- **App code never imports a concrete backend** — only repository protocols -- **Adding a new backend** requires implementing `PersistenceBackend` + all repository protocols — no changes to consumers -- **Same entity models everywhere** — repositories accept and return the existing frozen Pydantic models (Task, CostRecord, Message), no ORM models or data transfer objects -- **Async throughout** — all repository methods are async, matching the project's concurrency model - -#### Multi-Tenancy - -Each company gets its own database. The `PersistenceConfig` embedded in a company's `RootConfig` specifies the backend type and connection details (e.g. a unique SQLite file path or PostgreSQL database URL). The `create_backend(config)` factory returns an isolated `PersistenceBackend` instance per company — no shared state, no cross-company data leakage. - -```python -# One database per company — configured in each company's YAML -company_a_backend = create_backend(company_a_config.persistence) -company_b_backend = create_backend(company_b_config.persistence) -# Each backend has independent lifecycle: connect → migrate → use → disconnect -``` - -#### Future: Runtime Backend Switching - -Runtime backend switching (e.g. migrating a company from SQLite to PostgreSQL during operation) is a planned future capability. The protocol-based design already supports this — the engine would disconnect the current backend, connect a new one with different config, and migrate. Implementation details (data migration tooling, zero-downtime switchover, connection draining) are deferred to the PostgreSQL backend implementation. - -### 7.7 Memory Injection Strategies - -Agent memory reaches agents through pluggable injection strategies behind -the `MemoryInjectionStrategy` protocol. The strategy determines *how* -memories are surfaced to the agent during execution. - -#### Strategy 1: Context Injection (Default / MVP) - -Pre-retrieves relevant memories before execution, ranks by -relevance+recency, enforces token budget, formats as ChatMessage(s) -injected between system prompt and task instruction. Agent passively -receives memories. - -> **Non-inferable filter:** Retrieved memories should be filtered before injection to exclude content the agent can discover by reading the codebase or environment. Only inject memories containing non-inferable information: prior decisions, learned conventions, interpersonal context, historical outcomes. [Research](https://arxiv.org/abs/2602.11988) shows generic context increases cost 20%+ with minimal success improvement; LLM-generated context can actually reduce success rates. -> -> **Decision ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D23):** Pluggable `MemoryFilterStrategy` protocol. Initial: tag-based at write time. Define `non-inferable` tag convention with advisory validation at `MemoryBackend.store()` boundary (warns on missing tags, never blocks). System prompt instructs agents what qualifies: design rationale, team decisions, "why not X", cross-repo knowledge = non-inferable; code structure, API signatures, file contents = inferable. Uses existing `MemoryMetadata.tags` and `MemoryQuery.tags` — zero new models needed. Future strategies: LLM classification at retrieval, keyword/pattern heuristic. - -Pipeline: `MemoryBackend.retrieve()` -> rank by relevance+recency -> -filter by min_relevance -> apply `MemoryFilterStrategy` (D23, optional) -> -greedy token-budget packing -> format as ChatMessage (configured role: -SYSTEM or USER) with delimiters. - -Ranking algorithm: -1. `relevance = entry.relevance_score ?? config.default_relevance` -2. Personal entries: `relevance = min(relevance + personal_boost, 1.0)` -3. `recency = exp(-decay_rate * age_hours)` -4. `combined = relevance_weight * relevance + recency_weight * recency` -5. Filter: `combined >= min_relevance` -6. Sort descending by `combined_score` - -Shared memories (from `SharedKnowledgeStore`) are fetched in parallel, -merged with personal memories (no personal_boost for shared), and -ranked together. - -#### Strategy 2: Tool-Based Retrieval (Future) - -Agent has `recall_memory` / `search_memory` tools it calls on-demand -during execution. Agent actively decides when and what to remember. -More token-efficient (only retrieves when needed) but consumes -tool-call turns and requires agent discipline to invoke. - -#### Strategy 3: Self-Editing Memory (Future) - -Agent has structured memory blocks (core, archival, recall) it reads -AND writes during execution via dedicated tools. Core memory always -in context, archival/recall searched via tools. Most sophisticated -(Letta/MemGPT-inspired) but highest complexity and LLM overhead. - -#### Protocol - -All strategies implement `MemoryInjectionStrategy`: -- `prepare_messages(agent_id, query_text, token_budget) -> tuple[ChatMessage, ...]` -- `get_tool_definitions() -> tuple[ToolDefinition, ...]` -- `strategy_name -> str` - -Strategy selection via config: `memory.retrieval.strategy: context | tool_based | self_editing` - ---- - -## 8. HR & Workforce Management - -> **Implementation note:** Hiring pipeline (`HiringService`), offboarding pipeline -> (`OffboardingService`), onboarding checklists (`OnboardingService`), and agent registry -> (`AgentRegistryService`) are now implemented. Performance tracking subsystem -> (`hr/performance/`) complete with pluggable quality scoring, collaboration scoring, -> trend detection, and multi-window aggregation. Promotions/demotions (section 8.4) -> are implemented in `hr/promotion/` — ThresholdEvaluator (D13), SeniorityApprovalStrategy -> (D14), SeniorityModelMapping (D15), PromotionService orchestrator. - -### 8.1 Hiring Process - -The HR system manages the agent workforce dynamically: - -1. HR agent (or human) identifies skill gap or workload issue -2. HR generates **candidate cards** based on team needs: - - What skills are underrepresented? - - What seniority level is needed? - - What personality would complement the team? - - What model/provider fits the budget? -3. Candidate cards are presented for approval (to CEO or human) -4. Approved candidates are instantiated and onboarded -5. Onboarding includes: company context, project briefing, team introductions. - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D8):** -> -> - **D8.1 — Source:** Templates + LLM customization. Templates for common roles (reuses existing template system §14.1). LLM generates config for novel roles not covered by templates. Approval gate catches invalid/bad configs before instantiation. -> - **D8.2 — Persistence:** Operational store via `PersistenceBackend` (§7.6). YAML stays as bootstrap seed — operational store wins for runtime state. Enables rehiring, auditable history. -> - **D8.3 — Hot-plug:** Agents are hot-pluggable at runtime via a dedicated company/registry service (not `AgentEngine`, which remains the per-agent task runner). Thread-safe registry, wired into message bus + tools + budget. - -### 8.2 Firing / Offboarding - -1. Triggered by: budget cuts, poor performance metrics, project completion, human decision -2. Agent's memory is archived (not deleted) -3. Active tasks are reassigned -4. Team is notified - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D9, D10):** -> -> - **D9 — Task Reassignment:** Pluggable `TaskReassignmentStrategy` protocol. Initial: queue-return — tasks return to unassigned queue, existing `TaskRoutingService` (§6.4) re-routes with priority boost for reassigned tasks. Future strategies: same-department/lowest-load, manager-decides (LLM), HR agent decides. -> - **D10 — Memory Archival:** Pluggable `MemoryArchivalStrategy` protocol. Initial: full snapshot, read-only. Pipeline: retrieve all → archive to `ArchivalStore` → selectively promote semantic+procedural to `OrgMemoryBackend` (rule-based) → clean hot store → mark TERMINATED. Rehiring = restore archived memories into new `AgentIdentity`. Future strategies: selective discard, full-accessible. - -### 8.3 Performance Tracking - -```yaml -agent_metrics: - tasks_completed: 42 - tasks_failed: 2 - average_quality_score: 8.5 # from code reviews, peer feedback - average_cost_per_task: 0.45 - average_completion_time: "2h" - collaboration_score: 7.8 # peer ratings - last_review_date: "2026-02-20" -``` - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D2, D3, D11, D12):** -> -> - **D2 — Quality Scoring:** Pluggable `QualityScoringStrategy` protocol. Initial: layered combination — (1) FREE: objective CI signals (test pass/fail, lint, coverage delta), (2) ~$1/day: small-model LLM judge (different family than agent) evaluates output vs acceptance criteria, (3) on-demand: human override via API, highest weight. Start with Layer 1 only; add layers incrementally. Future strategies: CI-only, LLM-only, human-only. -> - **D3 — Collaboration Scoring:** Pluggable `CollaborationScoringStrategy` protocol. Initial: automated behavioral telemetry — `collaboration_score = weighted_average(delegation_success_rate, delegation_response_latency, conflict_resolution_constructiveness, meeting_contribution_rate, loop_prevention_score, handoff_completeness)`. Weights configurable per-role. Optional: periodic LLM sampling (1%) for calibration + human override via API. Future strategies: LLM evaluation, peer ratings, human-provided. -> - **D11 — Rolling Windows:** Pluggable `MetricsWindowStrategy` protocol. Initial: multiple simultaneous windows — 7d (acute regressions), 30d (sustained patterns), 90d (baseline/drift). Min 5 data points per window; below that, report "insufficient data." Future strategies: fixed single window, per-metric configurable. -> - **D12 — Trend Detection:** Pluggable `TrendDetectionStrategy` protocol. Initial: Theil-Sen regression slope per window + configurable thresholds classify as improving/stable/declining. Theil-Sen has 29.3% outlier breakdown (tolerates ~1 in 3 bad data points). Min 5 data points. Future strategies: period-over-period, OLS regression, threshold-only. - -### 8.4 Promotions & Demotions - -Agents can move between seniority levels based on performance: -- Promotion criteria: sustained high quality scores, task complexity handled, peer feedback -- Demotion criteria: repeated failures, quality drops, cost inefficiency -- Promotions can unlock higher tool access levels (see Progressive Trust) -- Model upgrades/downgrades may accompany level changes (configurable) - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D13, D14, D15):** -> -> - **D13 — Promotion Criteria:** Pluggable `PromotionCriteriaStrategy` protocol. Initial: configurable threshold gates. `ThresholdEvaluator` with `min_criteria_met: int` (N of M) + `required_criteria: list[str]`. Setting min=total gives AND; min=1 gives OR. Default: junior→mid = 2 of 3 criteria, mid→senior = all. Future strategies: pure AND, pure OR. -> - **D14 — Promotion Approval:** Pluggable `PromotionApprovalStrategy` protocol. Initial: senior+ requires human approval. Junior→mid auto-promotes (low cost impact: small→medium ~4x). Demotions: auto-apply for cost-saving (model downgrade), human approval for authority-reducing demotions. Future strategies: all-human, configurable-per-level. -> - **D15 — Model Mapping:** Pluggable `ModelMappingStrategy` protocol. Initial: default ON — `hr.promotions.model_follows_seniority: true`. Model changes at task boundaries only (never mid-execution, consistent with auto-downgrade §10.4). Per-agent `preferred_model` overrides seniority default. Smart routing (§9.4) still uses cheap models for simple tasks regardless of seniority. Future strategies: always-applied, opt-in-only. - ---- - -## 9. Model Provider Layer - -### 9.1 Provider Abstraction - -```text -┌─────────────────────────────────────────────┐ -│ Unified Model Interface │ -│ completion(messages, tools, config) → resp │ -├───────────┬───────────┬───────────┬─────────┤ -│ Cloud API │OpenRouter │ Ollama │ Custom │ -│ Adapter │ Adapter │ Adapter │ Adapter │ -├───────────┼───────────┼───────────┼─────────┤ -│ Direct │ 400+ LLMs│ Local LLMs│ Any API │ -│ API call │ via OR │ Self-host │ │ -└───────────┴───────────┴───────────┴─────────┘ -``` - -### 9.2 Provider Configuration - -> Note: Model IDs, pricing, and provider examples below are **illustrative**. Actual models, costs, and provider availability will be determined during implementation and should be loaded dynamically from provider APIs where possible. - -```yaml -providers: - example-provider: - api_key: "${PROVIDER_API_KEY}" - models: # example entries — real list loaded from provider - - id: "example-large-001" - alias: "large" - cost_per_1k_input: 0.015 # illustrative, verify at implementation time - cost_per_1k_output: 0.075 - max_context: 200000 - estimated_latency_ms: 1500 # optional, used by fastest strategy - - id: "example-medium-001" - alias: "medium" - cost_per_1k_input: 0.003 - cost_per_1k_output: 0.015 - max_context: 200000 - estimated_latency_ms: 500 - - id: "example-small-001" - alias: "small" - cost_per_1k_input: 0.0008 - cost_per_1k_output: 0.004 - max_context: 200000 - estimated_latency_ms: 200 - - openrouter: - api_key: "${OPENROUTER_API_KEY}" - base_url: "https://openrouter.ai/api/v1" - models: # example entries - - id: "vendor-a/model-medium" - alias: "or-medium" - - id: "vendor-b/model-pro" - alias: "or-pro" - - id: "vendor-c/model-reasoning" - alias: "or-reasoning" - - ollama: - base_url: "http://localhost:11434" - models: # example entries - - id: "llama3.3:70b" - alias: "local-llama" - cost_per_1k_input: 0.0 # free, local - cost_per_1k_output: 0.0 - - id: "qwen2.5-coder:32b" - alias: "local-coder" - cost_per_1k_input: 0.0 - cost_per_1k_output: 0.0 -``` - -> **Implementation note:** `ProviderConfig` now includes `subscription: SubscriptionConfig` and `degradation: DegradationConfig` fields for per-provider quota limits and subscription-aware degradation behavior. The default degradation strategy is `ALERT` (raise `QuotaExhaustedError`). `FALLBACK` (route to fallback providers) and `QUEUE` (delay and retry) strategies are defined in the model but **not yet implemented** — the engine currently always raises on quota exhaustion regardless of strategy. Regular quota polling / proactive alerting before quotas are hit is deferred to a follow-up issue. - -### 9.3 LiteLLM Integration - -Use **LiteLLM** as the provider abstraction layer: -- Unified API across 100+ providers -- Built-in cost tracking -- Automatic retries and fallbacks -- Load balancing across providers -- OpenAI-compatible interface (all providers normalized) - -### 9.4 Model Routing Strategy - -```yaml -routing: - strategy: "smart" # smart, cheapest, fastest, role_based, cost_aware, manual - # Strategy behaviors: - # manual — resolve an explicit model override; fails if not set - # role_based — match agent seniority level to routing rules, then catalog default - # cost_aware — match task-type rules, then pick cheapest model within budget - # cheapest — alias for cost_aware - # fastest — match task-type rules, then pick fastest model (by estimated_latency_ms) - # within budget; falls back to cheapest when no latency data is available - # smart — priority cascade: override > task-type > role > seniority > cheapest > fallback chain - rules: - - role_level: "C-Suite" - preferred_model: "large" - fallback: "medium" - - role_level: "Senior" - preferred_model: "medium" - fallback: "small" - - role_level: "Junior" - preferred_model: "small" - fallback: "local-small" - - task_type: "code_review" - preferred_model: "medium" - - task_type: "documentation" - preferred_model: "small" - - task_type: "architecture" - preferred_model: "large" - fallback_chain: - - "example-provider" - - "openrouter" - - "ollama" -``` - ---- - -## 10. Cost & Budget Management - -### 10.1 Budget Hierarchy - -```text -Company Budget ($100/month) - ├── Engineering Dept (50%) ── $50 - │ ├── Backend Team (40%) ── $20 - │ ├── Frontend Team (30%) ── $15 - │ └── DevOps Team (30%) ── $15 - ├── Quality/QA (10%) ── $10 - ├── Product Dept (15%) ── $15 - ├── Operations (10%) ── $10 - └── Reserve (15%) ── $15 -``` - -> Note: Percentages are illustrative defaults. All allocations are configurable per company. - -### 10.2 Cost Tracking - -Every API call is tracked (illustrative schema): - -```json -{ - "agent_id": "sarah_chen", - "task_id": "task-123", - "provider": "example-provider", - "model": "example-medium-001", - "input_tokens": 4500, - "output_tokens": 1200, - "cost_usd": 0.0315, - "timestamp": "2026-02-27T10:30:00Z" -} -``` - -> **Implementation note:** `CostRecord` stores `input_tokens` and `output_tokens`; `total_tokens` is not stored on `CostRecord` — it is a `@computed_field` property on `TokenUsage` (the model embedded in `CompletionResponse`). `_SpendingTotals` base class provides shared `total_cost_usd`, `total_input_tokens`, `total_output_tokens`, and `record_count` fields. `AgentSpending`, `DepartmentSpending`, and `PeriodSpending` extend it with their dimension-specific fields. - -### 10.3 CFO Agent Responsibilities - -> **Current state:** Budget tracking, per-task cost recording, and cost controls (§10.4) are enforced by `BudgetEnforcer` (a service the engine composes, not an agent). CFO cost optimization is implemented via `CostOptimizer`. - -The CFO agent (when enabled) acts as a cost management system: - -- Monitors real-time spending across all agents -- Alerts when departments approach budget limits -- Suggests model downgrades when budget is tight -- Reports daily/weekly spending summaries -- Recommends hiring/firing based on cost efficiency -- Blocks tasks that would exceed remaining budget -- Optimizes model routing for cost/quality balance - -> **Implementation note:** `CostOptimizer` service (`budget/optimizer.py`) -> implements anomaly detection (sigma + spike factor), per-agent efficiency -> analysis, model downgrade recommendations (via `ModelResolver`), routing -> optimization suggestions (cost + context-window comparison), and operation -> approval evaluation. `ReportGenerator` service (`budget/reports.py`) -> produces multi-dimensional spending reports with task/provider/model -> breakdowns and period-over-period comparison. - -### 10.4 Cost Controls - -> **Minimal config:** -> -> ```yaml -> budget: -> total_monthly: 100.00 -> ``` -> -> All other fields below have sensible defaults. - -```yaml -budget: - total_monthly: 100.00 - reset_day: 1 - alerts: - warn_at: 75 # percent - critical_at: 90 - hard_stop_at: 100 - per_task_limit: 5.00 - per_agent_daily_limit: 10.00 - auto_downgrade: - enabled: true - threshold: 85 # percent of budget used - boundary: "task_assignment" # task_assignment only — NEVER mid-execution - downgrade_map: # ordered pairs — aliases reference configured models - - ["large", "medium"] - - ["medium", "small"] - - ["small", "local-small"] -``` - -> **Auto-downgrade boundary:** Model downgrades apply only at **task assignment time**, never mid-execution. An agent halfway through an architecture review cannot be switched to a cheaper model — the task completes on its assigned model. The next task assignment respects the downgrade threshold. This prevents quality degradation from mid-thought model switches. - -> **Implementation note:** `BudgetEnforcer` composes `CostTracker` + -> `BudgetConfig` + optional `QuotaTracker` + optional `ModelResolver` to -> provide three enforcement layers: (1) pre-flight checks via -> `check_can_execute` (monthly hard stop + per-agent daily limit + provider -> quota enforcement when `QuotaTracker` is present), (2) in-flight budget -> checking via a sync `BudgetChecker` closure with pre-computed baselines -> (task + monthly + daily limits, alert deduplication), and (3) -> task-boundary auto-downgrade via `resolve_model`. Billing periods are -> scoped by `billing_period_start(reset_day)`. `DailyLimitExceededError` -> is a subclass of `BudgetExhaustedError` for granular error handling. - -### 10.5 LLM Call Analytics - -> **Current state:** Proxy metrics, call categorization + coordination metric data models, and error taxonomy classification pipeline are implemented. Runtime collection pipeline for coordination metrics and full analytics layer are planned. - -Every LLM provider call is tracked with comprehensive metadata for financial reporting, debugging, and orchestration overhead analysis. The analytics system builds incrementally. - -#### Per-Call Tracking + Proxy Overhead Metrics - -Every completion call produces a `CompletionResponse` with `TokenUsage` (token counts and cost). The engine layer creates a `CostRecord` (with agent/task context) and records it into `CostTracker` — the provider itself does not have agent/task context. The engine additionally logs **proxy overhead metrics** at task completion: - -- `turns_per_task` — number of LLM turns to complete the task (from `AgentRunResult.total_turns`) -- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost.total_tokens`) -- `cost_per_task` — total USD cost (from `AgentContext.accumulated_cost.cost_usd` via `AgentRunResult.total_cost_usd`) -- `duration_seconds` — wall-clock execution time in seconds (from `AgentRunResult.duration_seconds`) -- `prompt_tokens` — estimated system prompt tokens (from `SystemPrompt.estimated_tokens`) -- `prompt_token_ratio` — ratio of prompt tokens to total tokens (overhead indicator, `@computed_field`; warns when >0.3) - -These are natural overhead indicators — a task consuming 15 turns and 50k tokens for a one-line fix signals a problem. - -These metrics are captured in `TaskCompletionMetrics` (in `engine/metrics.py`), a frozen Pydantic model with a `from_run_result()` factory method. The engine logs these metrics at task completion via the `EXECUTION_ENGINE_TASK_METRICS` event. - -#### Call Categorization + Orchestration Ratio - -> **Current state:** Data models (`LLMCallCategory`, `CategoryBreakdown`, `OrchestrationRatio`, `CostRecord.call_category`) and query methods (`CostTracker.get_category_breakdown`, `get_orchestration_ratio`) are implemented. Runtime categorization logic (automatic tagging of calls during multi-agent execution) is planned. - -When multi-agent coordination exists, each `CostRecord` is tagged with a **call category**: - -| Category | Description | Examples | -|----------|-------------|---------| -| `productive` | Direct task work — tool calls, code generation, task output | Agent writing code, running tests | -| `coordination` | Inter-agent communication — delegation, reviews, meetings | Manager reviewing work, agent presenting in meeting | -| `system` | Framework overhead — system prompt injection, context loading | Initial prompt, memory retrieval injection | - -The **orchestration ratio** (`coordination / total`) is surfaced in metrics and alerts. If coordination tokens consistently exceed productive tokens, the company configuration needs tuning (fewer approval layers, simpler meeting protocols, etc.). - -#### Coordination Metrics Suite - -A comprehensive suite of coordination metrics derived from empirical agent scaling research ([Kim et al., 2025](https://arxiv.org/abs/2512.08296)). These metrics explain coordination dynamics and enable data-driven tuning of multi-agent configurations. - -| Metric | Symbol | Definition | What It Signals | -|--------|--------|------------|-----------------| -| **Coordination efficiency** | `Ec` | `success_rate / (turns / turns_sas)` — success normalized by relative turn count vs single-agent baseline | Overall coordination ROI. Low Ec = coordination costs exceed benefits | -| **Coordination overhead** | `O%` | `(turns_mas - turns_sas) / turns_sas × 100%` — relative turn increase | Communication cost. Optimal band: 200–300%. Above 400% = over-coordination | -| **Error amplification** | `Ae` | `error_rate_mas / error_rate_sas` — relative failure probability | Whether MAS corrects or propagates errors. Centralized ≈ 4.4×, Independent ≈ 17.2× | -| **Message density** | `c` | Inter-agent messages per reasoning turn | Communication intensity. Performance saturates at ≈ 0.39 messages/turn | -| **Redundancy rate** | `R` | Mean cosine similarity of agent output embeddings | Agent agreement. Optimal at ≈ 0.41 (balances fusion with independence) | - -> **Configurable collection:** All 5 metrics are opt-in via `coordination_metrics.enabled` in analytics config. `Ec` and `O%` are cheap (turn counting). `Ae` requires baseline comparison data. `c` and `R` require semantic analysis of agent outputs (embedding computation). Enable selectively based on data-gathering needs. - -```yaml -coordination_metrics: - enabled: false # opt-in — enable for data gathering - collect: - - efficiency # cheap — turn counting - - overhead # cheap — turn counting - - error_amplification # requires SAS baseline data - - message_density # requires message counting infrastructure - - redundancy # requires embedding computation on outputs - baseline_window: 50 # number of SAS runs to establish baseline for Ae - error_taxonomy: - enabled: false # opt-in — enable for targeted diagnosis - categories: - - logical_contradiction - - numerical_drift - - context_omission - - coordination_failure -``` - -#### Full Analytics Layer (Planned) - -Expanded per-call metadata for comprehensive financial and operational reporting: - -```yaml -call_analytics: - track: - - call_category # productive, coordination, system - - success # true/false - - retry_count # 0 = first attempt succeeded - - retry_reason # rate_limit, timeout, internal_error - - latency_ms # wall-clock time for the call (not estimated_latency_ms from config) - - finish_reason # stop, tool_use, max_tokens, error - - cache_hit # prompt caching hit/miss (provider-dependent) - aggregation: - - per_agent_daily # agent spending over time - - per_task # total cost per task - - per_department # department-level rollups - - per_provider # provider reliability and cost comparison - - orchestration_ratio # coordination vs productive tokens - alerts: - orchestration_ratio: - info: 0.30 # info if coordination > 30% of total - warn: 0.50 # warn if coordination > 50% of total - critical: 0.70 # critical if coordination > 70% of total - retry_rate_warn: 0.1 # warn if > 10% of calls need retries -``` - -> **Design principle:** Analytics metadata is append-only and never blocks execution. Failed analytics writes are logged and skipped — the agent's task is never delayed by telemetry. All analytics data flows through the existing `CostRecord` and structured logging infrastructure. - -#### Coordination Error Taxonomy - -> **Current state:** Error taxonomy classification pipeline is implemented in `engine/classification/`. Four heuristic-based detectors (logical contradiction, numerical drift, context omission, coordination failure) run post-execution when enabled via `error_taxonomy_config`. Integrated into `AgentEngine`. Classification results are log-only; programmatic access is planned. Full semantic analysis detectors are planned. - -When coordination metrics collection is enabled, the system can optionally classify coordination errors into structured categories. This enables targeted diagnosis — e.g., if coordination failures spike, the topology may be too complex; if context omissions spike, the orchestrator's synthesis is insufficient. - -| Error Category | Description | Detection Method | -|---------------|-------------|-----------------| -| **Logical contradiction** | Agent asserts both "X is true" and "X is false", or derives conclusions violating its stated premises | Semantic contradiction detection on agent outputs | -| **Numerical drift** | Accumulated computational errors from cascading rounding or unit conversion (>5% deviation) | Numerical comparison against ground truth or cross-agent verification | -| **Context omission** | Failure to reference previously established entities, relationships, or state required for current reasoning | Missing-reference detection across agent conversation history | -| **Coordination failure** | MAS-specific: message misinterpretation, task allocation conflicts, state synchronization errors between agents | Protocol-level error detection in orchestration layer | - -> **Configurable and opt-in:** Error taxonomy classification requires semantic analysis of agent outputs and is expensive. Enable via `coordination_metrics.error_taxonomy.enabled: true` only when actively gathering data for system tuning. The classification pipeline runs post-execution (never blocks agent work) and logs structured events to the observability layer. This configuration is part of the main `coordination_metrics` block defined in the Coordination Metrics Suite section above. - -> **Reference:** Error categories derived from [Kim et al., 2025](https://arxiv.org/abs/2512.08296) and the Multi-Agent System Failure Taxonomy (MAST) by Cemri et al. (2025). Architecture-specific patterns: centralized coordination reduces logical contradictions by 36.4% and context omissions by 66.8% via orchestrator synthesis; hybrid topology introduces 12.4% coordination failures due to protocol complexity. - ---- - -## 11. Tool & Capability System - -### 11.1 Tool Categories - -| Category | Tools | Typical Roles | -|----------|-------|---------------| -| **File System** | Read, write, edit, list, delete files | All developers, writers | -| **Code Execution** | Run code in sandboxed environments | Developers, QA | -| **Version Control** | Git operations, PR management | Developers, DevOps | -| **Web** | HTTP requests, web scraping, search | Researchers, analysts | -| **Database** | Query, migrate, admin | Backend devs, DBAs | -| **Terminal** | Shell commands (sandboxed) | DevOps, senior devs | -| **Design** | Image generation, mockup tools | Designers | -| **Communication** | Email, Slack, notifications | PMs, executives | -| **Analytics** | Metrics, dashboards, reporting | Data analysts, CFO | -| **Deployment** | CI/CD, container management | DevOps, SRE | -| **MCP Servers** | Any MCP-compatible tool | Configurable per agent | - -### 11.1.1 Tool Execution Model - -When the LLM requests multiple tool calls in a single turn, `ToolInvoker.invoke_all` executes them **concurrently** using `asyncio.TaskGroup`. An optional `max_concurrency` parameter (default unbounded) limits parallelism via `asyncio.Semaphore`. Recoverable errors are captured as `ToolResult(is_error=True)` without aborting sibling invocations; non-recoverable errors (`MemoryError`, `RecursionError`) are collected and re-raised after all tasks complete (bare exception for one, `ExceptionGroup` for multiple). - -`BaseTool.parameters_schema` deep-copies the caller-supplied schema at construction and wraps it in `MappingProxyType` for read-only enforcement; the property returns a deep copy on access to prevent mutation of internal state. `ToolInvoker` deep-copies arguments at the tool execution boundary before passing them to `tool.execute()`. `MappingProxyType` wrapping is also used in `ToolRegistry` for its internal collections. - -**Permission checking:** Each `BaseTool` carries a `category: ToolCategory` attribute used for access-level gating. `ToolInvoker` accepts an optional `ToolPermissionChecker` which enforces the agent's `ToolPermissions.access_level` (see §11.2). Permission checking occurs after tool lookup but before parameter validation: - -1. `get_permitted_definitions()` filters tool definitions sent to the LLM — the agent only sees tools it is permitted to use. -2. At invocation time, denied tools return `ToolResult(is_error=True)` with a descriptive denial reason (defense-in-depth against LLM hallucinating unpresented tools). - -The `ToolPermissionChecker` resolves permissions using a priority-based system: denied list (highest) → allowed list → access-level categories → deny (default). `AgentEngine._make_tool_invoker()` creates a permission-aware invoker from the agent's `ToolPermissions` at the start of each `run()` call. Note: the current implementation provides category-level gating only; the granular sub-constraints described in §11.2 (workspace scope, network mode) are planned for when sandboxing is implemented. - -> **Implementation note — Built-in git tools:** Six workspace-scoped git tools are implemented in `tools/git_tools.py` with a shared `_BaseGitTool` base class in `tools/_git_base.py`: `GitStatusTool`, `GitLogTool`, `GitDiffTool`, `GitBranchTool`, `GitCommitTool`, and `GitCloneTool`. The base class enforces workspace boundary security (path traversal prevention via `resolve()` + `relative_to()`) and provides a common `_run_git()` helper using `asyncio.create_subprocess_exec` (never `shell=True`). Security hardening includes: `GIT_TERMINAL_PROMPT=0` to prevent credential prompts, `GIT_CONFIG_NOSYSTEM=1`, `GIT_CONFIG_GLOBAL=os.devnull`, and `GIT_PROTOCOL_FROM_USER=0` to restrict config/protocol attack surfaces, rejection of flag-like argument values (starting with `-`) for refs, branch names, author filters, date strings, and other git arguments, URL scheme validation on clone (only `https://`, `ssh://`, `git://`, and SCP-like syntax — plain `http://` rejected for security) with `--` separator before positional URL argument, and clone URLs starting with `-` are rejected. All tools return `ToolExecutionResult` for errors rather than raising exceptions. When a `SandboxBackend` is injected, `_run_git()` delegates subprocess management to the sandbox via `_run_git_sandboxed()` — the sandbox handles environment filtering and workspace-scoped cwd enforcement, while `_validate_path` independently enforces workspace boundaries for git path arguments. Git hardening env vars are passed as `env_overrides` to the sandbox, and `SandboxResult` is converted to `ToolExecutionResult` via `_sandbox_result_to_execution_result`. Without a sandbox, the direct-subprocess path is used (backward compatible). Both paths explicitly close the subprocess transport on Windows (via `tools/_process_cleanup.py`) to prevent `ResourceWarning` on `ProactorEventLoop`. **Future:** Consider adding host/IP allowlisting for clone URLs to prevent SSRF against internal networks (loopback, link-local, private ranges). - -### 11.1.2 Tool Sandboxing - -Tool execution requires safety boundaries proportional to the risk of each tool category. The framework uses a **layered sandboxing strategy** with a pluggable `SandboxBackend` protocol — new backends can be added without modifying existing ones. The default configuration uses lighter isolation for low-risk tools and stronger isolation for high-risk tools. - -> **MVP: Subprocess sandbox for file/git tools. Docker optional for code execution.** K8s is future. -> -> **Decision ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D16):** Docker MVP only via `aiodocker` (async-native, Python 3.14 support). Pre-built image (Python 3.14 + Node.js LTS + basic utils, <500MB) + user-configurable via `docker.image` config. **Fail with clear error** if Docker unavailable — no unsafe subprocess fallback for code execution (file/git tools already use `SubprocessSandbox`). gVisor (`--runtime=runsc`) as free config-level hardening upgrade. WASM/Firecracker evaluation planned. `SandboxBackend` protocol makes adding backends trivial. - -#### Sandbox Backends - -| Backend | Isolation | Latency | Dependencies | Status | -|---------|-----------|---------|--------------|--------| -| `SubprocessSandbox` | Process-level: env filtering (allowlist + denylist), restricted PATH (configurable via `extra_safe_path_prefixes`), workspace-scoped cwd, timeout + process-group kill, library injection var blocking, explicit transport cleanup on Windows | ~ms | None | **Implemented** | -| `DockerSandbox` | Container-level: ephemeral container, mounted workspace, no network, resource limits (CPU/memory/time) | ~1-2s cold start | Docker | **Implemented** | -| `K8sSandbox` | Pod-level: per-agent containers, namespace isolation, resource quotas, network policies | ~2-5s | Kubernetes | Future | - -#### Default Layered Configuration - -```yaml -sandboxing: - default_backend: "subprocess" # subprocess, docker, k8s - overrides: # per-category backend overrides - file_system: "subprocess" # low risk — fast, no deps - git: "subprocess" # low risk — workspace-scoped - web: "docker" # medium risk — needs network isolation - code_execution: "docker" # high risk — strong isolation required - terminal: "docker" # high risk — arbitrary commands - database: "docker" # high risk — data mutation; see network note below - subprocess: - timeout_seconds: 30 - workspace_only: true # restrict filesystem access to project dir - restricted_path: true # strip dangerous binaries from PATH - docker: - image: "synthorg-sandbox:latest" # pre-built image with common runtimes - network: "none" # no network by default; per-category overrides below - network_overrides: # category-specific network policies - database: "bridge" # database tools need TCP access to DB host - web: "egress-only" # web tools need outbound HTTP; no inbound - allowed_hosts: [] # allowlist of host:port pairs (e.g. ["db:5432"]) - memory_limit: "512m" - cpu_limit: "1.0" - timeout_seconds: 120 - mount_mode: "ro" # read-only by default; workspace mounted separately - auto_remove: true # ephemeral — container removed after execution - k8s: # future — per-agent pod isolation - namespace: "synthorg-agents" - resource_requests: - cpu: "250m" - memory: "256Mi" - resource_limits: - cpu: "1" - memory: "1Gi" - network_policy: "deny-all" # default deny, allowlist per tool -``` - -> **User experience:** Docker is optional — only required when code execution, terminal, web, or database tools are enabled. File system and git tools work out of the box with subprocess isolation. This keeps the "local first" experience lightweight while providing strong isolation where it matters. - -> **Scaling path:** In a future Kubernetes deployment (§18.2 Phase 3-4), each agent can run in its own pod via `K8sSandbox`. At that point, the layered configuration becomes less relevant — all tools execute within the agent's isolated pod. The `SandboxBackend` protocol makes this transition seamless. - -### 11.1.3 MCP Integration - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D17, D18):** -> -> - **D17 — MCP SDK:** Official `mcp` Python SDK, pinned `==1.26.0`. Thin `MCPBridgeTool` adapter layer isolates the rest of the codebase from SDK API changes. Support **stdio** (local/dev) and **Streamable HTTP** (remote/production) transports. Skip deprecated SSE. v2 migration planned — pin range prevents accidental breaking upgrade. -> - **D18 — MCP Result Mapping:** Adapter in `MCPBridgeTool` keeps `ToolResult` as-is. Mapping: text blocks → concatenate to `content: str`; image/audio → `[image: {mimeType}]` placeholder + base64 in `metadata["attachments"]`; `structuredContent` → `metadata["structured_content"]`; `isError` → `is_error` (1:1). Future: extend `ToolResult` with optional `attachments` when multi-modal LLM tool results are needed. - -### 11.1.4 Action Type System - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D1):** -> -> Action types classify agent actions for use by autonomy presets (§12.2), SecOps validation (§12.3), tiered timeout policies (§12.4), and progressive trust (§11.3). Three sub-decisions: -> -> - **D1.1 — Registry:** `StrEnum` for ~25 built-in action types (type safety, autocomplete, typos caught at compile time) + `ActionTypeRegistry` for custom types via explicit registration. Unknown strings rejected at config load time. Critical for security — a typo in `human_approval` list silently means "skip approval." -> - **D1.2 — Granularity:** Two-level `category:action` hierarchy. Category shortcuts: `auto_approve: ["code"]` expands to all `code:*` actions. Fine-grained: `human_approval: ["code:create"]`. -> -> **Proposed taxonomy (~25 leaf types):** -> -> ```text -> code:read, code:write, code:create, code:delete, code:refactor -> test:write, test:run -> docs:write -> vcs:read, vcs:commit, vcs:push, vcs:branch -> deploy:staging, deploy:production -> comms:internal, comms:external -> budget:spend, budget:exceed -> org:hire, org:fire, org:promote -> db:query, db:mutate, db:admin -> arch:decide -> ``` -> -> - **D1.3 — Classification:** Static tool metadata. Each `BaseTool` declares its `action_type`. Default mapping from `ToolCategory` → action type. Non-tool actions (`org:hire`, `budget:spend`) triggered by engine-level operations. No LLM in the security classification path. - -### 11.2 Tool Access Levels - -```yaml -tool_access: - levels: - sandboxed: - description: "No external access. Isolated workspace." - file_system: "workspace_only" - code_execution: "containerized" - network: "none" - git: "local_only" - - restricted: - description: "Limited external access with approval." - file_system: "project_directory" - code_execution: "containerized" - network: "allowlist_only" - git: "read_and_branch" - requires_approval: ["deployment", "database_write"] - - standard: - description: "Normal development access." - file_system: "project_directory" - code_execution: "containerized" - network: "open" - git: "full" - terminal: "restricted_commands" - - elevated: - description: "Full access for senior/trusted agents." - file_system: "full" - code_execution: "host" - network: "open" - git: "full" - terminal: "full" - deployment: true - - custom: - description: "Per-agent custom configuration." -``` - -> **Implementation note:** The current `ToolPermissionChecker` implements **category-level gating only** — each access level maps to a set of permitted `ToolCategory` values (e.g., `STANDARD` permits `file_system`, `code_execution`, `version_control`, `web`, `terminal`, `analytics`). `SubprocessSandbox` provides workspace-scoped cwd enforcement and env filtering (see §11.1.2). The granular sub-constraints shown above (network mode, containerization) are planned for Docker/K8s sandbox backends. - -### 11.3 Progressive Trust - -Agents can earn higher tool access over time through configurable trust strategies. The trust system implements a `TrustStrategy` protocol, making it extensible. Multiple strategies are available, selectable via config. - -> **Current state:** All four strategies are implemented behind the `TrustStrategy` protocol: `DisabledTrustStrategy`, `WeightedTrustStrategy`, `PerCategoryTrustStrategy`, `MilestoneTrustStrategy`. Default is disabled (static access) — agents receive their configured access level at hire time. -> -> **Security invariant (all strategies):** The `standard_to_elevated` promotion **always** requires human approval. No agent can auto-gain production access regardless of trust strategy. - -#### Strategy: Disabled (Static Access) — Default - -Trust is disabled. Agents receive their configured access level at hire time and it never changes. Simplest option — useful when the human manages permissions manually. - -```yaml -trust: - strategy: "disabled" # disabled, weighted, per_category, milestone - initial_level: "standard" # fixed access level for all agents -``` - -#### Strategy: Weighted Score (Single Track) - -A single trust score computed from weighted factors: task difficulty completed, error rate, time active, and human feedback. One global trust level per agent, applied to all tool categories. - -```yaml -trust: - strategy: "weighted" - initial_level: "sandboxed" - weights: - task_difficulty: 0.3 # harder tasks completed = more trust - completion_rate: 0.25 - error_rate: 0.25 # inverse — fewer errors = more trust - human_feedback: 0.2 - promotion_thresholds: - sandboxed_to_restricted: 0.4 - restricted_to_standard: 0.6 - standard_to_elevated: - score: 0.8 - requires_human_approval: true # always human-gated -``` - -- Simple model, easy to understand. One number to track -- Too coarse — an agent trusted for file edits shouldn't auto-get deployment access - -#### Strategy: Per-Category Trust Tracks - -Separate trust tracks per tool category (filesystem, git, deployment, database, network). An agent can be "standard" for files but "sandboxed" for deployment. Promotion criteria differ per category. Human approval gate required for any production-touching category. - -```yaml -trust: - strategy: "per_category" - initial_levels: - file_system: "restricted" - git: "restricted" - code_execution: "sandboxed" - deployment: "sandboxed" - database: "sandboxed" - terminal: "sandboxed" - promotion_criteria: - file_system: - restricted_to_standard: - tasks_completed: 10 - quality_score_min: 7.0 - deployment: - sandboxed_to_restricted: - tasks_completed: 20 - quality_score_min: 8.5 - requires_human_approval: true # always human-gated for deployment -``` - -- Granular. Matches real security models (IAM roles). Prevents gaming via easy tasks -- More complex data model. Trust state is a matrix per agent, not a scalar - -#### Strategy: Milestone Gates (ATF-Inspired) - -Explicit capability milestones aligned with the Cloud Security Alliance Agentic Trust Framework. Automated promotion for low-risk levels. Human approval gates for elevated access. Trust is time-bound and subject to periodic re-verification — trust decays if the agent is idle for extended periods or error rate increases. - -```yaml -trust: - strategy: "milestone" - initial_level: "sandboxed" - milestones: - sandboxed_to_restricted: - tasks_completed: 5 - quality_score_min: 7.0 - auto_promote: true # no human needed - restricted_to_standard: - tasks_completed: 20 - quality_score_min: 8.0 - time_active_days: 7 - auto_promote: true - standard_to_elevated: - requires_human_approval: true # always human-gated - clean_history_days: 14 # no errors in last 14 days - re_verification: - enabled: true - interval_days: 90 # re-verify every 90 days - decay_on_idle_days: 30 # demote one level if idle 30+ days - decay_on_error_rate: 0.15 # demote if error rate exceeds 15% -``` - -- Industry-aligned. Re-verification prevents stale trust. Human gates where it matters -- Most complex. Trust decay may need tuning to avoid frustrating users - ---- - -## 12. Security & Approval System - -### 12.1 Approval Workflow - -```text - ┌──────────────┐ - │ Task/Action │ - └──────┬───────┘ - │ - ┌──────▼───────┐ - │ Security Ops │ - │ Agent │ - └──────┬───────┘ - ╱ ╲ - ┌─────▼─┐ ┌───▼────┐ - │APPROVE │ │ DENY │ - │(auto) │ │+ reason│ - └────┬───┘ └───┬────┘ - │ │ - Execute ┌───▼────────┐ - │ Human Queue │ - │ (Dashboard) │ - └───┬────────┘ - ╱ ╲ - ┌─────▼─┐ ┌───▼──────┐ - │Override│ │Alternative│ - │Approve │ │Suggested │ - └────────┘ └──────────┘ -``` - -### 12.2 Autonomy Levels - -> **Planned minimal config (not yet implemented — current schema uses a float):** -> -> ```yaml -> autonomy: -> level: "semi" -> ``` -> -> All presets below are built-in. Most users only set the level. - -```yaml -autonomy: - level: "semi" # full, semi, supervised, locked - presets: - full: - description: "Agents work independently. Human notified of results only." - auto_approve: ["all"] - human_approval: [] - - semi: - description: "Most work is autonomous. Major decisions need approval." - auto_approve: ["code", "test", "docs", "comms:internal"] - human_approval: ["deploy", "comms:external", "budget:exceed", "org:hire"] - security_agent: true - - supervised: - description: "Human approves major steps. Agents handle details." - auto_approve: ["code:write", "comms:internal"] - human_approval: ["arch", "code:create", "deploy", "vcs:push"] - security_agent: true - - locked: - description: "Human must approve every action." - auto_approve: [] - human_approval: ["all"] - security_agent: true # still runs for audit logging, but human is approval authority -``` - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D6, D7):** -> -> - **D6 — Autonomy Scope:** Three-level resolution chain: per-agent → per-department → company default. Optional `autonomy_level` on `AgentIdentity` and department config. Resolution: `agent.autonomy_level or department.autonomy_level or company.autonomy.level`. Seniority validation: Juniors/Interns cannot be set to `full`. -> - **D7 — Autonomy Changes at Runtime:** Pluggable `AutonomyChangeStrategy` protocol. Initial: **(a+c hybrid)** — human-only promotion via REST API (no agent including CEO can escalate privileges) **plus** automatic downgrade on: high error rate → one level down, budget exhausted → supervised, security incident → locked. Recovery from auto-downgrade: human-only. Precedent: no real-world security system automatically grants higher privileges. Future strategies: fully configurable conditions. - -### 12.3 Security Operations Agent - -A special meta-agent that reviews all actions before execution: - -- Evaluates safety of proposed actions -- Checks for data leaks, credential exposure, destructive operations -- Validates actions against company policies -- Maintains an audit log of all approvals/denials -- Escalates uncertain cases to human queue with explanation -- **Cannot be overridden by other agents** (only human can override) - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D4, D5):** -> -> - **D4 — LLM vs Rule-based:** Hybrid approach. Rule engine for known patterns (credentials, path traversal, destructive ops) — sub-ms, covers ~95% of cases. LLM fallback only for uncertain cases (~5%). Full autonomy mode: rules + audit logging only, no LLM path. Hard safety rules (credential exposure, data destruction) **never bypass** regardless of autonomy level. Precedent: AWS GuardDuty, LlamaFirewall, NeMo Guardrails all use hybrid. -> - **D5 — Integration Point:** Pluggable `SecurityInterceptionStrategy` protocol. Initial: before every tool invocation — slots into existing `ToolInvoker` between permission check and tool execution. Policy strictness (not interception point) configurable per autonomy level. Add post-tool-call scanning for sensitive data in outputs. Performance: sub-ms rule check is invisible against seconds of LLM inference. Future strategies: batch-level (before task step), assignment-only. - -#### Output Scan Response Policies - -After the output scanner detects sensitive data, a pluggable **`OutputScanResponsePolicy`** protocol decides how to handle the findings. Four built-in policies ship behind the protocol: - -| Policy | Behavior | Default for | -|--------|----------|-------------| -| **Redact** (default) | Return scanner's redacted content as-is | `SEMI`, `SUPERVISED` autonomy | -| **Withhold** | Clear redacted content — fail-closed, no partial data returned | `LOCKED` autonomy | -| **Log-only** | Discard findings (logs at WARNING), pass original output through | `FULL` autonomy | -| **Autonomy-tiered** | Delegate to a sub-policy based on effective autonomy level | Composite policy | - -Policy selection is declarative via `SecurityConfig.output_scan_policy_type` (`OutputScanPolicyType` enum). A factory function (`build_output_scan_policy`) resolves the enum to a concrete policy instance. Runtime constructor injection on `SecOpsService` is also supported for full flexibility. The policy is applied *after* audit recording, preserving audit fidelity regardless of policy outcome. - -### 12.4 Approval Timeout Policy - -When an action requires human approval (per autonomy level in §12.2), the agent must wait. The framework provides configurable timeout policies that determine what happens when a human doesn't respond. All policies implement a `TimeoutPolicy` protocol. The policy is configurable per autonomy level and per action risk tier. - -> **Current state:** All four timeout policies are implemented: `WaitForeverPolicy`, `AutoDenyPolicy`, `TieredPolicy`, `EscalationChainPolicy`. Park/resume service, risk tier classifier, and timeout checker are complete. - -During any wait — regardless of policy — the agent **parks** the blocked task (saving its full serialized `AgentContext` state: conversation, progress, accumulated cost, turn count — i.e., the complete persisted context, distinct from the compact `AgentContextSnapshot` used for telemetry) and picks up other available tasks from its queue. When approval eventually arrives, the agent **resumes** the original context exactly where it left off. This mirrors real company behavior: a junior developer starts another task while waiting for a code review, then returns to the original work when feedback arrives. - -#### Policy 1: Wait Forever (Default for Critical Actions) - -The action stays in the human queue indefinitely. No timeout, no auto-resolution. The agent is aware the task is parked awaiting approval and works on other tasks in the meantime. - -```yaml -approval_timeout: - policy: "wait" # wait, deny, tiered, escalation -``` - -- Safest — no risk of unauthorized actions. Mirrors "awaiting review" in real workflows -- Can stall tasks indefinitely if human is unavailable. Queue can grow unbounded - -#### Policy 2: Deny on Timeout - -All unapproved actions auto-deny after a configurable timeout. The agent receives a denial reason ("approval timeout — human did not respond within window") and can retry with a different approach or escalate explicitly. - -```yaml -approval_timeout: - policy: "deny" - timeout_minutes: 240 # 4 hours -``` - -- Industry consensus default ("fail closed"). Agent learns to prefer auto-approvable paths -- May stall legitimate work if human is consistently slow - -#### Policy 3: Tiered Timeout - -Different timeout behavior based on action risk level. Low-risk actions auto-approve after a short wait. Medium-risk actions auto-deny. High-risk/security-critical actions wait forever. - -```yaml -approval_timeout: - policy: "tiered" - tiers: - low_risk: - timeout_minutes: 60 - on_timeout: "approve" # auto-approve low-risk after 1 hour - actions: ["code:write", "comms:internal", "test"] - medium_risk: - timeout_minutes: 240 - on_timeout: "deny" # auto-deny medium-risk after 4 hours - actions: ["code:create", "vcs:push", "arch:decide"] - high_risk: - timeout_minutes: null # wait forever - on_timeout: "wait" - actions: ["deploy", "db:admin", "comms:external", "org:hire"] -``` - -- Pragmatic — low-risk stuff doesn't stall, critical stuff stays safe -- Auto-approve on timeout carries risk. Tuning tier boundaries requires experience - -#### Policy 4: Escalation Chain - -On timeout, the approval request escalates to the next human in a configured chain (e.g., primary reviewer → manager → VP → board). If the entire chain times out, the action is denied. - -```yaml -approval_timeout: - policy: "escalation" - chain: - - role: "direct_manager" - timeout_minutes: 120 - - role: "department_head" - timeout_minutes: 240 - - role: "ceo_or_board" - timeout_minutes: 480 - on_chain_exhausted: "deny" # deny if entire chain times out -``` - -- Mirrors real orgs — if your boss is out, their boss covers. Multiple chances for approval -- Requires configuring an escalation chain. More humans involved. Complex to implement - -> **Task Suspension and Resumption:** The park/resume mechanism relies on `AgentContext` snapshots (frozen Pydantic models). When a task is parked, the full context is persisted. When approval arrives, the framework loads the snapshot, restores the agent's conversation and state, and resumes execution from the exact point of suspension. This works naturally with the `model_copy(update=...)` immutability pattern — the snapshot is a complete, self-contained state. - -> **Decisions ([ADR-002](decisions/ADR-002-design-decisions-batch-1.md) D19, D20, D21):** -> -> - **D19 — Risk Tier Classification:** Pluggable `RiskTierClassifier` protocol. Initial: configurable YAML mapping — `RiskTierMapping` config model with `dict[str, ApprovalRiskLevel]`. Sensible defaults matching examples above (e.g. `code:write` → low, `deploy:production` → critical). Unknown action types default to HIGH (fail-safe). Hot-reloadable. Leaves door open for future SecOps override. Future strategies: SecOps-assigned, fixed-per-type. -> - **D20 — Context Serialization:** Pydantic JSON via persistence backend. `ParkedContext` model with metadata columns (`execution_id`, `agent_id`, `task_id`, `parked_at`) + `context_json` blob. `ParkedContextRepository` protocol via existing `PersistenceBackend` (§7.6). Conversation stored **verbatim** — summarization is a context window management concern at resume time, not a persistence concern. -> - **D21 — Resume Injection:** Tool result injection. Approval requests modeled as tool calls (`request_human_approval`). Approval decision returned as `ToolResult` — semantically correct (approval IS the tool's return value). LLM conversation protocol requires a tool result after a tool call. Fallback: system message injection for engine-initiated parking (exception path). - ---- - -## 13. Human Interaction Layer - -### 13.1 Architecture: API-First - -The REST/WebSocket API is the **primary interface** for all consumers. The Web UI and any future CLI tool are thin clients that call the API — they contain no business logic. - -```text -┌─────────────────────────────────────────────┐ -│ SynthOrg Engine │ -│ (Core Logic, Agent Orchestration, Tasks) │ -└──────────────────┬──────────────────────────┘ - │ - ┌────────▼────────┐ - │ REST/WS API │ ← primary interface - │ (Litestar) │ - └───┬─────────┬───┘ - │ │ - ┌───────▼──┐ ┌───▼────────┐ - │ Web UI │ │ CLI Tool │ - │ (Future) │ │ (Future) │ - └──────────┘ └────────────┘ -``` - -> **CLI Tool (Future):** If needed, a thin CLI utility wrapping the REST API with terminal formatting (Typer + Rich or similar). Not a priority — the API is fully self-sufficient. To be determined whether a dedicated CLI is warranted or whether `curl`/`httpie` and the interactive Scalar docs at `/docs/api` suffice. - -### 13.2 API Surface - -```text -/api/v1/ - ├── /health # Health check, readiness - ├── /auth # Authentication: setup, login, password change, me - ├── /company # CRUD company config - ├── /agents # List, hire, fire, modify agents - ├── /departments # Department management - ├── /projects # Project CRUD - ├── /tasks # Task management - ├── /messages # Communication log - ├── /meetings # Schedule, view meeting outputs - ├── /artifacts # Browse produced artifacts (code, docs, etc.) - ├── /budget # Spending, limits, projections - ├── /approvals # Pending human approvals queue - ├── /analytics # Performance metrics, dashboards - ├── /providers # Model provider status, config - └── /ws # WebSocket for real-time updates -``` - -### 13.3 Web UI Features - -- **Dashboard**: Real-time company overview, active tasks, spending -- **Org Chart**: Visual hierarchy, click to inspect any agent -- **Task Board**: Kanban/list view of all tasks across projects -- **Message Feed**: Real-time feed of agent communications -- **Approval Queue**: Pending approvals with context and recommendations -- **Agent Profiles**: Detailed view of each agent's identity, history, metrics -- **Budget Panel**: Spending charts, projections, alerts -- **Meeting Logs**: Transcripts and outcomes of all agent meetings -- **Artifact Browser**: Browse and inspect all produced work -- **Settings**: Company config, autonomy levels, provider settings - -### 13.4 Human Roles - -The human can interact as: - -| Role | Access | Description | -|------|--------|-------------| -| **Board Member** | Observe + major approvals only | Minimal involvement, strategic oversight | -| **CEO** | Full authority, replaces CEO agent | Human IS the CEO, agents are the team | -| **Manager** | Department-level authority | Manages one team/department directly | -| **Observer** | Read-only | Watch the company operate, no intervention | -| **Pair Programmer** | Direct collaboration with one agent | Work alongside a specific agent in real-time | - ---- - -## 14. Templates & Builder - -### 14.1 Template System - -Templates are YAML/JSON files defining a complete company setup: - -```yaml -# templates/startup.yaml (simplified — real templates also declare -# variables, departments, min_agents/max_agents, and tags) -template: - name: "Tech Startup" - description: "Small team for building MVPs and prototypes" - version: "1.0" - - company: - type: "startup" - budget_monthly: "{{ budget | default(50.00) }}" - autonomy: 0.5 - - agents: - - role: "CEO" - name: "{{ ceo_name | auto }}" - model: "large" - personality_preset: "visionary_leader" - - - role: "Full-Stack Developer" - merge_id: "fullstack-senior" - name: "{{ dev1_name | auto }}" - level: "senior" - model: "medium" - personality_preset: "pragmatic_builder" - - - role: "Full-Stack Developer" - merge_id: "fullstack-mid" - name: "{{ dev2_name | auto }}" - level: "mid" - model: "small" - personality_preset: "eager_learner" - - - role: "Product Manager" - name: "{{ pm_name | auto }}" - model: "medium" - personality_preset: "strategic_planner" - - workflow: "agile_kanban" - communication: "hybrid" - - workflow_handoffs: - - from_department: "engineering" - to_department: "qa" - trigger: "pr_ready" - - escalation_paths: - - from_department: "engineering" - to_department: "security" - condition: "vulnerability_found" -``` - -**Template Inheritance** — Templates can extend other templates using `extends`: - -```yaml -template: - name: "Extended Startup" - extends: "startup" # inherits all agents, departments, config - agents: - - role: "QA Engineer" # appended to parent agents - level: "mid" - - role: "Full-Stack Developer" - merge_id: "fullstack-mid" - department: "engineering" - _remove: true # removes matching parent agent by key -``` - -Inheritance resolves parent→child chains up to 10 levels deep. Merge semantics: -- **Scalars** (`company_name`, `company_type`): child wins if present. -- **`config`** dict: deep-merged (child keys override parent). -- **`agents`** list: merged by `(role, department, merge_id)` key. When `merge_id` is omitted, it defaults to an empty string, making the key `(role, department, "")`. Child can override, append, or remove (`_remove: true`) parent agents. -- **`departments`** list: merged by name (case-insensitive). Child dept replaces parent entirely. -- **`workflow_handoffs`**, **`escalation_paths`**: child replaces entirely if present. - -Circular inheritance is detected via chain tracking and raises `TemplateInheritanceError`. - -### 14.2 Company Builder (Future) - -> **Deferred.** The template system (§14.1) already supports creating companies from YAML configs. An interactive wizard is a nice-to-have after the REST API exists — it could be a thin CLI utility or a web form that POSTs to `/api/v1/company`. To be determined. - -### 14.3 Community Marketplace (Future) - -- Share company templates -- Share custom role definitions -- Share workflow configurations -- Rating and review system -- Import/export in standard format - ---- - -## 15. Technical Architecture - -### 15.1 High-Level Architecture - -```text -┌──────────────────────────────────────────────────────────────┐ -│ SynthOrg Engine │ -│ │ -│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ -│ │ Company Mgr │ │ Agent Engine │ │ Task/Workflow Eng. │ │ -│ │ (Config, │ │ (Lifecycle, │ │ (Queue, Routing, │ │ -│ │ Templates, │ │ Personality, │ │ Dependencies, │ │ -│ │ Hierarchy) │ │ Execution) │ │ Scheduling) │ │ -│ └──────────────┘ └──────────────┘ └────────────────────┘ │ -│ │ -│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ -│ │ Comms Layer │ │ Memory Layer │ │ Tool/Capability │ │ -│ │ (Message Bus,│ │ (Pluggable, │ │ System (MCP, │ │ -│ │ Meetings, │ │ Retrieval, │ │ Sandboxing, │ │ -│ │ A2A) │ │ Archive) │ │ Permissions) │ │ -│ └──────────────┘ └──────────────┘ └────────────────────┘ │ -│ │ -│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │ -│ │ Provider Lyr │ │ Budget/Cost │ │ Security/Approval │ │ -│ │ (Unified, │ │ Engine │ │ System │ │ -│ │ Routing, │ │ (Tracking, │ │ (SecOps Agent, │ │ -│ │ Fallbacks) │ │ Limits, │ │ Audit Log, │ │ -│ │ │ │ CFO Agent) │ │ Human Queue) │ │ -│ └──────────────┘ └──────────────┘ └────────────────────┘ │ -│ │ -│ ┌────────────────────────────────────────────────────────┐ │ -│ │ API Layer (Async Framework + WebSocket) │ │ -│ └────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌──────────────────────┐ ┌─────────────────────────────┐ │ -│ │ Web UI (Local) │ │ CLI Tool │ │ -│ │ Web Dashboard │ │ synthorg │ │ -│ └──────────────────────┘ └─────────────────────────────┘ │ -└──────────────────────────────────────────────────────────────┘ -``` - -### 15.2 Technology Stack - -| Component | Technology | Rationale | -|-----------|-----------|-----------| -| **Language** | Python 3.14+ | Best AI/ML ecosystem, all major frameworks use it, LiteLLM/MCP and memory layer candidates all Python-native. PEP 649 native lazy annotations, PEP 758 except syntax. | -| **API Framework** | Litestar | Async-native, built-in channels (pub/sub WebSocket), auto OpenAPI 3.1 docs, class-based controllers, native route guards, built-in rate limiting / CSRF / compression middleware, explicit DI, Pydantic v2 support via plugin. Chosen over FastAPI — see §15.4 | -| **LLM Abstraction** | LiteLLM | 100+ providers, unified API, built-in cost tracking, retries/fallbacks | -| **Agent Memory** | Mem0 (Qdrant + SQLite) → custom (Neo4j + Qdrant) | Mem0 in-process as initial backend behind pluggable `MemoryBackend` protocol ([ADR-001](decisions/ADR-001-memory-layer.md)). Qdrant embedded + SQLite for persistence. Custom stack (Neo4j + Qdrant external) as future upgrade. Config-driven backend selection | -| **Message Bus** | Internal (async queues) → Redis | Start with Python asyncio queues, upgrade to Redis for multi-process/distributed | -| **Task Queue** | Internal → Celery/Redis | Start simple, scale with Celery when needed | -| **Database** | SQLite (aiosqlite) → PostgreSQL / MariaDB | Pluggable `PersistenceBackend` protocol (§7.6). SQLite ships first via aiosqlite async driver. PostgreSQL, MariaDB as future backends — swap via config, no app code changes | -| **Web UI** | Vue 3 + Vite | Modern, fast, good ecosystem. Simpler than React for dashboards | -| **Real-time** | WebSocket (Litestar channels plugin) | Built-in pub/sub broadcasting, per-channel history, backpressure management. Real-time agent activity, task updates, chat feed | -| **Containerization** | Docker + Docker Compose | Production container packaging: Chainguard Python distroless runtime (non-root UID 65532, CIS Docker Benchmark v1.6.0 hardened, minimal attack surface, continuously scanned in CI), `nginxinc/nginx-unprivileged` web tier, GHCR registry, cosign image signing, Trivy + Grype vulnerability scanning, SBOM + SLSA provenance. Also used for isolated code execution sandboxing | -| **Docker API** | aiodocker | Async-native Docker API client for `DockerSandbox` backend | -| **Tool Integration** | MCP SDK (`mcp`) | Industry standard for LLM-to-tool integration | -| **Agent Comms** | A2A Protocol compatible | Future-proof inter-agent communication | -| **Authentication** | PyJWT + argon2-cffi | JWT (HMAC HS256/384/512) for session tokens, Argon2id for password hashing, HMAC-SHA256 for API key storage (keyed with server secret) | -| **Config Format** | YAML + Pydantic validation | Human-readable config with strict validation | -| **CLI** | TBD (future, if needed) | Thin wrapper around the REST API for terminal use. May not be needed — interactive Scalar docs at `/docs/api` and `curl`/`httpie` may suffice | - -### 15.3 Project Structure - -Files marked with `(planned)` do not exist yet — only stub `__init__.py` files are present. All other files listed below exist in the codebase. - -```text -synthorg/ -├── src/ -│ └── ai_company/ -│ ├── __init__.py -│ ├── constants.py # Top-level constants -│ ├── py.typed # PEP 561 type marker -│ ├── config/ # Configuration loading & validation -│ │ ├── schema.py # Pydantic models for all config -│ │ ├── loader.py # YAML/JSON config loader -│ │ ├── defaults.py # Default configurations -│ │ ├── errors.py # Config error classes -│ │ └── utils.py # Config utilities -│ ├── core/ # Core domain models -│ │ ├── agent.py # AgentIdentity (frozen) -│ │ ├── types.py # Shared validated types (NotBlankStr, etc.) -│ │ ├── company.py # Company structure -│ │ ├── approval.py # ApprovalItem domain model (approval queue) -│ │ ├── enums.py # Core enumerations -│ │ ├── task.py # Task model & state machine -│ │ ├── task_transitions.py # Task state transitions -│ │ ├── project.py # Project management -│ │ ├── artifact.py # Produced work items -│ │ ├── role.py # Role model -│ │ ├── role_catalog.py # Role catalog -│ │ ├── personality.py # Personality compatibility scoring -│ │ └── resilience_config.py # RetryConfig, RateLimiterConfig (shared by config.schema + providers.resilience) -│ ├── engine/ # Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, task lifecycle, recovery, shutdown, workspace isolation, coordination error classification, and prompt policy validation -│ │ ├── errors.py # Engine error hierarchy -│ │ ├── prompt.py # System prompt builder -│ │ ├── prompt_template.py # System prompt Jinja2 templates -│ │ ├── task_execution.py # TaskExecution + StatusTransition -│ │ ├── context.py # AgentContext + AgentContextSnapshot -│ │ ├── loop_protocol.py # ExecutionLoop protocol + result models -│ │ ├── metrics.py # TaskCompletionMetrics proxy overhead model -│ │ ├── policy_validation.py # Org policy quality heuristics (non-inferable principle) -│ │ ├── react_loop.py # ReAct loop implementation -│ │ ├── plan_models.py # Plan step, plan, and plan-execute config models -│ │ ├── plan_execute_loop.py # Plan-and-Execute loop implementation -│ │ ├── plan_parsing.py # Plan extraction from LLM responses (JSON + text fallback) -│ │ ├── loop_helpers.py # Shared stateless helpers for all loop implementations -│ │ ├── recovery.py # Crash recovery strategies (RecoveryStrategy protocol) -│ │ ├── cost_recording.py # Per-turn cost recording helpers -│ │ ├── run_result.py # AgentRunResult outcome model -│ │ ├── _validation.py # Input validation helpers for AgentEngine -│ │ ├── agent_engine.py # Agent execution engine -│ │ ├── parallel.py # Parallel agent executor (TaskGroup + Semaphore) -│ │ ├── parallel_models.py # AgentAssignment, ParallelExecutionGroup, AgentOutcome, ParallelExecutionResult, ParallelProgress -│ │ ├── resource_lock.py # ResourceLock protocol + InMemoryResourceLock -│ │ ├── shutdown.py # Graceful shutdown strategy & manager -│ │ ├── classification/ # Coordination error taxonomy classification (§10.5) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # ErrorSeverity, ErrorFinding, ClassificationResult -│ │ │ ├── detectors.py # Per-category detection heuristics -│ │ │ └── pipeline.py # classify_execution_errors orchestrator -│ │ ├── assignment/ # Task assignment subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # AssignmentRequest, AssignmentResult, AssignmentCandidate, AgentWorkload -│ │ │ ├── protocol.py # TaskAssignmentStrategy protocol -│ │ │ ├── service.py # TaskAssignmentService (orchestrates strategy + validation) -│ │ │ ├── registry.py # STRATEGY_MAP + build_strategy_map factory -│ │ │ └── strategies.py # All 6 strategy implementations -│ │ ├── decomposition/ # Task decomposition subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── classifier.py # TaskStructureClassifier (sequential/parallel/mixed) -│ │ │ ├── dag.py # DependencyGraph (validation, topo sort, parallel groups) -│ │ │ ├── llm.py # LlmDecompositionStrategy (LLM-based decomposition with tool calling) -│ │ │ ├── llm_prompt.py # Prompt building and response parsing for LLM decomposition -│ │ │ ├── manual.py # ManualDecompositionStrategy -│ │ │ ├── models.py # SubtaskDefinition, DecompositionPlan, DecompositionResult, SubtaskStatusRollup, DecompositionContext -│ │ │ ├── protocol.py # DecompositionStrategy protocol -│ │ │ ├── rollup.py # StatusRollup (compute subtask status aggregation) -│ │ │ └── service.py # DecompositionService (orchestrates strategy + classifier + DAG) -│ │ ├── workspace/ # Workspace isolation subsystem (§6.8) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── config.py # PlannerWorktreesConfig, WorkspaceIsolationConfig -│ │ │ ├── git_worktree.py # PlannerWorktreeStrategy (git worktree backend) -│ │ │ ├── merge.py # MergeOrchestrator (sequential merge with conflict escalation) -│ │ │ ├── models.py # Workspace, WorkspaceRequest, MergeResult, MergeConflict, WorkspaceGroupResult -│ │ │ ├── protocol.py # WorkspaceIsolationStrategy protocol -│ │ │ └── service.py # WorkspaceIsolationService (lifecycle orchestrator) -│ │ ├── routing/ # Task routing subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # RoutingCandidate, RoutingDecision, RoutingResult, AutoTopologyConfig -│ │ │ ├── scorer.py # AgentTaskScorer (skill/role/seniority matching) -│ │ │ ├── service.py # TaskRoutingService (routes subtasks to agents) -│ │ │ └── topology_selector.py # TopologySelector (auto coordination topology) -│ ├── hr/ # HR engine: hiring, firing, onboarding, offboarding, agent registry, performance tracking -│ │ ├── __init__.py # Package exports -│ │ ├── enums.py # HR enumerations (HiringRequestStatus, FiringReason, OnboardingStep, LifecycleEventType, TrendDirection, PromotionDirection) -│ │ ├── errors.py # HR error hierarchy -│ │ ├── models.py # CandidateCard, HiringRequest, FiringRequest, OnboardingChecklist, OffboardingRecord, AgentLifecycleEvent -│ │ ├── registry.py # AgentRegistryService (agent lifecycle registry) -│ │ ├── hiring_service.py # HiringService (request → generate candidate → approval → instantiate) -│ │ ├── onboarding_service.py # OnboardingService (checklist management) -│ │ ├── offboarding_service.py # OffboardingService (reassign → archive → notify → terminate) -│ │ ├── archival_protocol.py # MemoryArchivalStrategy protocol -│ │ ├── full_snapshot_strategy.py # FullSnapshotArchivalStrategy -│ │ ├── reassignment_protocol.py # TaskReassignmentStrategy protocol -│ │ ├── queue_return_strategy.py # QueueReturnReassignmentStrategy -│ │ ├── persistence_protocol.py # HR-specific repository protocols -│ │ └── performance/ # Performance tracking subsystem -│ │ ├── __init__.py # Package exports -│ │ ├── models.py # TaskMetricRecord, CollaborationMetricRecord, WindowMetrics, TrendResult, etc. -│ │ ├── config.py # PerformanceConfig -│ │ ├── tracker.py # PerformanceTracker service -│ │ ├── quality_protocol.py # QualityScorer protocol -│ │ ├── ci_quality_strategy.py # CiQualityScorer (CI-based quality scoring) -│ │ ├── collaboration_protocol.py # CollaborationScorer protocol -│ │ ├── behavioral_collaboration_strategy.py # BehavioralCollaborationScorer -│ │ ├── trend_protocol.py # TrendDetector protocol -│ │ ├── theil_sen_strategy.py # TheilSenTrendDetector (robust trend detection) -│ │ ├── window_protocol.py # WindowAggregator protocol -│ │ └── multi_window_strategy.py # MultiWindowAggregator (multi-window rolling metrics) -│ │ └── promotion/ # Promotion/demotion subsystem (D14) -│ │ ├── config.py # PromotionConfig, PromotionCriteriaConfig, PromotionApprovalConfig, ModelMappingConfig -│ │ ├── models.py # CriterionResult, PromotionEvaluation, PromotionApprovalDecision, PromotionRecord, PromotionRequest -│ │ ├── criteria_protocol.py # PromotionCriteriaStrategy protocol -│ │ ├── approval_protocol.py # PromotionApprovalStrategy protocol -│ │ ├── model_mapping_protocol.py # ModelMappingStrategy protocol -│ │ ├── threshold_evaluator.py # ThresholdEvaluator (criteria evaluation) -│ │ ├── seniority_approval_strategy.py # SeniorityApprovalStrategy (approval decisions) -│ │ ├── seniority_model_mapping.py # SeniorityModelMapping (model resolution) -│ │ └── service.py # PromotionService orchestrator (evaluate, request, apply) -│ ├── communication/ # Inter-agent communication -│ │ ├── bus_memory.py # InMemoryMessageBus implementation -│ │ ├── bus_protocol.py # MessageBus protocol interface -│ │ ├── channel.py # Channel model -│ │ ├── config.py # Communication config -│ │ ├── conflict_resolution/ # Conflict resolution subsystem (§5.6) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _helpers.py # Shared utility (find_losers, pick_highest_seniority) -│ │ │ ├── authority_strategy.py # AuthorityResolver (Strategy 1) -│ │ │ ├── config.py # ConflictResolutionConfig, DebateConfig, HybridConfig -│ │ │ ├── debate_strategy.py # DebateResolver (Strategy 2) -│ │ │ ├── human_strategy.py # HumanEscalationResolver (Strategy 3) -│ │ │ ├── hybrid_strategy.py # HybridResolver (Strategy 4) -│ │ │ ├── models.py # Conflict, ConflictPosition, ConflictResolution, DissentRecord -│ │ │ ├── protocol.py # ConflictResolver, JudgeEvaluator protocols -│ │ │ └── service.py # ConflictResolutionService (orchestrator) -│ │ ├── delegation/ # Hierarchical delegation subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── authority.py # AuthorityValidator + AuthorityCheckResult -│ │ │ ├── hierarchy.py # HierarchyResolver (org hierarchy from Company) -│ │ │ ├── models.py # DelegationRequest, DelegationResult, DelegationRecord -│ │ │ └── service.py # DelegationService (orchestrates delegation flow) -│ │ ├── dispatcher.py # MessageDispatcher + DispatchResult -│ │ ├── enums.py # Communication enums -│ │ ├── errors.py # Communication + delegation error hierarchy -│ │ ├── handler.py # MessageHandler protocol, FunctionHandler, HandlerRegistration -│ │ ├── loop_prevention/ # Delegation loop prevention mechanisms -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _pair_key.py # Canonical agent-pair key utility -│ │ │ ├── ancestry.py # Ancestry cycle detection (pure function) -│ │ │ ├── circuit_breaker.py # DelegationCircuitBreaker, CircuitBreakerState -│ │ │ ├── dedup.py # DelegationDeduplicator (time-windowed) -│ │ │ ├── depth.py # Max delegation depth check (pure function) -│ │ │ ├── guard.py # DelegationGuard (orchestrates all mechanisms) -│ │ │ ├── models.py # GuardCheckOutcome -│ │ │ └── rate_limit.py # DelegationRateLimiter (per-pair) -│ │ ├── message.py # Message model -│ │ ├── meeting/ # Meeting protocol subsystem -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _parsing.py # Shared helpers for parsing decisions and action items -│ │ │ ├── _prompts.py # LLM prompt templates for meeting phases -│ │ │ ├── _token_tracker.py # TokenTracker for duration_tokens enforcement -│ │ │ ├── config.py # MeetingProtocolConfig, protocol-specific config models -│ │ │ ├── enums.py # MeetingProtocolType, MeetingPhase enums -│ │ │ ├── errors.py # Meeting error hierarchy -│ │ │ ├── models.py # MeetingRecord, MeetingAgendaItem, ActionItem, etc. -│ │ │ ├── orchestrator.py # MeetingOrchestrator (runs meetings end-to-end) -│ │ │ ├── position_papers.py # PositionPapersProtocol implementation -│ │ │ ├── protocol.py # MeetingProtocol protocol interface -│ │ │ ├── round_robin.py # RoundRobinProtocol implementation -│ │ │ └── structured_phases.py # StructuredPhasesProtocol implementation -│ │ ├── messenger.py # AgentMessenger per-agent facade -│ │ └── subscription.py # Subscription + DeliveryEnvelope models -│ ├── memory/ # Agent memory system — protocols, models, config, factory, retrieval pipeline (ranking, injection, context formatting, non-inferable filtering) -│ │ ├── __init__.py # Re-exports -│ │ ├── capabilities.py # MemoryCapabilities protocol -│ │ ├── config.py # CompanyMemoryConfig, MemoryStorageConfig, MemoryOptionsConfig -│ │ ├── errors.py # Memory error hierarchy (MemoryError and subclasses) -│ │ ├── factory.py # create_memory_backend() factory -│ │ ├── formatter.py # format_memory_context() — ranked memories to ChatMessage(s) -│ │ ├── injection.py # MemoryInjectionStrategy protocol, InjectionStrategy enum, TokenEstimator -│ │ ├── models.py # MemoryEntry, MemoryMetadata, MemoryQuery, MemoryStoreRequest -│ │ ├── protocol.py # MemoryBackend protocol -│ │ ├── ranking.py # ScoredMemory model, rank_memories(), scoring functions -│ │ ├── retrieval_config.py # MemoryRetrievalConfig (weights, thresholds, strategy selection) -│ │ ├── filter.py # MemoryFilterStrategy protocol, TagBasedMemoryFilter, PassthroughMemoryFilter -│ │ ├── retriever.py # ContextInjectionStrategy (full retrieval → rank → format pipeline) -│ │ ├── store_guard.py # Advisory non-inferable tag enforcement at store boundary -│ │ ├── shared.py # SharedKnowledgeStore protocol -│ │ ├── consolidation/ # Memory consolidation — strategies, retention, archival -│ │ │ ├── __init__.py -│ │ │ ├── archival.py # ArchivalStore protocol -│ │ │ ├── config.py # ConsolidationConfig, ArchivalConfig, RetentionConfig -│ │ │ ├── models.py # ConsolidationResult, ArchivalEntry, RetentionRule -│ │ │ ├── retention.py # RetentionEnforcer -│ │ │ ├── service.py # MemoryConsolidationService -│ │ │ ├── simple_strategy.py # SimpleConsolidationStrategy -│ │ │ └── strategy.py # ConsolidationStrategy protocol -│ │ └── org/ # Shared organizational memory (§7.4) -│ │ ├── __init__.py -│ │ ├── access_control.py # Write access control -│ │ ├── config.py # OrgMemoryConfig -│ │ ├── errors.py # OrgMemory error hierarchy -│ │ ├── factory.py # create_org_memory_backend() -│ │ ├── hybrid_backend.py # HybridPromptRetrievalBackend -│ │ ├── models.py # OrgFact, OrgFactAuthor, OrgMemoryQuery -│ │ ├── protocol.py # OrgMemoryBackend protocol -│ │ └── store.py # OrgFactStore protocol, SQLiteOrgFactStore -│ ├── persistence/ # Operational data persistence (§7.6) -│ │ ├── __init__.py # Package exports -│ │ ├── protocol.py # PersistenceBackend protocol -│ │ ├── repositories.py # Repository protocols: TaskRepository, CostRecordRepository, MessageRepository, ParkedContextRepository, AuditRepository, UserRepository, ApiKeyRepository -│ │ ├── config.py # PersistenceConfig model -│ │ ├── errors.py # Persistence error hierarchy -│ │ ├── factory.py # create_backend() factory -│ │ └── sqlite/ # SQLite backend (initial) -│ │ ├── __init__.py # Package exports -│ │ ├── backend.py # SQLitePersistenceBackend -│ │ ├── repositories.py # SQLite repository implementations -│ │ ├── hr_repositories.py # SQLite HR repositories (LifecycleEvent, TaskMetricRecord, CollaborationMetricRecord) -│ │ ├── parked_context_repo.py # SQLiteParkedContextRepository (park/resume serialized agent state) -│ │ ├── audit_repository.py # SQLiteAuditRepository (append-only audit entry persistence) -│ │ ├── user_repo.py # SQLiteUserRepository + SQLiteApiKeyRepository -│ │ └── migrations.py # Schema migrations (user_version pragma, v1–v5) -│ ├── observability/ # Structured logging & correlation -│ │ ├── __init__.py # get_logger() entry point -│ │ ├── _logger.py # Logger configuration -│ │ ├── config.py # Observability config -│ │ ├── correlation.py # Correlation ID tracking -│ │ ├── enums.py # Log-related enums -│ │ ├── events/ # Per-domain event constants -│ │ │ ├── __init__.py # Package marker with usage docs; no re-exports -│ │ │ ├── api.py # API_* event constants -│ │ │ ├── autonomy.py # AUTONOMY_* constants -│ │ │ ├── budget.py # BUDGET_* constants -│ │ │ ├── cfo.py # CFO_* constants -│ │ │ ├── classification.py # CLASSIFICATION_* constants -│ │ │ ├── consolidation.py # CONSOLIDATION_* and RETENTION_* constants -│ │ │ ├── company.py # COMPANY_* constants -│ │ │ ├── communication.py # COMM_* constants -│ │ │ ├── conflict.py # CONFLICT_* constants -│ │ │ ├── config.py # CONFIG_* constants -│ │ │ ├── delegation.py # DELEGATION_* constants -│ │ │ ├── correlation.py # CORRELATION_* constants -│ │ │ ├── decomposition.py # DECOMPOSITION_* constants -│ │ │ ├── execution.py # EXECUTION_* constants -│ │ │ ├── git.py # GIT_* constants -│ │ │ ├── hr.py # HR_* constants -│ │ │ ├── meeting.py # MEETING_* constants -│ │ │ ├── memory.py # MEMORY_* constants -│ │ │ ├── org_memory.py # ORG_MEMORY_* constants -│ │ │ ├── parallel.py # PARALLEL_* constants -│ │ │ ├── performance.py # PERF_* constants -│ │ │ ├── persistence.py # PERSISTENCE_* constants -│ │ │ ├── personality.py # PERSONALITY_* constants -│ │ │ ├── prompt.py # PROMPT_* constants -│ │ │ ├── quota.py # QUOTA_* event constants -│ │ │ ├── provider.py # PROVIDER_* constants -│ │ │ ├── role.py # ROLE_* constants -│ │ │ ├── routing.py # ROUTING_* constants -│ │ │ ├── sandbox.py # SANDBOX_* constants -│ │ │ ├── security.py # SECURITY_* constants -│ │ │ ├── task.py # TASK_* constants -│ │ │ ├── task_assignment.py # TASK_ASSIGNMENT_* constants -│ │ │ ├── task_routing.py # TASK_ROUTING_* constants -│ │ │ ├── template.py # TEMPLATE_* constants -│ │ │ ├── timeout.py # TIMEOUT_* constants -│ │ │ ├── tool.py # TOOL_* constants -│ │ │ ├── workspace.py # WORKSPACE_* constants -│ │ │ ├── code_runner.py # CODE_RUNNER_* constants -│ │ │ ├── docker.py # DOCKER_* constants -│ │ │ ├── mcp.py # MCP_* constants -│ │ │ ├── trust.py # Trust event constants -│ │ │ └── promotion.py # Promotion event constants -│ │ ├── processors.py # Log processors -│ │ ├── setup.py # Logging setup -│ │ └── sinks.py # Log output backends -│ ├── providers/ # LLM provider abstraction -│ │ ├── base.py # BaseCompletionProvider (retry + rate limiting) -│ │ ├── protocol.py # Provider protocol (abstract interface) -│ │ ├── models.py # CompletionConfig/Response, TokenUsage, ToolCall/Result -│ │ ├── capabilities.py # Provider capability registry -│ │ ├── registry.py # Provider registry -│ │ ├── enums.py # Provider enumerations -│ │ ├── errors.py # Provider error hierarchy -│ │ ├── drivers/ # Provider driver implementations -│ │ │ ├── litellm_driver.py # LiteLLM adapter -│ │ │ └── mappers.py # Request/response mappers -│ │ ├── routing/ # Model routing (5 strategies) -│ │ │ ├── _strategy_helpers.py # Shared routing helper functions -│ │ │ ├── errors.py # Routing errors -│ │ │ ├── models.py # Routing models (candidates, results) -│ │ │ ├── resolver.py # Model resolver -│ │ │ ├── router.py # Router orchestrator -│ │ │ └── strategies.py # Routing strategies -│ │ └── resilience/ # Resilience patterns -│ │ ├── errors.py # RetryExhaustedError -│ │ ├── rate_limiter.py # Token bucket rate limiter -│ │ └── retry.py # RetryHandler with backoff -│ ├── tools/ # Tool/capability system -│ │ ├── base.py # BaseTool ABC, ToolExecutionResult -│ │ ├── registry.py # Immutable tool registry (MappingProxyType) -│ │ ├── invoker.py # Tool invocation (concurrent via TaskGroup) -│ │ ├── permissions.py # ToolPermissionChecker (access-level gating) -│ │ ├── errors.py # Tool error hierarchy (incl. ToolPermissionDeniedError) -│ │ ├── examples/ # Example tool implementations -│ │ │ ├── __init__.py # Package exports -│ │ │ └── echo.py # Echo tool (for testing) -│ │ ├── file_system/ # Built-in file system tools -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── _base_fs_tool.py # BaseFileSystemTool ABC -│ │ │ ├── _path_validator.py # Workspace path validation -│ │ │ ├── delete_file.py # DeleteFileTool -│ │ │ ├── edit_file.py # EditFileTool -│ │ │ ├── list_directory.py # ListDirectoryTool -│ │ │ ├── read_file.py # ReadFileTool -│ │ │ └── write_file.py # WriteFileTool -│ │ ├── _git_base.py # Base class for git tools (workspace, subprocess, sandbox integration) -│ │ ├── _process_cleanup.py # Subprocess transport cleanup utility (Windows ResourceWarning prevention) -│ │ ├── git_tools.py # Git operations — 6 built-in tools (sandbox-aware) -│ │ ├── code_runner.py # Code execution tool -│ │ ├── web_tools.py # HTTP, search (planned) -│ │ ├── sandbox/ # Sandbox backends subpackage -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── config.py # Subprocess sandbox configuration -│ │ │ ├── docker_config.py # Docker sandbox configuration -│ │ │ ├── docker_sandbox.py # DockerSandbox backend (aiodocker) -│ │ │ ├── errors.py # Sandbox error hierarchy -│ │ │ ├── protocol.py # SandboxBackend protocol -│ │ │ ├── result.py # SandboxResult model -│ │ │ ├── sandboxing_config.py # Top-level sandboxing config -│ │ │ └── subprocess_sandbox.py # SubprocessSandbox backend -│ │ └── mcp/ # MCP bridge subpackage -│ │ ├── __init__.py # Package exports -│ │ ├── bridge_tool.py # MCPBridgeTool (BaseTool integration) -│ │ ├── cache.py # MCP result cache (TTL + LRU) -│ │ ├── client.py # MCP client wrapper -│ │ ├── config.py # MCP server/bridge config models -│ │ ├── errors.py # MCP error hierarchy -│ │ ├── factory.py # MCPToolFactory (parallel connect) -│ │ ├── models.py # MCP domain models -│ │ └── result_mapper.py # MCP result → ToolExecutionResult mapping -│ ├── security/ # Security & approval -│ │ ├── action_type_mapping.py # Default ToolCategory → ActionType mapping -│ │ ├── action_types.py # ActionTypeCategory registry and validation -│ │ ├── audit.py # Append-only AuditLog with configurable eviction -│ │ ├── config.py # SecurityConfig, SecurityPolicyRule, RuleEngineConfig, OutputScanPolicyType -│ │ ├── models.py # SecurityVerdict, SecurityContext, AuditEntry, OutputScanResult -│ │ ├── output_scan_policy.py # Output scan response policies (redact/withhold/log-only/autonomy-tiered) -│ │ ├── output_scan_policy_factory.py # build_output_scan_policy() factory -│ │ ├── output_scanner.py # Post-tool output scanning (regex-based redaction) -│ │ ├── protocol.py # SecurityInterceptionStrategy protocol -│ │ ├── service.py # SecOpsService — meta-agent coordinating security -│ │ ├── autonomy/ # Autonomy levels, presets, resolver, change strategy (§12.2) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── models.py # AutonomyLevel enum, AutonomyPreset, AutonomyConfig, AutonomyChangeEvent -│ │ │ ├── protocol.py # AutonomyChangeStrategy protocol -│ │ │ ├── change_strategy.py # Rule-based auto-downgrade + human-only promotion strategy -│ │ │ └── resolver.py # AutonomyResolver (agent → department → company chain) -│ │ ├── timeout/ # Approval timeout policies, park/resume, risk tier classifier (§12.4) -│ │ │ ├── __init__.py # Package exports -│ │ │ ├── config.py # TimeoutPolicyConfig -│ │ │ ├── factory.py # build_timeout_policy() factory -│ │ │ ├── models.py # TimeoutDecision, RiskTier -│ │ │ ├── park_service.py # ParkResumeService (park/resume blocked tasks) -│ │ │ ├── parked_context.py # ParkedContext model (serialized agent state) -│ │ │ ├── policies.py # WaitForeverPolicy, AutoDenyPolicy, TieredPolicy, EscalationChainPolicy -│ │ │ ├── protocol.py # TimeoutPolicy protocol -│ │ │ ├── risk_tier_classifier.py # RiskTierClassifier (ActionType → RiskTier) -│ │ │ └── timeout_checker.py # TimeoutChecker (polls pending approvals) -│ │ └── rules/ # Rule engine and detectors -│ │ ├── engine.py # RuleEngine (soft-allow + hard-deny, fail-closed) -│ │ ├── protocol.py # SecurityRule protocol -│ │ ├── policy_validator.py # Policy list validation rule (hard-deny/auto-approve) -│ │ ├── risk_classifier.py # RiskClassifier (ActionType → ApprovalRiskLevel) -│ │ ├── credential_detector.py # Credential/secret pattern detection (API keys, tokens) -│ │ ├── data_leak_detector.py # Data leak detection (PII, sensitive file paths) -│ │ ├── destructive_op_detector.py # Destructive operation detection (rm -rf, DROP TABLE) -│ │ ├── path_traversal_detector.py # Path traversal attack detection (../, null bytes) -│ │ └── _utils.py # walk_string_values utility (recursive argument scanning) -│ │ └── trust/ # Progressive trust subsystem (§11.3) -│ │ ├── config.py # TrustConfig, strategy-specific sub-configs -│ │ ├── enums.py # TrustStrategyType, TrustChangeReason -│ │ ├── errors.py # TrustEvaluationError -│ │ ├── levels.py # Shared trust level ordering and transition constants -│ │ ├── models.py # TrustState, TrustEvaluationResult, TrustChangeRecord -│ │ ├── protocol.py # TrustStrategy protocol -│ │ ├── service.py # TrustService orchestrator (state, evaluation, decay, approval) -│ │ ├── disabled_strategy.py # DisabledTrustStrategy (passthrough) -│ │ ├── weighted_strategy.py # WeightedTrustStrategy (weighted score → thresholds) -│ │ ├── per_category_strategy.py # PerCategoryTrustStrategy (per-tool-category tracks) -│ │ └── milestone_strategy.py # MilestoneTrustStrategy (milestone gates + decay) -│ ├── budget/ # Cost management -│ │ ├── _optimizer_helpers.py # CostOptimizer shared helper functions -│ │ ├── config.py # Budget configuration models -│ │ ├── cost_record.py # CostRecord model (frozen) -│ │ ├── cost_tiers.py # Cost tier definitions, classification, and built-in tiers -│ │ ├── call_category.py # LLM call category enums (productive, coordination, system) -│ │ ├── category_analytics.py # Per-category cost breakdown + orchestration ratio -│ │ ├── coordination_config.py # Coordination metrics config models -│ │ ├── coordination_metrics.py # Five coordination metric models + computation -│ │ ├── tracker.py # CostTracker service (records + queries) -│ │ ├── spending_summary.py # _SpendingTotals base + spending summary models -│ │ ├── hierarchy.py # BudgetHierarchy, BudgetConfig -│ │ ├── enums.py # Budget-related enums -│ │ ├── billing.py # Billing period computation utilities -│ │ ├── enforcer.py # BudgetEnforcer service (pre-flight, in-flight, auto-downgrade) -│ │ ├── errors.py # BudgetExhaustedError, DailyLimitExceededError, QuotaExhaustedError -│ │ ├── optimizer.py # CostOptimizer service — anomaly detection, efficiency analysis, downgrade recommendations, approval decisions -│ │ ├── optimizer_models.py # CostOptimizer domain models — anomaly, efficiency, downgrade, approval, config -│ │ ├── quota.py # Quota/subscription models, degradation config, quota snapshots -│ │ ├── quota_tracker.py # QuotaTracker service: per-provider request/token quota enforcement -│ │ └── reports.py # Spending reports -│ ├── api/ # REST + WebSocket API -│ │ ├── app.py # Litestar application factory, lifecycle hooks -│ │ ├── approval_store.py # In-memory approval queue storage -│ │ ├── auth/ # JWT + API key authentication subsystem -│ │ │ ├── config.py # AuthConfig (frozen Pydantic, JWT HMAC algorithm, exclude paths) -│ │ │ ├── controller.py # AuthController (setup, login, change-password, me) -│ │ │ ├── middleware.py # ApiAuthMiddleware (JWT-first, API key fallback) -│ │ │ ├── models.py # User, ApiKey, AuthenticatedUser, AuthMethod -│ │ │ ├── secret.py # JWT secret resolution (env var → persistence → auto-generate) -│ │ │ └── service.py # AuthService (Argon2id password hashing, JWT ops, HMAC-SHA256 API key hashing) -│ │ ├── bus_bridge.py # Message-bus → WebSocket bridge -│ │ ├── channels.py # WebSocket channel definitions -│ │ ├── config.py # API configuration models (ServerConfig, CorsConfig) -│ │ ├── controllers/ # 14 class-based controllers + 1 WebSocket handler (15 route modules) -│ │ ├── dto.py # Request/response DTOs and envelopes -│ │ ├── errors.py # API error hierarchy (ApiError, NotFoundError, UnauthorizedError, etc.) -│ │ ├── exception_handlers.py # Litestar exception handler registration -│ │ ├── guards.py # Route guards — role-based read/write access control (HumanRole enum) -│ │ ├── middleware.py # Request logging, CSP middleware -│ │ ├── pagination.py # Cursor-free offset/limit pagination -│ │ ├── server.py # Uvicorn server runner -│ │ ├── state.py # Typed AppState container with service access (deferred auth init) -│ │ └── ws_models.py # WebSocket event models (WsEvent, WsEventType) -│ ├── cli/ # CLI interface (future, if needed) -│ │ ├── __init__.py -│ │ └── commands/ -│ │ └── __init__.py -│ └── templates/ # Company templates -│ ├── schema.py # Template schema models -│ ├── loader.py # Template loader -│ ├── renderer.py # Template renderer -│ ├── merge.py # Template config merging for inheritance -│ ├── presets.py # Personality presets + auto-name generation -│ ├── errors.py # Template errors -│ └── builtins/ # Pre-built company templates -│ ├── agency.yaml -│ ├── dev_shop.yaml -│ ├── full_company.yaml -│ ├── product_team.yaml -│ ├── research_lab.yaml -│ ├── solo_founder.yaml -│ └── startup.yaml -├── tests/ -│ ├── unit/ -│ ├── integration/ -│ └── e2e/ -├── mkdocs.yml # MkDocs configuration -├── docs/ -│ ├── index.md # Documentation landing page -│ ├── getting_started.md -│ ├── overrides/ # MkDocs theme overrides -│ ├── architecture/ -│ │ ├── index.md # Architecture overview -│ │ └── decisions.md # ADR index -│ ├── api/ # Auto-generated API reference (mkdocstrings) -│ │ ├── index.md # API reference landing -│ │ ├── core.md, engine.md, providers.md, budget.md, ... -│ │ └── tools.md -│ └── decisions/ -│ ├── ADR-001-memory-layer.md -│ ├── ADR-002-design-decisions-batch-1.md -│ └── ADR-003-documentation-architecture.md -├── site/ # Astro landing page (synthorg.io root) -│ ├── astro.config.mjs -│ ├── package.json -│ ├── tsconfig.json -│ ├── public/ -│ │ └── favicon.svg -│ └── src/ -│ ├── layouts/Base.astro -│ └── pages/index.astro -├── docker/ -│ ├── backend/ -│ │ └── Dockerfile # 3-stage: python:3.14-slim → chainguard/python-dev → chainguard/python (distroless) -│ ├── sandbox/ -│ │ └── Dockerfile # Code execution sandbox (Python + Node.js, non-root) -│ ├── web/ -│ │ └── Dockerfile # nginxinc/nginx-unprivileged (non-root) -│ ├── compose.yml # CIS-hardened orchestration -│ ├── compose.override.yml # Local dev overrides (debug logging) -│ └── .env.example # Environment variable reference -├── web/ -│ ├── app.js # Dashboard JavaScript -│ ├── index.html # Placeholder dashboard with health check -│ ├── nginx.conf # SPA routing + API/WebSocket proxy -│ └── style.css # Dashboard styles -├── .github/ -│ ├── workflows/ -│ │ ├── ci.yml # Lint + type-check + test (parallel) -│ │ ├── docker.yml # Build → scan → push → sign (GHCR) -│ │ ├── dependency-review.yml # License allow-list on PRs -│ │ ├── release.yml # Release Please (automated versioning + GitHub Releases) -│ │ ├── secret-scan.yml # Gitleaks on push/PR + weekly -│ │ ├── pages.yml # Build Astro + MkDocs → deploy GitHub Pages -│ │ ├── pages-preview.yml # PR preview → Cloudflare Pages -│ │ └── zizmor.yml # Workflow security analysis (zizmor) -│ ├── actions/ -│ │ └── setup-python-uv/ # Composite action: Python + uv install -│ ├── dependabot.yml # uv + github-actions + docker updates -│ ├── CHANGELOG.md # Release changelog (managed by Release Please) -│ ├── CONTRIBUTING.md -│ ├── SECURITY.md -│ ├── .grype.yaml # Grype CVE ignore list (synced with .trivyignore.yaml) -│ └── .trivyignore.yaml # Trivy CVE ignore list (structured YAML format) -├── .dockerignore # Consolidated Docker build context exclusions -├── .gitleaks.toml # Gitleaks config (test file allowlist) -├── DESIGN_SPEC.md # This document -├── README.md -├── pyproject.toml -└── CLAUDE.md -``` - -### 15.4 Key Design Decisions (Preliminary - Subject to Research) - -| Decision | Choice | Alternatives Considered | Rationale | -|----------|--------|------------------------|-----------| -| Language | Python 3.14+ | TypeScript, Go, Rust | AI ecosystem, LiteLLM/MCP and memory layer candidates are Python-native, PEP 649 lazy annotations, PEP 758 except syntax | -| API | Litestar | FastAPI, Flask, Django, aiohttp | Built-in channels (pub/sub WebSocket), class-based controllers, native route guards, middleware (rate limiting, CSRF, compression), explicit DI. FastAPI considered but Litestar offers more batteries-included for less custom code — see rationale below | -| LLM Layer | LiteLLM | Direct APIs, OpenRouter only | 100+ providers, cost tracking, fallbacks, load balancing built-in | -| Memory | Mem0 (initial) → custom stack (future) + SQLite | Graphiti, Letta, Cognee, custom | Mem0 in-process as initial backend behind pluggable `MemoryBackend` protocol ([ADR-001](decisions/ADR-001-memory-layer.md)). Custom stack (Neo4j + Qdrant) as future upgrade. Must support episodic, semantic, procedural memory types (§7.1–7.3). Org memory served via `OrgMemoryBackend` protocol (§7.4) | -| Message Bus | asyncio queues → Redis | Kafka, RabbitMQ, NATS | Start simple, Redis well-supported, Kafka overkill for local | -| Config | YAML + Pydantic | JSON, TOML, Python dicts | Human-friendly, strict validation, good IDE support | -| CLI | Deferred (TBD) | Typer, Click, argparse | Thin API wrapper if needed. Scalar interactive docs at `/docs/api` + `curl`/`httpie` may suffice | -| Web UI | Vue 3 | React, Svelte, HTMX | Simpler than React for dashboards | -| Persistence | Pluggable protocol + repository protocols | ORM (SQLAlchemy), raw SQL, hybrid | Same frozen Pydantic models in and out (no DTOs), async throughout, backend-swappable via config. Repository protocols decouple app code from storage engine. See §7.6 | -| Sandboxing | Layered: subprocess + Docker | Docker-only, subprocess-only, WASM | Risk-proportionate: fast subprocess for file/git, Docker isolation for code execution. Pluggable `SandboxBackend` protocol enables K8s migration later | -| Container Packaging | Chainguard distroless + GHCR | Alpine, Debian-slim, scratch, Docker Hub | Chainguard Python distroless: no shell/package-manager (minimal attack surface), non-root by default, continuously scanned in CI. GHCR over Docker Hub: tighter GitHub integration, no rate limits for public images, native OIDC token auth. cosign keyless signing for supply-chain integrity. Trivy + Grype dual scanning for comprehensive CVE coverage | - -### 15.5 Engineering Conventions - -These conventions are used throughout the codebase. **Adopted** conventions are already in use. **Planned** conventions are approved design decisions not yet implemented. - -| Convention | Status | Decision | Rationale | -|------------|--------|----------|-----------| -| **Immutability strategy** | Adopted | `copy.deepcopy()` at construction + `MappingProxyType` wrapping for non-Pydantic internal collections (registries, `BaseTool`). For Pydantic frozen models: `frozen=True` prevents field reassignment; `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization) prevents nested mutation. No MappingProxyType inside Pydantic models (serialization friction). | Deep-copy at construction fully isolates nested structures; `MappingProxyType` enforces read-only access. Boundary-copy for Pydantic models is simple, centralized, and Pydantic-native. A future CPython built-in immutable mapping type (e.g. `frozendict`) would provide zero-friction field-level immutability when available. | -| **Config vs runtime split** | Adopted | Frozen models for config/identity; `model_copy(update=...)` for runtime state transitions | `TaskExecution` and `AgentContext` (in `engine/`) are frozen Pydantic models that use `model_copy(update=...)` for copy-on-write state transitions without re-running validators (per Pydantic `model_copy` semantics). Config layer (`AgentIdentity`, `Task`) remains unchanged. | -| **Derived fields** | Adopted | `@computed_field` instead of stored + validated | Eliminates redundant storage and impossible-to-fail validators. `TokenUsage.total_tokens` migrated from stored `Field` + `@model_validator` to `@computed_field` property. | -| **String validation** | Adopted | `NotBlankStr` type from `core.types` for all identifiers | Eliminates per-model `@model_validator` boilerplate for whitespace checks. All identifier/name fields use `NotBlankStr`; optional identifiers use `NotBlankStr \| None`; tuple fields use `tuple[NotBlankStr, ...]` for per-element validation. | -| **Shared field groups** | Adopted | Extracted common field sets into base models (e.g. `_SpendingTotals`) | Prevents field duplication across spending summary models. `_SpendingTotals` provides shared aggregation fields; `AgentSpending`, `DepartmentSpending`, `PeriodSpending` extend it. | -| **Event constants** | Adopted (per-domain) | Per-domain submodules under `events/` package (e.g. `events.provider`, `events.budget`). Import directly: `from ai_company.observability.events. import CONSTANT` | Split by domain for discoverability, co-location with domain logic, and reduced merge conflicts as constants grow. `__init__.py` serves as package marker with usage documentation; no re-exports. | -| **Parallel tool execution** | Adopted | `asyncio.TaskGroup` in `ToolInvoker.invoke_all` with optional `max_concurrency` semaphore | Structured concurrency with proper cancellation semantics. Fatal errors collected via guarded wrapper and re-raised after all tasks complete. | -| **Parallel agent execution** | Adopted | `ParallelExecutor` coordinates concurrent `AgentEngine.run()` calls via `asyncio.TaskGroup` + optional `Semaphore` concurrency limit + `_run_guarded()` error isolation. `ResourceLock` protocol with `InMemoryResourceLock` for exclusive file-path claims. Progress tracking via `ProgressCallback`. Shutdown-aware via `ShutdownManager` task registration. Fail-fast mode cancels sibling tasks on first failure; all errors are surfaced via `ParallelExecutionResult` outcomes. | Follows the `ToolInvoker.invoke_all()` pattern (parallel tool execution above). Composition over inheritance — wraps `AgentEngine`. Structured concurrency with proper cancellation. See §6.3 Parallel Execution. | -| **Tool permission checking** | Adopted | `ToolPermissionChecker` enforces category-level gating based on `ToolAccessLevel` (sandboxed → restricted → standard → elevated, plus custom). Priority-based resolution: denied list → allowed list → level categories → deny. Case-insensitive name matching. `ToolInvoker` filters definitions for prompt and checks at invocation time. | Defense-in-depth: agents only see permitted tools in the LLM prompt, and invocations are re-checked at execution time. Explicit allow/deny lists provide per-agent overrides. See §11.1.1. | -| **Tool sandboxing** | Adopted (incremental) | File system tools use in-process `PathValidator` for workspace-scoped path validation (symlink resolution + containment check). `BaseFileSystemTool` ABC provides shared `ToolCategory.FILE_SYSTEM` and `PathValidator` integration — all file system tools extend this base. `SandboxBackend` protocol with `SubprocessSandbox` implemented — git tools accept optional `SandboxBackend` injection and delegate subprocess management to it (env filtering, workspace enforcement, timeout + process-group kill). `DockerSandbox` planned for code_runner, terminal, web, and database tools. `K8sSandbox` planned for future container deployments. Config-driven per-category backend selection planned for engine wiring. | File system tools use defence-in-depth path validation; subprocess sandbox provides lightweight isolation for git tools; heavier Docker/K8s isolation reserved for higher-risk tool categories (code execution, network). See §11.1.2. | -| **Crash recovery** | Adopted | Pluggable `RecoveryStrategy` protocol. Current: `FailAndReassignStrategy` (catch at engine boundary, log snapshot, mark FAILED / eligible for reassignment). Planned: `CheckpointStrategy` (persist `AgentContext` per turn, resume from last checkpoint). | Immutable `model_copy` pattern makes checkpoint serialization trivial to add later. Fail-and-reassign is sufficient for short tasks. See §6.6. | -| **Personality compatibility scoring** | Adopted | Weighted composite: 60% Big Five similarity (openness, conscientiousness, agreeableness, stress_response → 1−\|diff\|; extraversion → tent-function peaking at 0.3 diff), 20% collaboration alignment (ordinal adjacency: INDEPENDENT↔PAIR↔TEAM), 20% conflict approach (constructive pairs score 1.0, destructive pairs 0.2, mixed 0.4–0.6). `itertools.combinations` for team-level averaging. Result clamped to [0, 1]. | Covers behavioral diversity (extraversion complement), task alignment (conscientiousness similarity), and interpersonal friction (conflict approach). Weights are configurable module constants. | -| **Agent behavior testing** | Planned | Scripted `FakeProvider` for unit tests (deterministic turn sequences); behavioral outcome assertions for integration tests (task completed, tools called, cost within budget). | Leverages existing `FakeProvider` and `CompletionResponseFactory` fixtures. Precise engine testing without brittle response-matching at integration level. | -| **LLM call analytics** | Adopted (incremental) | Proxy metrics (`turns_per_task`, `tokens_per_task`) — adopted. Data models for call categorization (`productive`, `coordination`, `system`), category analytics, coordination metrics, orchestration ratio — adopted. Runtime collection pipeline and full analytics: planned. | Append-only, never blocks execution. Builds on existing `CostRecord` infrastructure. Detects orchestration overhead early. See §10.5. | -| **Cost tiers & quota tracking** | Adopted | Configurable `CostTierDefinition` definitions with merge/override semantics via `resolve_tiers(config: CostTiersConfig)`. `SubscriptionConfig` + `QuotaLimit` model per-provider subscription plans. `QuotaTracker` enforces per-provider request/token quotas with window-based rotation. `DegradationConfig` controls behavior when quotas are exhausted (default: `ALERT` — raise error; `FALLBACK` and `QUEUE` strategies defined but not yet implemented). | Enables cost classification without hardcoding vendor tiers. Quota tracking prevents surprise overages at the provider level. Window-based rotation aligns quota resets with billing periods. See §10.4. | -| **Shared org memory** | Adopted | `OrgMemoryBackend` protocol (pluggable) with `HybridPromptRetrievalBackend` (Backend 1). `OrgFactStore` protocol with `SQLiteOrgFactStore` for persistent fact storage. Seniority-based write access control via `CategoryWriteRule`. Core policies injected into system prompts; extended facts retrieved on demand via `OrgMemoryQuery`. `OrgFact` model with `OrgFactAuthor` provenance tracking. Config-driven via `OrgMemoryConfig`. | Pluggable backend mirrors `MemoryBackend` pattern. Hybrid prompt+retrieval balances always-available core policies with on-demand extended knowledge. Seniority-based access control prevents junior agents from overwriting organizational knowledge. See §7.4. | -| **Memory consolidation** | Adopted | `ConsolidationStrategy` protocol with `SimpleConsolidationStrategy` (deduplication + summarization). `RetentionEnforcer` for per-category age-based cleanup via `RetentionRule` policies. `ArchivalStore` protocol for cold storage before deletion. `MemoryConsolidationService` orchestrates retention → consolidation → max-memories enforcement pipeline. `ConsolidationResult` tracks statistics. Config-driven via `ConsolidationConfig` + `RetentionConfig` + `ArchivalConfig`. | Prevents unbounded memory growth. Pluggable strategy enables different consolidation approaches (simple dedup now, LLM-based summarization later). Retention + archival ensures compliance with data lifecycle policies. See §7.4. | -| **State coordination** | Planned | Centralized single-writer: `TaskEngine` owns all task/project mutations via `asyncio.Queue`. Agents submit requests, engine applies `model_copy(update=...)` sequentially and publishes snapshots. `version: int` field on state models for future optimistic concurrency if multi-process scaling is needed. | Prevents lost updates by design. Trivial in single-threaded asyncio (no locks). Perfect audit trail. Industry consensus: MetaGPT, CrewAI, AutoGen all use prevention-by-design, not conflict resolution. See §6.8 State Coordination table. | -| **Workspace isolation** | Adopted | Pluggable `WorkspaceIsolationStrategy` protocol. Default: planner + git worktrees. Each agent works in an isolated worktree; sequential merge on completion. Textual conflicts detected by git; semantic conflicts reviewed by agent or human. Runtime multi-agent coordination wiring is planned. | Industry standard (Codex, Cursor, Claude Code, VS Code). Maximum parallelism. Leverages mature git infrastructure. See §6.8. | -| **Graceful shutdown** | Adopted | Pluggable `ShutdownStrategy` protocol. Default: cooperative with 30s timeout. Agents check shutdown event at turn boundaries. Force-cancel after timeout. `INTERRUPTED` status for force-cancelled tasks. Planned: upgrade to checkpoint-and-stop. | Cross-platform (Windows `signal.signal()` fallback). Bounded shutdown time. Mirrors cooperative shutdown in §6.7. | -| **Template inheritance** | Adopted | `extends` field on `CompanyTemplate` triggers parent resolution at render time. `merge.py` merges configs by field type: scalars (child wins), config dicts (deep merge), agents (by `(role, department, merge_id)` key with `_remove` support), departments (by name). `_ParentEntry` dataclass tracks merge state. `DEFAULT_MERGE_DEPARTMENT = "engineering"` shared between merge and renderer. Circular chains detected via `frozenset` tracking; max depth = 10. | Enables template composition without copy-paste. Merge-by-key preserves parent order. `_remove` directive enables clean agent removal without workarounds. | -| **Pydantic alias for YAML directives** | Adopted | `Field(alias="_remove")` in `TemplateAgentConfig` — YAML uses `_remove: true`, Python accesses `agent.remove`. Keeps the YAML-facing name (underscore prefix signals internal directive) separate from the Python attribute name. | Underscore-prefixed YAML keys signal merge directives vs regular fields. Pydantic alias bridges the naming convention gap cleanly. | -| **Communication foundation** | Adopted | `MessageBus` protocol with `InMemoryMessageBus` backend (asyncio queues, pull-model `receive()` with shutdown signaling via `asyncio.Event`). `MessageDispatcher` routes to concurrent handlers via `asyncio.TaskGroup` with pre-allocated error collection. `AgentMessenger` per-agent facade auto-fills sender/timestamp/ID; deterministic direct-channel naming `@{sorted_a}:{sorted_b}`. `DeliveryEnvelope` for delivery tracking. `NotBlankStr` validation on all protocol boundary identifiers. | Pull-model avoids callback complexity and enables agents to consume at their own pace. Protocol + backend split enables future persistent/distributed bus implementations. Deterministic DM channel names prevent duplicates. See §5. | -| **Delegation & loop prevention** | Adopted | `HierarchyResolver` resolves org hierarchy from `Company` at construction (cycle-detected, `MappingProxyType`-frozen). `AuthorityValidator` checks chain-of-command + role permissions. `DelegationGuard` orchestrates five mechanisms (ancestry, depth, dedup, rate limit, circuit breaker) in sequence, short-circuiting on first rejection. `DelegationService` is synchronous (CPU-only); messaging integration deferred. Stateful mechanisms use injectable clock for deterministic testing. Task model extended with `parent_task_id` and `delegation_chain` fields. | Synchronous delegation avoids async complexity for CPU-only validation. Five-mechanism guard provides defence-in-depth against all loop patterns. Injectable clocks enable deterministic testing. See §5.4, §5.5. | -| **Task assignment** | Adopted | `TaskAssignmentStrategy` protocol with six concrete strategies: Manual (pre-designated), RoleBased (capability scoring via `AgentTaskScorer`), LoadBalanced (workload-aware with score tiebreaker), CostOptimized (cheapest-agent with score tiebreaker), Hierarchical (subordinate delegation via `HierarchyResolver`), Auction (bid = score × availability). `TaskAssignmentService` orchestrates with status validation, structured logging, and `STRATEGY_MAP` registry (`MappingProxyType`-wrapped singletons; five strategies — Hierarchical requires `build_strategy_map(hierarchy=...)`). Inactive agents filtered during scoring. | Pluggable strategies behind a protocol mirror the execution loop and conflict resolution patterns. Reuses `AgentTaskScorer` from routing subsystem. `MappingProxyType` registry matches existing immutability conventions. See §6.4. | -| **Conflict resolution** | Adopted | `ConflictResolver` protocol with async `resolve()` + sync `build_dissent_records()` split (resolve may call LLM, dissent record is pure construction). Four strategies: `AuthorityResolver` (seniority comparison iterating all N positions, hierarchy proximity tiebreaker via `get_lowest_common_manager`), `DebateResolver` (LLM judge via `JudgeEvaluator` protocol, authority fallback when absent), `HumanEscalationResolver` (stub, returns `ESCALATED_TO_HUMAN`), `HybridResolver` (LLM review + ambiguity escalation/authority fallback). `ConflictResolutionService` follows `DelegationService` pattern (`__slots__`, keyword-only constructor, `MappingProxyType`-wrapped resolver mapping, audit trail). `DissentRecord` preserves losing agent's reasoning. `Conflict.is_cross_department` is a `@computed_field` derived from positions. `HierarchyResolver` extended with `get_lowest_common_manager()` and `get_delegation_depth()`. | Protocol + strategy pattern enables adding new resolution approaches without modifying existing code. Async resolve accommodates LLM calls; sync dissent record avoids unnecessary async overhead. Shared `find_losers` utility prevents code duplication across strategies. See §5.6. | - ---- - -## 16. Research & Prior Art - -### 16.1 Existing Frameworks Comparison - -| Framework | Stars | Architecture | Roles | Models | Memory | Custom Roles | Production Ready | -|-----------|-------|-------------|-------|--------|--------|-------------|-----------------| -| **MetaGPT** | 64.5k | SOP-driven pipeline | PM, Architect, Engineer, QA | OpenAI, Ollama, Groq, Azure | Limited | Partial | Research → MGX commercial | -| **ChatDev 2.0** | 31.2k | Zero-code visual workflows | CEO, CTO, Programmer, Tester, Designer | Multiple via config | Limited | Yes (YAML) | Improving (v2.0 Jan 2026) | -| **CrewAI** | ~50k+ | Role-based crews + flows | Fully custom | Multi-provider | Basic (crew memory) | Yes | Yes (100k+ developers) | -| **AutoGen** | ~40k+ | Conversation-driven async | Custom agents | OpenAI primary, others | Session-based | Yes | Transitioning to MS Agent Framework | -| **LangGraph** | Large | Graph-based DAG | Custom nodes | LangChain ecosystem | Stateful graphs | Yes (nodes) | Yes | -| **Smolagents** | Growing | Code-centric minimal | Code agent | HuggingFace ecosystem | Minimal | Yes | Rapid prototyping | - -### 16.2 What Exists vs What We Need - -| Feature | MetaGPT | ChatDev | CrewAI | **SynthOrg (Ours)** | -|---------|---------|---------|--------|----------------------| -| Full company simulation | Partial | Partial | No | **Yes - complete** | -| HR (hiring/firing) | No | No | No | **Yes** | -| Budget management (CFO) | No | No | No | **Yes** | -| Persistent agent memory | No | No | Basic | **Yes (Mem0 initial, custom stack future — ADR-001)** | -| Agent personalities | Basic | Basic | Basic | **Deep - traits, styles, evolution** | -| Dynamic team scaling | No | No | Manual | **Yes - auto + manual** | -| Multiple company types | No | No | Manual | **Yes - templates + builder** | -| Security ops agent | No | No | No | **Yes** | -| Configurable autonomy | No | No | Limited | **Yes - full spectrum** | -| Local + cloud providers | Partial | Partial | Partial | **Yes - unified abstraction (LiteLLM candidate)** | -| Cost tracking per agent | No | No | No | **Yes - full budget system** | -| Progressive trust | No | No | No | **Yes** | -| Performance metrics | No | No | No | **Yes** | -| MCP tool integration | No | No | Partial | **Yes** | -| A2A protocol support | No | No | No | **Planned** | -| Community marketplace | MGX (commercial) | No | No | **Planned (backlog)** | - -### 16.3 Agent Scaling Research - -[Kim et al., "Towards a Science of Scaling Agent Systems" (2025)](https://arxiv.org/abs/2512.08296) — 180 controlled experiments across 3 LLM families (OpenAI, Google, Anthropic), 4 agentic benchmarks, 5 coordination topologies. Key findings informing our design: - -- **Task decomposability is the #1 predictor** of multi-agent success. Parallelizable tasks gain up to +81%, sequential tasks degrade -39% to -70% under all MAS variants. Informs §6.9. -- **Coordination metrics suite** (efficiency, overhead, error amplification, message density, redundancy) explains 52.4% of performance variance (R²=0.524). Adopted in §10.5. -- **Tiered coordination overhead** (`O%`): optimal band 200–300%, over-coordination above 400%. Informs §10.5 interpretation of the `O%` metric. Note: the `orchestration_ratio` tiered alerts (info/warn/critical) measure a different ratio (coordination tokens / total tokens). -- **Error taxonomy** (logical contradiction, numerical drift, context omission, coordination failure) with architecture-specific patterns. Adopted as opt-in classification in §10.5. -- **Auto topology selection** achieves 87% accuracy from measurable task properties. Informs §6.9 auto topology selector. -- **Centralized verification** contains error amplification to 4.4× vs 17.2× for independent agents. Supports §6.9's centralized-topology guidance and §10.5's `Ae` metric interpretation. -- **Context:** Paper tested identical agents on individual tasks; our architecture uses role-differentiated agents in an organizational structure. Thresholds (e.g., 45% capability ceiling, 3–4 agent sweet spot) are directional — to be validated empirically in our context. - -### 16.4 Build vs Fork Decision - -**Recommendation: Build from scratch, leverage libraries.** - -Rationale: -- No existing framework covers even 50% of our requirements -- Our core differentiators (HR, budget, security ops, deep personalities, progressive trust) don't exist in any framework -- Forking MetaGPT or CrewAI would mean fighting their architecture while adding our features -- **LiteLLM**, **Litestar**, **MCP**, and **Mem0** (memory layer — ADR-001) give us battle-tested components for the hard parts -- The "company simulation" layer on top is our unique value and must be purpose-built - -What we **plan to leverage** (not fork) — subject to evaluation: -- **LiteLLM** (selected) - Provider abstraction -- **Mem0** (selected, ADR-001) - Agent memory (initial backend; custom stack future) -- **Litestar** (selected) - API layer (see §15.4 rationale) -- **MCP** - Tool integration standard (strong candidate, emerging industry standard) -- **Pydantic** (selected) - Config validation and data models -- **Web UI framework** - TBD (Vue 3, React, Svelte, HTMX all under consideration) - -> **Why Litestar over FastAPI?** Both are async-native Python frameworks with auto-generated OpenAPI docs and Pydantic support. FastAPI has a larger ecosystem and more community resources. However, Litestar provides significantly more built-in functionality that we would otherwise need to write and maintain ourselves: -> -> 1. **Channels plugin** — pub/sub WebSocket broadcasting with per-channel subscriptions, backpressure management, and subscriber backlog. FastAPI requires hand-rolling all WebSocket connection management. -> 2. **Class-based controllers** — group routes with shared guards, middleware, and configuration. Our 13 route groups map naturally to controllers. FastAPI only supports loose functions on routers. -> 3. **Native route guards** — declarative authorization at controller/route level. Essential for the approval queue and future security features. FastAPI requires `Depends()` on every route. -> 4. **Built-in middleware** — rate limiting, CSRF protection, GZip/Brotli compression, session handling, request logging. FastAPI requires third-party packages or custom code for each. -> 5. **Explicit dependency injection** — pytest-style named dependencies with scope control. Matches our testing approach. FastAPI's DI is implicit (function parameter magic). -> -> The ecosystem size gap is acceptable: our API is an internal orchestration interface, not a public web service. The bottleneck is LLM latency (seconds), not framework overhead (microseconds). Litestar's ~2x performance advantage in micro-benchmarks is a bonus, not the deciding factor. Python 3.14 is supported by both. - ---- - -## 17. Open Questions & Risks - -### 17.1 Open Questions - -| # | Question | Impact | Status | Notes | -|---|----------|--------|--------|-------| -| 1 | How deep should agent personality affect output? | Medium | Open | Too deep = inconsistent, too shallow = all agents feel the same | -| 2 | What is the optimal meeting format for multi-agent? | High | **Resolved** | Multiple configurable protocols — see §5.7 Meeting Protocol | -| 3 | How to handle context window limits for long tasks? | High | Open | Agents may lose track of complex multi-file changes | -| 4 | Should agents be able to create/modify other agents? | Medium | Open | CTO "hires" a dev by creating a new agent config | -| 5 | How to handle conflicting agent opinions? | High | **Resolved** | Multiple configurable strategies — see §5.6 Conflict Resolution Protocol | -| 6 | What metrics define "good" agent performance? | Medium | Open | Needed for HR/hiring/firing decisions | -| 7 | How to prevent agent communication loops? | High | **Resolved** | Implemented in §5.5 Loop Prevention | -| 8 | Optimal message bus for local-first architecture? | Medium | Open | asyncio queues vs Redis vs embedded broker | -| 9 | How to handle code execution safely? | High | **Resolved** | Layered sandboxing behind `SandboxBackend` protocol — see §11.1.2 Tool Sandboxing | -| 10 | What's the minimum viable meeting set? | Low | Open | Standup + planning + review as minimum? | -| 11 | What is the agent execution loop architecture? | High | **Resolved** | Multiple configurable loops — see §6.5 Agent Execution Loop | -| 12 | How should shared organizational memory work? | High | **Resolved** | Modular backends behind protocol — see §7.4 Shared Organizational Memory | -| 13 | What happens when humans don't respond to approvals? | High | **Resolved** | Configurable timeout policies with task suspension — see §12.4 Approval Timeout | -| 14 | Which memory layer library to use? | Medium | **Resolved** | Mem0 (initial) → custom stack (future) behind pluggable `MemoryBackend` protocol — see [ADR-001](decisions/ADR-001-memory-layer.md) | -| 15 | How to handle agent crashes mid-task? | High | **Resolved** | Pluggable `RecoveryStrategy` protocol — see §6.6 Agent Crash Recovery | -| 16 | How to test non-deterministic agent behavior? | High | **Resolved** | Scripted providers for unit tests + behavioral assertions for integration — see §15.5 Engineering Conventions | -| 17 | How to detect orchestration overhead? | Medium | **Resolved** | Incremental LLM call analytics with proxy metrics → full categorization — see §10.5 | - -### 17.2 Technical Risks - -| Risk | Severity | Mitigation | -|------|----------|------------| -| Context window exhaustion on complex tasks | High | Memory summarization, task decomposition, working memory management | -| Cost explosion from agent loops | High | Budget hard stops, loop detection, max iterations per task | -| Agent quality degradation with cheap models | Medium | Quality gates, minimum model requirements per task type | -| Third-party library breaking changes | Medium | Pin versions, integration tests, abstraction layers | -| Memory retrieval quality | Medium | Mem0 selected as initial backend (ADR-001). Protocol layer enables backend swap if retrieval quality insufficient. Pin version, test 3.14 compat in CI | -| Agent personality inconsistency | Low | Strong system prompts, few-shot examples, personality tests | -| WebSocket scaling | Low | Start local, add Redis pub/sub when needed | - -### 17.3 Architecture Risks - -| Risk | Severity | Mitigation | -|------|----------|------------| -| Over-engineering the MVP | High | Start with minimal viable company (3-5 agents), add complexity iteratively | -| Config format becoming unwieldy | Medium | Good defaults, layered config (base + overrides), validation | -| Agent execution bottlenecks | Medium | Async execution, parallel agent processing, queue-based | -| Data loss on crash | Medium | WAL mode SQLite, `RecoveryStrategy` protocol (§6.6): fail-and-reassign implemented, checkpoint recovery planned | -| Orchestration overhead exceeds productive work | Medium | LLM call analytics (§10.5): proxy metrics implemented, call categorization + orchestration ratio alerts planned | - ---- - -## 18. Backlog & Future Vision - -### 18.1 Future Features (Not for MVP) - -| Feature | Priority | Description | -|---------|----------|-------------| -| Community marketplace | Medium | Share/download company templates, roles, workflows | -| Network hosting | Medium | Expose on LAN/internet, multi-user access | -| Agent evolution | Medium | Agents improve over time based on feedback | -| Inter-company communication | Low | Two AI companies collaborating on a project | -| Voice interface | Low | Talk to your AI company via voice | -| Mobile app | Low | Monitor your company from phone | -| Plugin system | High | Third-party plugins for new tools, roles, providers | -| Benchmarking suite | Medium | Compare company configurations on standard tasks | -| Visual workflow editor | Medium | Drag-and-drop workflow design in Web UI | -| Multi-project support | High | Company handles multiple projects simultaneously | -| Client simulation | Low | AI "clients" that give requirements and review output | -| Training mode | Medium | New agents learn from senior agents' past work | -| ~~Conflict resolution protocol~~ | ~~High~~ | ~~Moved to core — see §5.6~~ | -| Agent promotions | Medium | Junior → Mid → Senior based on performance | -| Shift system | Low | Agents "work" in shifts, different agents for different hours | -| Reporting system | Medium | Weekly/monthly automated company reports | -| Integration APIs | Medium | Connect to real Slack, GitHub, Jira, Linear | -| Self-improving company | High | The AI company developing AI company (meta!) | - -### 18.2 Scaling Path - -```text -Phase 1: Local Single-Process - └── Async runtime, embedded DB, in-memory bus, 1-10 agents - -Phase 2: Local Multi-Process - └── External message bus, production DB, sandboxed execution, 10-30 agents - -Phase 3: Network/Server - └── Full API, multi-user, distributed agents, 30-100 agents - -Phase 4: Cloud/Hosted - └── Container orchestration, horizontal scaling, marketplace, 100+ agents -``` - ---- - -## Appendix A: Industry Standards Reference - -| Standard | Owner | Purpose | Our Usage | -|----------|-------|---------|-----------| -| **MCP** (Model Context Protocol) | Anthropic → Linux Foundation (AAIF) | LLM ↔ Tool integration | Tool system backbone | -| **A2A** (Agent-to-Agent Protocol) | Google → Linux Foundation | Agent ↔ Agent communication | Future agent interop | -| **OpenAI API format** | OpenAI (de facto standard) | LLM API interface | Via provider abstraction layer (LiteLLM candidate) | - -## Appendix B: Research Sources - -- [MetaGPT](https://github.com/FoundationAgents/MetaGPT) - Multi-agent SOP framework (64.5k stars) -- [ChatDev 2.0](https://github.com/openbmb/ChatDev) - Zero-code multi-agent platform (31.2k stars) -- [CrewAI](https://github.com/crewAIInc/crewAI) - Role-based agent collaboration framework -- [AutoGen](https://github.com/microsoft/autogen) - Microsoft async multi-agent framework -- [LiteLLM](https://github.com/BerriAI/litellm) - Unified LLM API gateway (100+ providers) -- [Mem0](https://github.com/mem0ai/mem0) - Universal memory layer for AI agents -- [A2A Protocol](https://github.com/a2aproject/A2A) - Agent-to-Agent protocol (Linux Foundation) -- [MCP Specification](https://modelcontextprotocol.io/specification/2025-11-25) - Model Context Protocol -- [Langfuse Agent Comparison](https://langfuse.com/blog/2025-03-19-ai-agent-comparison) - Framework comparison -- [Confluent Event-Driven Patterns](https://www.confluent.io/blog/event-driven-multi-agent-systems/) - Multi-agent architecture patterns -- [Microsoft Multi-Agent Reference Architecture](https://microsoft.github.io/multi-agent-reference-architecture/) - Enterprise patterns -- [OpenRouter](https://openrouter.ai/) - Multi-model API gateway -- [Kim et al., "Towards a Science of Scaling Agent Systems" (2025)](https://arxiv.org/abs/2512.08296) - Empirical agent scaling research (180 experiments, 3 LLM families) -- [Cemri et al., "Multi-Agent System Failure Taxonomy (MAST)" (2025)] - MAS coordination error classification -- [Gloaguen et al., "Evaluating AGENTS.md" (2026)](https://arxiv.org/abs/2602.11988) - Context files reduce success rates; non-inferable-only principle for system prompts diff --git a/docs/getting_started.md b/docs/getting_started.md index 586d292c67..84ebe6543a 100644 --- a/docs/getting_started.md +++ b/docs/getting_started.md @@ -124,7 +124,7 @@ synthorg/ web/ # Web UI scaffold (nginx + placeholder) .github/ # CI workflows, dependabot, actions pyproject.toml # Project config (deps, tools, linters) - DESIGN_SPEC.md # Full high-level design specification + DESIGN_SPEC.md # Pointer to design specification pages CLAUDE.md # AI assistant quick reference ``` @@ -152,4 +152,4 @@ VS Code should auto-detect the `.venv` directory. If not, use **Python: Select I - [CONTRIBUTING.md](https://github.com/Aureliolo/synthorg/blob/main/.github/CONTRIBUTING.md) — branch, commit, and PR workflow - [CLAUDE.md](https://github.com/Aureliolo/synthorg/blob/main/CLAUDE.md) — code conventions and quick command reference -- [Design Specification](design_spec.md) — full high-level design specification +- [Design Specification](design/index.md) — full high-level design specification diff --git a/docs/index.md b/docs/index.md index 0005134d49..1b6c660da5 100644 --- a/docs/index.md +++ b/docs/index.md @@ -4,6 +4,11 @@ SynthOrg lets you define agents with roles, hierarchy, budgets, and tools, then orchestrate them to collaborate on complex tasks as a virtual organization. +!!! warning "Under Active Development" + + SynthOrg is under active development. Many features described in the design specification + are planned but not yet implemented. See the [Roadmap](roadmap/index.md) for current status. + --- ## Get Started @@ -30,6 +35,72 @@ SynthOrg lets you define agents with roles, hierarchy, budgets, and tools, then --- +## Design Specification + +The design spec covers the full architecture of SynthOrg — from agent identity to budget enforcement: + +
+ +- **Design Overview** + + --- + + Vision, principles, core concepts, and glossary. + + [:octicons-arrow-right-24: Design Overview](design/index.md) + +- **Agents & HR** + + --- + + Agent identity, roles, hiring, performance tracking, promotions. + + [:octicons-arrow-right-24: Agents](design/agents.md) + +- **Organization & Templates** + + --- + + Company types, hierarchy, departments, template system. + + [:octicons-arrow-right-24: Organization](design/organization.md) + +- **Communication** + + --- + + Message bus, delegation, conflict resolution, meeting protocols. + + [:octicons-arrow-right-24: Communication](design/communication.md) + +- **Task & Workflow Engine** + + --- + + Task lifecycle, execution loops, routing, recovery, shutdown. + + [:octicons-arrow-right-24: Engine](design/engine.md) + +- **Memory & Persistence** + + --- + + Memory types, backends, retrieval pipeline, operational data. + + [:octicons-arrow-right-24: Memory](design/memory.md) + +- **Operations** + + --- + + LLM providers, budget, tools, security, human interaction. + + [:octicons-arrow-right-24: Operations](design/operations.md) + +
+ +--- + ## Key Features - **Agent Orchestration** — Define agents with roles, models, and tools. The engine handles task decomposition, routing, and collaboration. @@ -43,19 +114,19 @@ SynthOrg lets you define agents with roles, hierarchy, budgets, and tools, then --- -## Documentation +## Further Reading | Section | Description | |---------|-------------| -| [User Guide](user_guide.md) | Install, configure, and run SynthOrg | -| [Developer Setup](getting_started.md) | Clone, test, lint, and contribute | -| [Architecture](architecture/index.md) | System overview, design principles | +| [Architecture](architecture/index.md) | System overview, module map, design principles | +| [Tech Stack](architecture/tech-stack.md) | Technology choices and engineering conventions | +| [Decision Log](architecture/decisions.md) | All design decisions, organized by domain | | [API Reference](api/index.md) | Auto-generated from docstrings | +| [Roadmap](roadmap/index.md) | Status, open questions, future vision | --- ## Links - [GitHub Repository](https://github.com/Aureliolo/synthorg) -- [Design Specification](design_spec.md) - [License](https://github.com/Aureliolo/synthorg/blob/main/LICENSE) (BSL 1.1 → Apache 2.0 on 2030-02-27) diff --git a/docs/reference/research.md b/docs/reference/research.md new file mode 100644 index 0000000000..89a64f1c10 --- /dev/null +++ b/docs/reference/research.md @@ -0,0 +1,97 @@ +# Research & Prior Art + +## Existing Frameworks Comparison + +The following table compares major multi-agent frameworks that informed the design of SynthOrg. Star counts and version information as of March 2026. + +| Framework | Stars | Architecture | Roles | Models | Memory | Custom Roles | Production Ready | +|-----------|-------|-------------|-------|--------|--------|-------------|-----------------| +| **MetaGPT** | 64.5k | SOP-driven pipeline | PM, Architect, Engineer, QA | OpenAI, Ollama, Groq, Azure | Limited | Partial | Research; MGX commercial | +| **ChatDev 2.0** | 31.2k | Zero-code visual workflows | CEO, CTO, Programmer, Tester, Designer | Multiple via config | Limited | Yes (YAML) | Improving (v2.0 Jan 2026) | +| **CrewAI** | ~50k+ | Role-based crews + flows | Fully custom | Multi-provider | Basic (crew memory) | Yes | Yes (100k+ developers) | +| **AutoGen** | ~40k+ | Conversation-driven async | Custom agents | OpenAI primary, others | Session-based | Yes | Transitioning to MS Agent Framework | +| **LangGraph** | Large | Graph-based DAG | Custom nodes | LangChain ecosystem | Stateful graphs | Yes (nodes) | Yes | +| **Smolagents** | Growing | Code-centric minimal | Code agent | HuggingFace ecosystem | Minimal | Yes | Rapid prototyping | + +--- + +## What Exists vs What SynthOrg Provides + +| Feature | MetaGPT | ChatDev | CrewAI | **SynthOrg** | +|---------|---------|---------|--------|--------------| +| Full company simulation | Partial | Partial | No | **Yes -- complete** | +| HR (hiring/firing) | No | No | No | **Yes** | +| Budget management (CFO) | No | No | No | **Yes** | +| Persistent agent memory | No | No | Basic | **Yes (Mem0 initial, custom stack future)** | +| Agent personalities | Basic | Basic | Basic | **Deep -- traits, styles, evolution** | +| Dynamic team scaling | No | No | Manual | **Yes -- auto + manual** | +| Multiple company types | No | No | Manual | **Yes -- templates + builder** | +| Security ops agent | No | No | No | **Yes** | +| Configurable autonomy | No | No | Limited | **Yes -- full spectrum** | +| Local + cloud providers | Partial | Partial | Partial | **Yes -- unified abstraction (LiteLLM)** | +| Cost tracking per agent | No | No | No | **Yes -- full budget system** | +| Progressive trust | No | No | No | **Yes** | +| Performance metrics | No | No | No | **Yes** | +| MCP tool integration | No | No | Partial | **Yes** | +| A2A protocol support | No | No | No | **Planned** | +| Community marketplace | MGX (commercial) | No | No | **Planned** | + +--- + +## Agent Scaling Research + +[Kim et al., "Towards a Science of Scaling Agent Systems" (2025)](https://arxiv.org/abs/2512.08296) conducted 180 controlled experiments across 3 LLM families and 4 agentic benchmarks with 5 coordination topologies. Key findings that informed the SynthOrg design: + +- **Task decomposability is the primary predictor** of multi-agent success. Parallelizable tasks gain up to +81%, while sequential tasks degrade -39% to -70% under all multi-agent system variants. This directly informs the task decomposition subsystem. +- **Coordination metrics suite** (efficiency, overhead, error amplification, message density, redundancy) explains 52.4% of performance variance (R^2=0.524). Adopted in the LLM call analytics system. +- **Tiered coordination overhead** (`O%`): optimal band is 200--300%, with over-coordination above 400%. Informs the orchestration ratio metric interpretation. +- **Error taxonomy** (logical contradiction, numerical drift, context omission, coordination failure) with architecture-specific patterns. Adopted as opt-in classification in the coordination error classification pipeline. +- **Auto topology selection** achieves 87% accuracy from measurable task properties. Informs the auto topology selector in the task routing subsystem. +- **Centralized verification** contains error amplification to 4.4x vs 17.2x for independent agents. + +!!! note "Applicability" + + The paper tested identical agents on individual tasks. SynthOrg uses role-differentiated agents in an organizational structure. Thresholds (e.g., 45% capability ceiling, 3--4 agent sweet spot) are directional and will be validated empirically in this context. + +--- + +## Build vs Fork Decision + +**Decision: Build from scratch, leverage libraries.** + +No existing framework covers even 50% of SynthOrg's requirements. The core differentiators -- HR, budget management, security ops, deep personalities, progressive trust -- do not exist in any framework. Forking MetaGPT or CrewAI would mean fighting their architecture while adding these features. + +The "company simulation" layer on top is the unique value and must be purpose-built. + +### Libraries Leveraged + +Rather than forking a framework, SynthOrg builds on battle-tested libraries: + +| Library | Role | +|---------|------| +| **LiteLLM** | Provider abstraction (100+ providers, unified API) | +| **Mem0** | Agent memory (initial backend; custom stack future) | +| **Litestar** | API layer (see [Tech Stack](../architecture/tech-stack.md#why-litestar-over-fastapi) for rationale) | +| **MCP** | Tool integration standard | +| **Pydantic** | Config validation and data models | +| **Vue 3** | Web UI framework (see [Tech Stack](../architecture/tech-stack.md)) | + +--- + +## Sources + +- [MetaGPT](https://github.com/FoundationAgents/MetaGPT) -- Multi-agent SOP framework (64.5k stars) +- [ChatDev 2.0](https://github.com/openbmb/ChatDev) -- Zero-code multi-agent platform (31.2k stars) +- [CrewAI](https://github.com/crewAIInc/crewAI) -- Role-based agent collaboration framework +- [AutoGen](https://github.com/microsoft/autogen) -- Microsoft async multi-agent framework +- [LiteLLM](https://github.com/BerriAI/litellm) -- Unified LLM API gateway (100+ providers) +- [Mem0](https://github.com/mem0ai/mem0) -- Universal memory layer for AI agents +- [A2A Protocol](https://github.com/a2aproject/A2A) -- Agent-to-Agent protocol (Linux Foundation) +- [MCP Specification](https://modelcontextprotocol.io/specification/2025-11-25) -- Model Context Protocol +- [Langfuse Agent Comparison](https://langfuse.com/blog/2025-03-19-ai-agent-comparison) -- Framework comparison +- [Confluent Event-Driven Patterns](https://www.confluent.io/blog/event-driven-multi-agent-systems/) -- Multi-agent architecture patterns +- [Microsoft Multi-Agent Reference Architecture](https://microsoft.github.io/multi-agent-reference-architecture/) -- Enterprise patterns +- [OpenRouter](https://openrouter.ai/) -- Multi-model API gateway +- [Kim et al., "Towards a Science of Scaling Agent Systems" (2025)](https://arxiv.org/abs/2512.08296) -- Empirical agent scaling research (180 experiments, 3 LLM families) +- Cemri et al., "Multi-Agent System Failure Taxonomy (MAST)" (2025) -- MAS coordination error classification +- [Gloaguen et al., "Evaluating AGENTS.md" (2026)](https://arxiv.org/abs/2602.11988) -- Context files reduce success rates; non-inferable-only principle for system prompts diff --git a/docs/reference/standards.md b/docs/reference/standards.md new file mode 100644 index 0000000000..57a6dae728 --- /dev/null +++ b/docs/reference/standards.md @@ -0,0 +1,41 @@ +# Industry Standards + +SynthOrg aligns with emerging industry standards for agent-to-tool and agent-to-agent communication. This page describes the standards used and how they integrate into the framework. + +## Standards Overview + +| Standard | Owner | Purpose | SynthOrg Usage | +|----------|-------|---------|----------------| +| **MCP** (Model Context Protocol) | Anthropic, now Linux Foundation (AAIF) | Standardized LLM-to-tool integration | Tool system backbone | +| **A2A** (Agent-to-Agent Protocol) | Google, now Linux Foundation | Agent-to-agent communication | Future agent interoperability | +| **OpenAI API format** | OpenAI (de facto standard) | LLM API | Via provider abstraction layer (LiteLLM) | + +--- + +## Model Context Protocol (MCP) + +MCP provides a standardized interface for LLM agents to discover and invoke external tools. SynthOrg uses the official MCP SDK (`mcp` Python package) as the backbone of its tool integration system. + +The MCP bridge subsystem (`tools/mcp/`) connects to MCP-compliant tool servers, discovers available tools at runtime, and exposes them through the same `BaseTool` interface used by built-in tools. This means agents interact with MCP tools identically to native tools -- through the `ToolInvoker` with the same permission checking and sandboxing applied. + +Key integration points: + +- **`MCPToolFactory`** connects to configured MCP servers in parallel and creates `MCPBridgeTool` wrappers +- **`MCPBridgeTool`** implements `BaseTool`, mapping MCP tool schemas to the internal tool interface +- **Result caching** with configurable TTL and LRU eviction reduces redundant tool calls + +--- + +## Agent-to-Agent Protocol (A2A) + +The A2A protocol defines how autonomous agents discover each other's capabilities and delegate tasks across organizational boundaries. SynthOrg's communication layer is designed to be A2A-compatible for future inter-agent interoperability. + +The framework currently uses an internal message bus for inter-agent communication within a single organization. A2A support is planned for scenarios where multiple synthetic organizations need to collaborate, or where SynthOrg agents need to interact with agents from other frameworks. + +--- + +## OpenAI API Format + +The OpenAI chat completions API format has become the de facto standard for LLM interactions. SynthOrg accesses this format through LiteLLM, which provides a unified interface across 100+ providers that all speak the OpenAI API format (or are translated to it). + +This means SynthOrg is not coupled to any single LLM provider. Switching between providers is a configuration change, not a code change. The provider abstraction layer handles request/response mapping, cost tracking, retries, fallbacks, and rate limiting transparently. diff --git a/docs/roadmap/future-vision.md b/docs/roadmap/future-vision.md new file mode 100644 index 0000000000..ca4592ee4a --- /dev/null +++ b/docs/roadmap/future-vision.md @@ -0,0 +1,47 @@ +# Future Vision + +These features are not part of the MVP. They represent the longer-term direction for SynthOrg once the core framework is stable. + +## Future Features + +| Feature | Priority | Description | +|---------|----------|-------------| +| Plugin system | High | Third-party plugins for new tools, roles, and providers. | +| Multi-project support | High | Company handles multiple projects simultaneously. | +| Self-improving company | High | The AI company developing the AI company framework (meta). | +| Community marketplace | Medium | Share and download company templates, roles, and workflows. | +| Network hosting | Medium | Expose on LAN/internet with multi-user access. | +| Agent evolution | Medium | Agents improve over time based on feedback. | +| Benchmarking suite | Medium | Compare company configurations on standard tasks. | +| Visual workflow editor | Medium | Drag-and-drop workflow design in the Web UI. | +| Agent promotions (extended) | Medium | Advanced promotion features: peer review integration, multi-dimensional criteria weighting, team-wide calibration. Core promotion system is [implemented](../design/agents.md#promotions-demotions). | +| Reporting system | Medium | Weekly/monthly automated company reports. | +| Training mode | Medium | New agents learn from senior agents' past work. | +| Integration APIs | Medium | Connect to real Slack, GitHub, Jira, Linear. | +| Inter-company communication | Low | Two AI companies collaborating on a project. | +| Voice interface | Low | Talk to the AI company via voice. | +| Mobile app | Low | Monitor the company from a phone. | +| Client simulation | Low | AI "clients" that give requirements and review output. | +| Shift system | Low | Agents "work" in shifts, different agents for different hours. | + +--- + +## Scaling Path + +SynthOrg is designed to scale incrementally from a local single-process deployment to a fully hosted cloud platform. + +```text +Phase 1: Local Single-Process + └── Async runtime, embedded DB, in-memory bus, 1-10 agents + +Phase 2: Local Multi-Process + └── External message bus, production DB, sandboxed execution, 10-30 agents + +Phase 3: Network/Server + └── Full API, multi-user, distributed agents, 30-100 agents + +Phase 4: Cloud/Hosted + └── Container orchestration, horizontal scaling, marketplace, 100+ agents +``` + +Each phase builds on the previous one. The pluggable protocol interfaces throughout the codebase (persistence, memory, message bus, sandbox) are designed to make these transitions configuration changes rather than rewrites. diff --git a/docs/roadmap/index.md b/docs/roadmap/index.md new file mode 100644 index 0000000000..39d0419fca --- /dev/null +++ b/docs/roadmap/index.md @@ -0,0 +1,37 @@ +# Roadmap + +## Current Status + +The SynthOrg core framework is complete. The following subsystems are built and tested: + +- Provider abstraction layer (LiteLLM adapter, routing, resilience) +- Budget and cost management (tracking, enforcement, CFO optimization, quotas) +- Agent engine (execution loops, parallel execution, task decomposition, routing, assignment, recovery, shutdown) +- Communication layer (message bus, delegation, loop prevention, conflict resolution, meeting protocol) +- Memory system (pluggable backend protocol, retrieval pipeline, shared org memory, consolidation) +- Security and approval system (rule engine, output scanning, progressive trust, autonomy levels, timeout policies) +- Tool system (file system, git, code runner, MCP bridge, sandboxing, permissions) +- HR engine (hiring, firing, onboarding, offboarding, registry, performance tracking, promotions) +- REST and WebSocket API (Litestar controllers, JWT + API key auth, WebSocket channels) +- Persistence layer (pluggable protocol, SQLite backend, repository protocols) +- Observability (structured logging, correlation tracking, per-domain event constants) +- Configuration (YAML loading, Pydantic validation, company templates with inheritance) +- Container packaging (Docker, Chainguard distroless, CI/CD pipelines) + +## Remaining Work + +| Area | Description | +|------|-------------| +| **Mem0 adapter** | Concrete `MemoryBackend` implementation using the Mem0 library | +| **Approval workflow gates** | Runtime wiring for human-in-the-loop approval queues | +| **CLI** | Terminal interface wrapping the REST API (may not be needed) | +| **Web dashboard** | Vue 3 frontend for monitoring and managing the synthetic organization | + +## Tracking + +Implementation issues are tracked on the [GitHub issue tracker](https://github.com/Aureliolo/synthorg/issues) and prioritized by dependency order. + +## Further Reading + +- [Open Questions & Risks](open-questions.md) -- unresolved design questions and identified risks +- [Future Vision](future-vision.md) -- post-MVP features and the scaling path diff --git a/docs/roadmap/open-questions.md b/docs/roadmap/open-questions.md new file mode 100644 index 0000000000..22e80e0a81 --- /dev/null +++ b/docs/roadmap/open-questions.md @@ -0,0 +1,42 @@ +# Open Questions & Risks + +## Open Questions + +The following design questions remain unresolved. Each carries potential impact on architecture or behavior and will be addressed as the project progresses. + +Numbers are stable identifiers — resolved questions are removed without renumbering to preserve cross-references. + +| # | Question | Impact | Notes | +|---|----------|--------|-------| +| 1 | How deep should agent personality affect output? | Medium | Too deep leads to inconsistency; too shallow makes all agents feel the same. | +| 3 | How to handle context window limits for long tasks? | High | Agents may lose track of complex multi-file changes. | +| 4 | Should agents be able to create/modify other agents? | Medium | For example, a CTO "hires" a developer by creating a new agent config. | +| 6 | What metrics define "good" agent performance? | Medium | Needed for HR/hiring/firing decisions. | +| 8 | Optimal message bus for local-first architecture? | Medium | asyncio queues vs Redis vs embedded broker. | +| 10 | What is the minimum viable meeting set? | Low | Standup + planning + review as a starting point? | + +--- + +## Technical Risks + +| Risk | Severity | Mitigation | +|------|----------|------------| +| Context window exhaustion on complex tasks | High | Memory summarization, task decomposition, working memory management. | +| Cost explosion from agent loops | High | Budget hard stops, loop detection, max iterations per task. | +| Agent quality degradation with cheap models | Medium | Quality gates, minimum model requirements per task type. | +| Third-party library breaking changes | Medium | Pin versions, integration tests, abstraction layers. | +| Memory retrieval quality | Medium | Mem0 selected as initial backend (see [Decision Log](../architecture/decisions.md)). Protocol layer enables backend swap if retrieval quality is insufficient. Pin version, test Python 3.14 compatibility in CI. | +| Agent personality inconsistency | Low | Strong system prompts, few-shot examples, personality tests. | +| WebSocket scaling | Low | Start local, add Redis pub/sub when needed. | + +--- + +## Architecture Risks + +| Risk | Severity | Mitigation | +|------|----------|------------| +| Over-engineering the MVP | High | Start with a minimal viable company (3--5 agents), add complexity iteratively. | +| Config format becoming unwieldy | Medium | Good defaults, layered config (base + overrides), validation. | +| Agent execution bottlenecks | Medium | Async execution, parallel agent processing, queue-based architecture. | +| Data loss on crash | Medium | WAL mode SQLite. `RecoveryStrategy` protocol: fail-and-reassign implemented, checkpoint recovery planned. | +| Orchestration overhead exceeds productive work | Medium | LLM call analytics: proxy metrics implemented, call categorization and orchestration ratio alerts planned. | diff --git a/docs/user_guide.md b/docs/user_guide.md index d0c7876ae5..1b4a59e791 100644 --- a/docs/user_guide.md +++ b/docs/user_guide.md @@ -11,12 +11,13 @@ How to run SynthOrg. ```bash git clone https://github.com/Aureliolo/synthorg cd synthorg +cp docker/.env.example docker/.env docker compose -f docker/compose.yml up -d ``` The web dashboard is at [http://localhost:3000](http://localhost:3000). -All configuration — LLM provider keys, organization setup, templates — is managed through the dashboard. +Container configuration (ports, storage paths, log level) is defined in `docker/.env`. Organization setup and templates will be configurable through the dashboard once available. !!! danger "Work in Progress" SynthOrg is under active development. The web dashboard, templates, and many features described here are **not yet available**. Check the [GitHub repository](https://github.com/Aureliolo/synthorg) for current status. @@ -43,4 +44,4 @@ docker compose -f docker/compose.yml down - Templates — Full list of pre-built configurations (coming soon) - REST API — Interact with your org via the API (coming soon) -- [Design Specification](design_spec.md) — Full architecture details +- [Design Specification](design/index.md) — Full architecture details diff --git a/mkdocs.yml b/mkdocs.yml index 7971f0b431..e60ed26dd0 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -102,10 +102,25 @@ nav: - Home: index.md - User Guide: user_guide.md - Developer Setup: getting_started.md - - Design Specification: design_spec.md + - Design: + - design/index.md + - Agents: design/agents.md + - Organization: design/organization.md + - Communication: design/communication.md + - Engine: design/engine.md + - Memory: design/memory.md + - Operations: design/operations.md - Architecture: - architecture/index.md - - Design Decisions: architecture/decisions.md + - Tech Stack: architecture/tech-stack.md + - Decision Log: architecture/decisions.md + - Roadmap: + - roadmap/index.md + - Open Questions: roadmap/open-questions.md + - Future Vision: roadmap/future-vision.md + - Reference: + - reference/research.md + - reference/standards.md - API Reference: - api/index.md - Core: api/core.md @@ -124,6 +139,7 @@ nav: - Observability: api/observability.md extra: + homepage: / social: - icon: fontawesome/brands/github link: https://github.com/Aureliolo/synthorg diff --git a/site/src/layouts/Base.astro b/site/src/layouts/Base.astro index 8ea6286bee..8ce6b50836 100644 --- a/site/src/layouts/Base.astro +++ b/site/src/layouts/Base.astro @@ -29,5 +29,7 @@ const { + + diff --git a/site/src/pages/index.astro b/site/src/pages/index.astro index 6734428a4a..214f52abb7 100644 --- a/site/src/pages/index.astro +++ b/site/src/pages/index.astro @@ -31,7 +31,7 @@ import Base from "../layouts/Base.astro";

- @@ -249,17 +250,18 @@ import Base from "../layouts/Base.astro";

Open Source

- BSL 1.1 — converts to Apache 2.0 on 2030-02-27. - Our 3500-line design specification is public. + Everything is open — code, design spec, + architecture docs, and + roadmap. + Licensed under BSL 1.1, converting to Apache 2.0 on 2030-02-27.

-
- - Star on GitHub - +
+ +
+ Star + Fork +
like a team. -
diff --git a/src/ai_company/budget/__init__.py b/src/ai_company/budget/__init__.py index 2972039f27..bd798a96d5 100644 --- a/src/ai_company/budget/__init__.py +++ b/src/ai_company/budget/__init__.py @@ -2,7 +2,7 @@ This module provides the domain models for budget configuration, cost tracking, budget hierarchy, and spending summaries as described in -DESIGN_SPEC Section 10. +the Operations design page. """ from ai_company.budget.billing import billing_period_start, daily_period_start diff --git a/src/ai_company/budget/config.py b/src/ai_company/budget/config.py index b345770a10..eb10e23bc4 100644 --- a/src/ai_company/budget/config.py +++ b/src/ai_company/budget/config.py @@ -1,6 +1,6 @@ """Budget configuration models. -Implements DESIGN_SPEC Section 10.4: cost controls including alert +Implements the Cost Controls section of the Operations design page: alert thresholds, per-task and per-agent limits, and automatic model downgrade. """ @@ -75,7 +75,7 @@ class AutoDowngradeConfig(BaseModel): threshold: Budget percent that triggers downgrade. downgrade_map: Ordered pairs of (from_alias, to_alias). boundary: When to apply downgrade (task_assignment only, - never mid-execution per DESIGN_SPEC §10.4). + never mid-execution per the Operations design page). """ model_config = ConfigDict(frozen=True) diff --git a/src/ai_company/budget/coordination_metrics.py b/src/ai_company/budget/coordination_metrics.py index ef70ba4b2f..3fbc68e754 100644 --- a/src/ai_company/budget/coordination_metrics.py +++ b/src/ai_company/budget/coordination_metrics.py @@ -1,7 +1,7 @@ """Coordination metrics for multi-agent system tuning. Pure computation functions for five coordination metrics defined in -DESIGN_SPEC (Coordination Metrics): efficiency, overhead, error +the Operations design page (Coordination Metrics): efficiency, overhead, error amplification, message density, and redundancy rate. """ diff --git a/src/ai_company/budget/cost_record.py b/src/ai_company/budget/cost_record.py index c3106f0529..de11490808 100644 --- a/src/ai_company/budget/cost_record.py +++ b/src/ai_company/budget/cost_record.py @@ -1,7 +1,8 @@ """Cost record model for per-API-call tracking. -Implements DESIGN_SPEC Section 10.2: every API call is tracked as an -immutable cost record (append-only pattern). +Implements the Cost Tracking section of the Operations design page: +every API call is tracked as an immutable cost record +(append-only pattern). """ from typing import Self diff --git a/src/ai_company/budget/enforcer.py b/src/ai_company/budget/enforcer.py index 02c645b7ab..0db70e6e99 100644 --- a/src/ai_company/budget/enforcer.py +++ b/src/ai_company/budget/enforcer.py @@ -3,7 +3,7 @@ Composes :class:`~ai_company.budget.tracker.CostTracker` and :class:`~ai_company.budget.config.BudgetConfig` to provide pre-flight checks, in-flight budget checking, and task-boundary auto-downgrade -as described in DESIGN_SPEC Section 10.4. +as described in the Cost Controls section of the Operations design page. """ from typing import TYPE_CHECKING, NamedTuple diff --git a/src/ai_company/budget/hierarchy.py b/src/ai_company/budget/hierarchy.py index 03f7bd419c..9c1e1df5d9 100644 --- a/src/ai_company/budget/hierarchy.py +++ b/src/ai_company/budget/hierarchy.py @@ -1,6 +1,6 @@ """Budget hierarchy models. -Implements DESIGN_SPEC Section 10.1: budget allocation hierarchy from +Implements the Budget Hierarchy section of the Operations design page: Company to Department to Team, with percentage-based allocation at each level. """ @@ -94,7 +94,7 @@ class BudgetHierarchy(BaseModel): """Company-wide budget hierarchy. Maps the Company -> Department -> Team nesting from a budget - allocation perspective (DESIGN_SPEC 10.1). Department budget + allocation perspective (see Operations design page). Department budget percentages may sum to less than 100% to allow for an unallocated reserve at the company level. diff --git a/src/ai_company/budget/optimizer.py b/src/ai_company/budget/optimizer.py index fd60502b3d..02be8a08af 100644 --- a/src/ai_company/budget/optimizer.py +++ b/src/ai_company/budget/optimizer.py @@ -8,7 +8,7 @@ queries — the advisory complement to :class:`~ai_company.budget.enforcer.BudgetEnforcer`. -Service layer backing the CFO role (DESIGN_SPEC Section 10.3). +Service layer backing the CFO role (see Operations design page). """ import asyncio diff --git a/src/ai_company/budget/reports.py b/src/ai_company/budget/reports.py index 914009ed24..77f379aaaa 100644 --- a/src/ai_company/budget/reports.py +++ b/src/ai_company/budget/reports.py @@ -5,7 +5,7 @@ :class:`~ai_company.budget.tracker.CostTracker` and :class:`~ai_company.budget.config.BudgetConfig`. -Service layer backing CFO reporting (DESIGN_SPEC Section 10.3). +Service layer backing CFO reporting (see Operations design page). """ import math diff --git a/src/ai_company/budget/spending_summary.py b/src/ai_company/budget/spending_summary.py index 3ac1ab8cb4..66c13beb38 100644 --- a/src/ai_company/budget/spending_summary.py +++ b/src/ai_company/budget/spending_summary.py @@ -2,7 +2,7 @@ Provides the aggregation data structures used by :class:`~ai_company.budget.tracker.CostTracker` for cost reporting and -designed for consumption by the CFO agent (DESIGN_SPEC Section 10.3). +designed for consumption by the CFO agent (see Operations design page). Views of :class:`~ai_company.budget.cost_record.CostRecord` data are aggregated by agent, department, and time period. """ diff --git a/src/ai_company/budget/tracker.py b/src/ai_company/budget/tracker.py index 1a0bbed677..68c80c786d 100644 --- a/src/ai_company/budget/tracker.py +++ b/src/ai_company/budget/tracker.py @@ -3,7 +3,7 @@ Provides an append-only in-memory store for :class:`CostRecord` entries and aggregation queries consumed by the CFO agent and budget monitoring. -Service layer for the cost tracking schema defined in DESIGN_SPEC Section 10.2. +Service layer for the cost tracking schema defined in the Operations design page. The current implementation is purely in-memory; persistence integration is planned. """ diff --git a/src/ai_company/communication/bus_memory.py b/src/ai_company/communication/bus_memory.py index c2a3f593de..9d8da2f7da 100644 --- a/src/ai_company/communication/bus_memory.py +++ b/src/ai_company/communication/bus_memory.py @@ -1,4 +1,4 @@ -"""In-memory message bus implementation (DESIGN_SPEC Section 5.4). +"""In-memory message bus implementation (see Communication design page). Default backend using asyncio primitives. Suitable for single-process deployments and testing. diff --git a/src/ai_company/communication/bus_protocol.py b/src/ai_company/communication/bus_protocol.py index 6293049529..6a768079ab 100644 --- a/src/ai_company/communication/bus_protocol.py +++ b/src/ai_company/communication/bus_protocol.py @@ -1,4 +1,4 @@ -"""Message bus protocol (DESIGN_SPEC Section 5.4). +"""Message bus protocol (see Communication design page). Defines the swappable interface for message bus backends. The default implementation is :class:`InMemoryMessageBus` in diff --git a/src/ai_company/communication/config.py b/src/ai_company/communication/config.py index 3610df36de..cdbe21c4d2 100644 --- a/src/ai_company/communication/config.py +++ b/src/ai_company/communication/config.py @@ -1,4 +1,4 @@ -"""Communication configuration models (DESIGN_SPEC Sections 5.4, 5.5).""" +"""Communication configuration models (see Communication design page).""" from collections import Counter from typing import Literal, Self @@ -18,7 +18,7 @@ validate_unique_strings, ) -# Default channels from DESIGN_SPEC Section 5.4. +# Default channels from the Communication design page. _DEFAULT_CHANNELS: tuple[str, ...] = ( "#all-hands", "#engineering", @@ -49,7 +49,7 @@ class MessageRetentionConfig(BaseModel): class MessageBusConfig(BaseModel): """Message bus backend configuration. - Maps to DESIGN_SPEC Section 5.4 ``message_bus``. + Maps to the Communication design page ``message_bus``. Attributes: backend: Transport backend to use. @@ -82,7 +82,7 @@ def _validate_channels(self) -> Self: class MeetingTypeConfig(BaseModel): """Configuration for a single meeting type. - Maps to DESIGN_SPEC Section 5.4 ``meetings.types[]``. Exactly one of + Maps to the Communication design page ``meetings.types[]``. Exactly one of ``frequency`` or ``trigger`` must be set. Attributes: @@ -139,7 +139,7 @@ def _validate_participants(self) -> Self: class MeetingsConfig(BaseModel): """Meetings subsystem configuration. - Maps to DESIGN_SPEC Section 5.4 ``meetings``. + Maps to the Communication design page ``meetings``. Attributes: enabled: Whether the meetings subsystem is active. @@ -168,7 +168,7 @@ def _validate_unique_meeting_names(self) -> Self: class HierarchyConfig(BaseModel): """Hierarchy enforcement configuration. - Maps to DESIGN_SPEC Section 5.4 ``hierarchy``. + Maps to the Communication design page ``hierarchy``. Attributes: enforce_chain_of_command: Whether chain-of-command is enforced. @@ -190,7 +190,7 @@ class HierarchyConfig(BaseModel): class RateLimitConfig(BaseModel): """Per-pair message rate limit configuration. - Maps to DESIGN_SPEC Section 5.5 ``rate_limit``. + Maps to the Communication design page ``rate_limit``. Attributes: max_per_pair_per_minute: Maximum messages per agent pair per minute. @@ -214,7 +214,7 @@ class RateLimitConfig(BaseModel): class CircuitBreakerConfig(BaseModel): """Circuit breaker configuration for agent-pair communication. - Maps to DESIGN_SPEC Section 5.5 ``circuit_breaker``. + Maps to the Communication design page ``circuit_breaker``. Attributes: bounce_threshold: Bounce count before the circuit opens. @@ -238,7 +238,7 @@ class CircuitBreakerConfig(BaseModel): class LoopPreventionConfig(BaseModel): """Loop prevention safeguards. - Maps to DESIGN_SPEC Section 5.5. ``ancestry_tracking`` is always on + Maps to the Communication design page. ``ancestry_tracking`` is always on and cannot be disabled. Attributes: @@ -278,7 +278,7 @@ class LoopPreventionConfig(BaseModel): class CommunicationConfig(BaseModel): """Top-level communication configuration. - Aggregates DESIGN_SPEC Sections 5.4 and 5.5 under a single model. + Aggregates the Communication design page sections under a single model. Attributes: default_pattern: High-level communication pattern. @@ -312,5 +312,5 @@ class CommunicationConfig(BaseModel): ) conflict_resolution: ConflictResolutionConfig = Field( default_factory=ConflictResolutionConfig, - description="Conflict resolution configuration (DESIGN_SPEC §5.6)", + description="Conflict resolution configuration (see Communication design page)", ) diff --git a/src/ai_company/communication/conflict_resolution/__init__.py b/src/ai_company/communication/conflict_resolution/__init__.py index 080f1cf037..3147909873 100644 --- a/src/ai_company/communication/conflict_resolution/__init__.py +++ b/src/ai_company/communication/conflict_resolution/__init__.py @@ -1,4 +1,4 @@ -"""Conflict resolution subsystem (DESIGN_SPEC §5.6). +"""Conflict resolution subsystem (see Communication design page). Strategy implementations (``AuthorityResolver``, ``DebateResolver``, ``HumanEscalationResolver``, ``HybridResolver``) are imported directly diff --git a/src/ai_company/communication/conflict_resolution/authority_strategy.py b/src/ai_company/communication/conflict_resolution/authority_strategy.py index b08a63fb47..527cb42d86 100644 --- a/src/ai_company/communication/conflict_resolution/authority_strategy.py +++ b/src/ai_company/communication/conflict_resolution/authority_strategy.py @@ -1,4 +1,4 @@ -"""Authority + dissent log conflict resolution strategy (DESIGN_SPEC §5.6). +"""Authority + dissent log conflict resolution strategy (see Communication design page). Strategy 1: The agent with higher seniority wins. For equal seniority, hierarchy position decides — using the lowest common manager for diff --git a/src/ai_company/communication/conflict_resolution/config.py b/src/ai_company/communication/conflict_resolution/config.py index 44198ea94e..1d3281797c 100644 --- a/src/ai_company/communication/conflict_resolution/config.py +++ b/src/ai_company/communication/conflict_resolution/config.py @@ -1,4 +1,4 @@ -"""Conflict resolution configuration models (DESIGN_SPEC §5.6).""" +"""Conflict resolution configuration models (see Communication design page).""" from pydantic import BaseModel, ConfigDict, Field diff --git a/src/ai_company/communication/conflict_resolution/debate_strategy.py b/src/ai_company/communication/conflict_resolution/debate_strategy.py index 8c5a2bb0ec..5ff89a6196 100644 --- a/src/ai_company/communication/conflict_resolution/debate_strategy.py +++ b/src/ai_company/communication/conflict_resolution/debate_strategy.py @@ -1,4 +1,6 @@ -"""Structured debate + judge conflict resolution strategy (DESIGN_SPEC §5.6). +"""Structured debate + judge conflict resolution strategy. + +See the Communication design page for background. Strategy 2: A judge evaluates both positions and picks a winner. If a ``JudgeEvaluator`` is provided, it uses LLM-based judging. diff --git a/src/ai_company/communication/conflict_resolution/human_strategy.py b/src/ai_company/communication/conflict_resolution/human_strategy.py index 63bb370d42..9994d38ad5 100644 --- a/src/ai_company/communication/conflict_resolution/human_strategy.py +++ b/src/ai_company/communication/conflict_resolution/human_strategy.py @@ -1,4 +1,4 @@ -"""Human escalation conflict resolution strategy (DESIGN_SPEC §5.6). +"""Human escalation conflict resolution strategy (see Communication design page). Strategy 3: Escalate to human for resolution. Returns a stub resolution with ``ESCALATED_TO_HUMAN`` outcome — actual human diff --git a/src/ai_company/communication/conflict_resolution/hybrid_strategy.py b/src/ai_company/communication/conflict_resolution/hybrid_strategy.py index eefef67e7b..4248163056 100644 --- a/src/ai_company/communication/conflict_resolution/hybrid_strategy.py +++ b/src/ai_company/communication/conflict_resolution/hybrid_strategy.py @@ -1,4 +1,4 @@ -"""Hybrid conflict resolution strategy (DESIGN_SPEC §5.6). +"""Hybrid conflict resolution strategy (see Communication design page). Strategy 4: Combines automated review with optional human escalation. If a ``JudgeEvaluator`` is provided and returns a clear winner, diff --git a/src/ai_company/communication/conflict_resolution/models.py b/src/ai_company/communication/conflict_resolution/models.py index 1d01d2e0f1..026764fefd 100644 --- a/src/ai_company/communication/conflict_resolution/models.py +++ b/src/ai_company/communication/conflict_resolution/models.py @@ -1,4 +1,4 @@ -"""Conflict resolution domain models (DESIGN_SPEC §5.6). +"""Conflict resolution domain models (see Communication design page). All models are frozen Pydantic v2 with ``NotBlankStr`` identifiers, following the patterns established in ``delegation/models.py``. diff --git a/src/ai_company/communication/conflict_resolution/protocol.py b/src/ai_company/communication/conflict_resolution/protocol.py index 89c9161c91..934f59ebce 100644 --- a/src/ai_company/communication/conflict_resolution/protocol.py +++ b/src/ai_company/communication/conflict_resolution/protocol.py @@ -1,4 +1,4 @@ -"""Conflict resolution protocol interfaces (DESIGN_SPEC §5.6). +"""Conflict resolution protocol interfaces (see Communication design page). Defines the pluggable strategy interface that varies per resolution approach (``resolve`` + ``build_dissent_records``). Detection logic diff --git a/src/ai_company/communication/conflict_resolution/service.py b/src/ai_company/communication/conflict_resolution/service.py index f35bdf39d9..8c9b2c0729 100644 --- a/src/ai_company/communication/conflict_resolution/service.py +++ b/src/ai_company/communication/conflict_resolution/service.py @@ -1,4 +1,4 @@ -"""Conflict resolution service orchestrator (DESIGN_SPEC §5.6). +"""Conflict resolution service orchestrator (see Communication design page). Follows the ``DelegationService`` pattern: ``__slots__``, keyword-only constructor, audit trail list, structured logging. diff --git a/src/ai_company/communication/dispatcher.py b/src/ai_company/communication/dispatcher.py index 64797e353e..3dae9affd8 100644 --- a/src/ai_company/communication/dispatcher.py +++ b/src/ai_company/communication/dispatcher.py @@ -1,6 +1,6 @@ """Message dispatcher — routes incoming messages to registered handlers. -See DESIGN_SPEC Section 5.4. +See the Communication design page. """ import asyncio diff --git a/src/ai_company/communication/enums.py b/src/ai_company/communication/enums.py index 5efdedcec5..50e1ad77d5 100644 --- a/src/ai_company/communication/enums.py +++ b/src/ai_company/communication/enums.py @@ -6,7 +6,7 @@ class MessageType(StrEnum): """Type of inter-agent message. - Maps to the ``type`` field in DESIGN_SPEC Section 5.3. + Maps to the ``type`` field in the Communication design page. """ TASK_UPDATE = "task_update" @@ -25,7 +25,7 @@ class MessagePriority(StrEnum): """Priority level for messages. Separate from :class:`ai_company.core.enums.Priority` which uses - ``"medium"``; message priority uses ``"normal"`` per DESIGN_SPEC 5.3. + ``"medium"``; message priority uses ``"normal"`` per the Communication design page. """ LOW = "low" @@ -65,7 +65,7 @@ class AttachmentType(StrEnum): class CommunicationPattern(StrEnum): """High-level communication pattern for the company. - Maps to DESIGN_SPEC Section 5.1. + Maps to the Communication design page. """ EVENT_DRIVEN = "event_driven" @@ -75,7 +75,7 @@ class CommunicationPattern(StrEnum): class ConflictType(StrEnum): - """Type of inter-agent conflict (DESIGN_SPEC §5.6). + """Type of inter-agent conflict (see Communication design page). Members: ARCHITECTURE: Disagreement on system design choices. @@ -95,7 +95,7 @@ class ConflictType(StrEnum): class ConflictResolutionStrategy(StrEnum): - """Strategy for resolving inter-agent conflicts (DESIGN_SPEC §5.6). + """Strategy for resolving inter-agent conflicts (see Communication design page). Members: AUTHORITY: Resolve by seniority/hierarchy with dissent log. @@ -113,7 +113,7 @@ class ConflictResolutionStrategy(StrEnum): class MessageBusBackend(StrEnum): """Message bus backend implementation. - Maps to DESIGN_SPEC Section 5.4 ``message_bus.backend``. + Maps to the Communication design page ``message_bus.backend``. """ INTERNAL = "internal" diff --git a/src/ai_company/communication/errors.py b/src/ai_company/communication/errors.py index a05d88e00e..1655617ebd 100644 --- a/src/ai_company/communication/errors.py +++ b/src/ai_company/communication/errors.py @@ -1,4 +1,4 @@ -"""Communication error hierarchy (DESIGN_SPEC Sections 5.4, 5.5, 5.6). +"""Communication error hierarchy (see Communication design page). All communication errors carry an immutable context mapping for structured metadata, following the same pattern as ``ToolError``. diff --git a/src/ai_company/communication/handler.py b/src/ai_company/communication/handler.py index 91db7300cb..72c53b4e32 100644 --- a/src/ai_company/communication/handler.py +++ b/src/ai_company/communication/handler.py @@ -1,4 +1,4 @@ -"""Handler protocol, adapter, and registration (DESIGN_SPEC Section 5.4).""" +"""Handler protocol, adapter, and registration (see Communication design page).""" import inspect from collections.abc import Awaitable, Callable diff --git a/src/ai_company/communication/meeting/__init__.py b/src/ai_company/communication/meeting/__init__.py index 36b3cb3e61..457c416e03 100644 --- a/src/ai_company/communication/meeting/__init__.py +++ b/src/ai_company/communication/meeting/__init__.py @@ -1,4 +1,4 @@ -"""Meeting protocol subsystem (DESIGN_SPEC Section 5.7). +"""Meeting protocol subsystem (see Communication design page). Provides pluggable meeting protocol strategies for structured multi-agent conversations: diff --git a/src/ai_company/communication/meeting/config.py b/src/ai_company/communication/meeting/config.py index cd32bf8453..eb3b9c6ff2 100644 --- a/src/ai_company/communication/meeting/config.py +++ b/src/ai_company/communication/meeting/config.py @@ -1,4 +1,4 @@ -"""Meeting protocol configuration models (DESIGN_SPEC Section 5.7).""" +"""Meeting protocol configuration models (see Communication design page).""" from pydantic import BaseModel, ConfigDict, Field diff --git a/src/ai_company/communication/meeting/enums.py b/src/ai_company/communication/meeting/enums.py index d74c0b8de3..78021cd596 100644 --- a/src/ai_company/communication/meeting/enums.py +++ b/src/ai_company/communication/meeting/enums.py @@ -1,4 +1,4 @@ -"""Meeting protocol enumerations (DESIGN_SPEC Section 5.7).""" +"""Meeting protocol enumerations (see Communication design page).""" from enum import StrEnum diff --git a/src/ai_company/communication/meeting/errors.py b/src/ai_company/communication/meeting/errors.py index d5599282c5..60218fda50 100644 --- a/src/ai_company/communication/meeting/errors.py +++ b/src/ai_company/communication/meeting/errors.py @@ -1,4 +1,4 @@ -"""Meeting protocol error hierarchy (DESIGN_SPEC Section 5.7). +"""Meeting protocol error hierarchy (see Communication design page). All meeting errors extend ``CommunicationError`` and carry an immutable context mapping for structured metadata. diff --git a/src/ai_company/communication/meeting/models.py b/src/ai_company/communication/meeting/models.py index 79bc28d86a..44a5646791 100644 --- a/src/ai_company/communication/meeting/models.py +++ b/src/ai_company/communication/meeting/models.py @@ -1,4 +1,4 @@ -"""Meeting protocol domain models (DESIGN_SPEC Section 5.7).""" +"""Meeting protocol domain models (see Communication design page).""" from typing import Self diff --git a/src/ai_company/communication/meeting/orchestrator.py b/src/ai_company/communication/meeting/orchestrator.py index 91ddc82866..fa855e1ae5 100644 --- a/src/ai_company/communication/meeting/orchestrator.py +++ b/src/ai_company/communication/meeting/orchestrator.py @@ -1,4 +1,4 @@ -"""Meeting orchestrator — lifecycle manager (DESIGN_SPEC Section 5.7). +"""Meeting orchestrator — lifecycle manager (see Communication design page). Manages the full meeting lifecycle: validates inputs, selects the configured protocol, executes the meeting, optionally creates tasks diff --git a/src/ai_company/communication/meeting/position_papers.py b/src/ai_company/communication/meeting/position_papers.py index 43384d9560..3a465e4e40 100644 --- a/src/ai_company/communication/meeting/position_papers.py +++ b/src/ai_company/communication/meeting/position_papers.py @@ -1,4 +1,4 @@ -"""Position-papers meeting protocol (DESIGN_SPEC Section 5.7). +"""Position-papers meeting protocol (see Communication design page). Each participant writes an independent position paper in parallel, then a synthesizer combines all papers into decisions and action diff --git a/src/ai_company/communication/meeting/protocol.py b/src/ai_company/communication/meeting/protocol.py index 1c0ea23929..ed931c1f13 100644 --- a/src/ai_company/communication/meeting/protocol.py +++ b/src/ai_company/communication/meeting/protocol.py @@ -1,4 +1,4 @@ -"""Meeting protocol interface (DESIGN_SPEC Section 5.7). +"""Meeting protocol interface (see Communication design page). Defines the ``MeetingProtocol`` protocol, the ``ConflictDetector`` protocol, and the ``AgentCaller`` type alias used to invoke agents diff --git a/src/ai_company/communication/meeting/round_robin.py b/src/ai_company/communication/meeting/round_robin.py index 0cb8377aba..78172e27c0 100644 --- a/src/ai_company/communication/meeting/round_robin.py +++ b/src/ai_company/communication/meeting/round_robin.py @@ -1,4 +1,4 @@ -"""Round-robin meeting protocol (DESIGN_SPEC Section 5.7). +"""Round-robin meeting protocol (see Communication design page). Participants take sequential turns with full transcript context. Each agent sees the entire conversation history when contributing, diff --git a/src/ai_company/communication/meeting/structured_phases.py b/src/ai_company/communication/meeting/structured_phases.py index eb5f24f03f..f32ae89736 100644 --- a/src/ai_company/communication/meeting/structured_phases.py +++ b/src/ai_company/communication/meeting/structured_phases.py @@ -1,4 +1,4 @@ -"""Structured-phases meeting protocol (DESIGN_SPEC Section 5.7). +"""Structured-phases meeting protocol (see Communication design page). A phased approach: agenda broadcast, parallel input gathering, optional conflict-driven discussion, and leader synthesis. The most diff --git a/src/ai_company/communication/message.py b/src/ai_company/communication/message.py index 2a739dd043..2e399fed9e 100644 --- a/src/ai_company/communication/message.py +++ b/src/ai_company/communication/message.py @@ -1,4 +1,4 @@ -"""Message domain models (DESIGN_SPEC Section 5.3).""" +"""Message domain models (see Communication design page).""" from collections import Counter from typing import Self @@ -33,7 +33,7 @@ class Attachment(BaseModel): class MessageMetadata(BaseModel): """Optional metadata carried with a message. - Extends DESIGN_SPEC Section 5.3 metadata with an additional ``extra`` + Extends the Communication design page metadata with an additional ``extra`` field for arbitrary key-value pairs. Attributes: @@ -88,7 +88,7 @@ def _validate_extra(self) -> Self: class Message(BaseModel): """An inter-agent message. - Field schema is based on DESIGN_SPEC Section 5.3 with typed refinements. + Field schema is based on the Communication design page with typed refinements. The ``sender`` field is aliased to ``"from"`` for JSON compatibility with the spec format. diff --git a/src/ai_company/communication/messenger.py b/src/ai_company/communication/messenger.py index f90765e917..0dd735107a 100644 --- a/src/ai_company/communication/messenger.py +++ b/src/ai_company/communication/messenger.py @@ -1,4 +1,4 @@ -"""Per-agent messenger facade over the message bus (DESIGN_SPEC Section 5.4).""" +"""Per-agent messenger facade over the message bus (see Communication design page).""" from datetime import UTC, datetime diff --git a/src/ai_company/communication/subscription.py b/src/ai_company/communication/subscription.py index ac5bf4eee3..7cf09ffbcb 100644 --- a/src/ai_company/communication/subscription.py +++ b/src/ai_company/communication/subscription.py @@ -1,4 +1,4 @@ -"""Subscription and delivery envelope models (DESIGN_SPEC Section 5.4).""" +"""Subscription and delivery envelope models (see Communication design page).""" from pydantic import AwareDatetime, BaseModel, ConfigDict, Field diff --git a/src/ai_company/core/enums.py b/src/ai_company/core/enums.py index 8b9ba004e7..f454d8680d 100644 --- a/src/ai_company/core/enums.py +++ b/src/ai_company/core/enums.py @@ -10,7 +10,7 @@ class SeniorityLevel(StrEnum): cost tier defined in ``ai_company.core.role_catalog.SENIORITY_INFO``. """ - # DESIGN_SPEC §3.2 lists "Intern/Junior" — collapsed to JUNIOR (approved deviation). + # Agents page lists "Intern/Junior" — collapsed to JUNIOR. JUNIOR = "junior" MID = "mid" SENIOR = "senior" @@ -347,7 +347,7 @@ class TaskStructure(StrEnum): """Classification of how a task's subtasks relate to each other. Used by the decomposition engine to determine coordination topology - and execution ordering. See DESIGN_SPEC Section 6.9. + and execution ordering. See the Engine design page. """ SEQUENTIAL = "sequential" @@ -359,7 +359,7 @@ class CoordinationTopology(StrEnum): """Coordination topology for multi-agent task execution. Determines how agents coordinate when executing decomposed subtasks. - See DESIGN_SPEC Section 6.9. + See the Engine design page. """ SAS = "sas" @@ -372,9 +372,9 @@ class CoordinationTopology(StrEnum): class ActionType(StrEnum): """Two-level action type taxonomy for security classification. - Used by autonomy presets (DESIGN_SPEC §12.2), SecOps validation - (§12.3), tiered timeout policies (§12.4), and progressive trust - (§11.3). Values follow a ``category:action`` naming convention. + Used by autonomy presets (see Operations design page), SecOps + validation, tiered timeout policies, and progressive trust. + Values follow a ``category:action`` naming convention. Custom action type strings are also accepted by models that use ``str`` for ``action_type`` fields — these enum members are @@ -460,7 +460,7 @@ class AutonomyLevel(StrEnum): """Autonomy level controlling approval routing for agents. Determines which actions an agent can execute autonomously vs. - which require human or security-agent approval (DESIGN_SPEC §12.2). + which require human or security-agent approval (see Operations design page). """ FULL = "full" @@ -503,7 +503,7 @@ class DowngradeReason(StrEnum): class TimeoutActionType(StrEnum): - """Action to take when an approval item times out (DESIGN_SPEC §12.4).""" + """Action to take when an approval item times out (see Operations design page).""" WAIT = "wait" APPROVE = "approve" diff --git a/src/ai_company/core/project.py b/src/ai_company/core/project.py index 401c69bfe8..24273abcfa 100644 --- a/src/ai_company/core/project.py +++ b/src/ai_company/core/project.py @@ -14,8 +14,8 @@ class Project(BaseModel): """A collection of related tasks with a shared goal, team, and deadline. Projects organize tasks into a coherent unit of work with budget - tracking and team assignment. Per DESIGN_SPEC Section 2.1 glossary - and Section 2.2 entity relationship tree. + tracking and team assignment. Per the Design Overview glossary + and entity relationship tree. Attributes: id: Unique project identifier (e.g. ``"proj-456"``). diff --git a/src/ai_company/core/role_catalog.py b/src/ai_company/core/role_catalog.py index 04260145bf..dac1b63f9a 100644 --- a/src/ai_company/core/role_catalog.py +++ b/src/ai_company/core/role_catalog.py @@ -1,8 +1,7 @@ """Built-in role catalog and seniority information. -Provides the canonical set of built-in roles from DESIGN_SPEC.md -section 3.3 (Role Catalog) and the seniority mapping from -section 3.2 (Seniority & Authority Levels). +Provides the canonical set of built-in roles from the Agents design page +(Role Catalog) and the seniority mapping (Seniority & Authority Levels). """ from ai_company.core.enums import ( diff --git a/src/ai_company/core/task.py b/src/ai_company/core/task.py index 90f3d18761..ab73be8d82 100644 --- a/src/ai_company/core/task.py +++ b/src/ai_company/core/task.py @@ -47,7 +47,7 @@ class Task(BaseModel): Represents a task from creation through completion, with full lifecycle tracking, dependency modeling, and acceptance criteria. - Field schema matches DESIGN_SPEC Section 6.2. + Field schema matches the Engine design page. Attributes: id: Unique task identifier (e.g. ``"task-123"``). diff --git a/src/ai_company/core/task_transitions.py b/src/ai_company/core/task_transitions.py index a5cd66e062..52fd13b34d 100644 --- a/src/ai_company/core/task_transitions.py +++ b/src/ai_company/core/task_transitions.py @@ -1,7 +1,7 @@ """Task lifecycle state machine transitions. Defines the valid state transitions for the task lifecycle, based on -DESIGN_SPEC Sections 6.1 and 6.6, extended with BLOCKED, CANCELLED, +the Engine design page, extended with BLOCKED, CANCELLED, FAILED, and INTERRUPTED transitions for completeness:: CREATED -> ASSIGNED diff --git a/src/ai_company/engine/decomposition/classifier.py b/src/ai_company/engine/decomposition/classifier.py index 146258ce27..c9cffe4434 100644 --- a/src/ai_company/engine/decomposition/classifier.py +++ b/src/ai_company/engine/decomposition/classifier.py @@ -1,7 +1,7 @@ """Task structure classifier. Infers ``TaskStructure`` from task properties using heuristics -based on DESIGN_SPEC Section 6.9. +based on the Engine design page. """ import re diff --git a/src/ai_company/engine/metrics.py b/src/ai_company/engine/metrics.py index 568361205f..b7fd09b3bc 100644 --- a/src/ai_company/engine/metrics.py +++ b/src/ai_company/engine/metrics.py @@ -1,7 +1,7 @@ """Task completion metrics model. Proxy overhead metrics for an agent run, computed from -``AgentRunResult`` data per DESIGN_SPEC §10.5. +``AgentRunResult`` data per the Operations design page. """ from typing import TYPE_CHECKING @@ -15,7 +15,7 @@ class TaskCompletionMetrics(BaseModel): - """Proxy overhead metrics for an agent run (DESIGN_SPEC §10.5). + """Proxy overhead metrics for an agent run (see Operations design page). Computed from ``AgentRunResult`` after execution to surface orchestration overhead indicators (turns, tokens, cost, duration). diff --git a/src/ai_company/engine/recovery.py b/src/ai_company/engine/recovery.py index 1644f44f09..32db5b199c 100644 --- a/src/ai_company/engine/recovery.py +++ b/src/ai_company/engine/recovery.py @@ -6,7 +6,7 @@ status, captures a redacted context snapshot, and reports whether the task can be reassigned (based on retry count vs max retries). -See DESIGN_SPEC Section 6.6 for the full crash recovery design. +See the Crash Recovery section of the Engine design page. """ from typing import Final, Protocol, runtime_checkable diff --git a/src/ai_company/engine/routing/topology_selector.py b/src/ai_company/engine/routing/topology_selector.py index eeba37c3c8..05ed4f8fe1 100644 --- a/src/ai_company/engine/routing/topology_selector.py +++ b/src/ai_company/engine/routing/topology_selector.py @@ -1,6 +1,6 @@ """Topology selection for decomposed tasks. -Implements DESIGN_SPEC Section 6.9 auto-selection heuristics +Implements the Engine design page auto-selection heuristics for coordination topologies. """ @@ -26,8 +26,8 @@ class TopologySelector: Uses explicit overrides when set, otherwise applies heuristic rules based on task structure and artifact count. - Implements the auto-selection heuristics from DESIGN_SPEC - Section 6.9. + Implements the auto-selection heuristics from the Engine design + page. """ __slots__ = ("_config",) diff --git a/src/ai_company/engine/shutdown.py b/src/ai_company/engine/shutdown.py index 2b1168c355..5f102f1e78 100644 --- a/src/ai_company/engine/shutdown.py +++ b/src/ai_company/engine/shutdown.py @@ -1,7 +1,8 @@ """Graceful shutdown strategy and manager. -Implements DESIGN_SPEC §6.7 — cooperative timeout strategy for clean -process shutdown. When SIGINT/SIGTERM is received the framework signals +Implements the Graceful Shutdown section of the Engine design page — +cooperative timeout strategy for clean process shutdown. +When SIGINT/SIGTERM is received the framework signals agents to exit at turn boundaries, waits a grace period, force-cancels stragglers, and runs cleanup callbacks. The *engine* layer is responsible for transitioning tasks to INTERRUPTED (see ``AgentEngine``). diff --git a/src/ai_company/hr/performance/collaboration_protocol.py b/src/ai_company/hr/performance/collaboration_protocol.py index f528c04e5c..fadf098b58 100644 --- a/src/ai_company/hr/performance/collaboration_protocol.py +++ b/src/ai_company/hr/performance/collaboration_protocol.py @@ -1,7 +1,7 @@ """Collaboration scoring strategy protocol. Defines the interface for pluggable collaboration scoring strategies -that evaluate agent collaboration behavior (DESIGN_SPEC §8.3, D3). +that evaluate agent collaboration behavior (see Agents design page, D3). """ from typing import Protocol, runtime_checkable diff --git a/src/ai_company/hr/performance/quality_protocol.py b/src/ai_company/hr/performance/quality_protocol.py index dec904737a..3d2a3d6784 100644 --- a/src/ai_company/hr/performance/quality_protocol.py +++ b/src/ai_company/hr/performance/quality_protocol.py @@ -1,7 +1,7 @@ """Quality scoring strategy protocol. Defines the interface for pluggable quality scoring strategies -that evaluate task completion quality (DESIGN_SPEC §8.3, D2). +that evaluate task completion quality (see Agents design page, D2). """ from typing import Protocol, runtime_checkable diff --git a/src/ai_company/hr/performance/trend_protocol.py b/src/ai_company/hr/performance/trend_protocol.py index 9b0d039b0c..8e7990d12e 100644 --- a/src/ai_company/hr/performance/trend_protocol.py +++ b/src/ai_company/hr/performance/trend_protocol.py @@ -1,7 +1,7 @@ """Trend detection strategy protocol. Defines the interface for pluggable trend detection strategies -that analyze metric time series (DESIGN_SPEC §8.3, D12). +that analyze metric time series (see Agents design page, D12). """ from typing import TYPE_CHECKING, Protocol, runtime_checkable diff --git a/src/ai_company/hr/performance/window_protocol.py b/src/ai_company/hr/performance/window_protocol.py index b33fd2c7cb..2b5a13fc33 100644 --- a/src/ai_company/hr/performance/window_protocol.py +++ b/src/ai_company/hr/performance/window_protocol.py @@ -1,7 +1,7 @@ """Metrics window strategy protocol. Defines the interface for pluggable rolling-window aggregation -strategies (DESIGN_SPEC §8.3, D11). +strategies (see Agents design page, D11). """ from typing import TYPE_CHECKING, Protocol, runtime_checkable diff --git a/src/ai_company/observability/events/conflict.py b/src/ai_company/observability/events/conflict.py index 0882943bf7..6142864a16 100644 --- a/src/ai_company/observability/events/conflict.py +++ b/src/ai_company/observability/events/conflict.py @@ -1,4 +1,4 @@ -"""Conflict resolution event constants (DESIGN_SPEC §5.6).""" +"""Conflict resolution event constants (see Communication design page).""" from typing import Final diff --git a/src/ai_company/persistence/__init__.py b/src/ai_company/persistence/__init__.py index 5caf7142e2..aea0f7652f 100644 --- a/src/ai_company/persistence/__init__.py +++ b/src/ai_company/persistence/__init__.py @@ -1,4 +1,4 @@ -"""Pluggable persistence layer for operational data (DESIGN_SPEC §7.6). +"""Pluggable persistence layer for operational data (see Memory design page). Re-exports the protocol, repository protocols, config models, factory, and error hierarchy so consumers can import from ``ai_company.persistence`` diff --git a/src/ai_company/persistence/sqlite/__init__.py b/src/ai_company/persistence/sqlite/__init__.py index 45bc9979cb..3b5d1f5a9a 100644 --- a/src/ai_company/persistence/sqlite/__init__.py +++ b/src/ai_company/persistence/sqlite/__init__.py @@ -1,4 +1,4 @@ -"""SQLite persistence backend (DESIGN_SPEC §7.6 — initial backend).""" +"""SQLite persistence backend (see Memory design page — initial backend).""" from ai_company.persistence.sqlite.audit_repository import ( SQLiteAuditRepository, diff --git a/src/ai_company/providers/models.py b/src/ai_company/providers/models.py index 49b162d7ea..49a7dff49e 100644 --- a/src/ai_company/providers/models.py +++ b/src/ai_company/providers/models.py @@ -75,7 +75,7 @@ class ToolDefinition(BaseModel): arguments at the execution boundary, so no additional caller-side copying is needed for standard tool/provider workflows. Direct consumers outside these paths should deep-copy if they intend to - modify the schema. See DESIGN_SPEC.md section 15.5. + modify the schema. See the tech stack page (docs/architecture/tech-stack.md). Attributes: name: Tool name. @@ -101,7 +101,7 @@ class ToolCall(BaseModel): ``frozen=True`` — field reassignment is prevented but nested contents can still be mutated in place. The ``ToolInvoker`` deep-copies arguments before passing them to tool - implementations. See DESIGN_SPEC.md section 15.5. + implementations. See the tech stack page (docs/architecture/tech-stack.md). Attributes: id: Provider-assigned tool call identifier. diff --git a/src/ai_company/security/audit.py b/src/ai_company/security/audit.py index 6f1eb671b0..1c5ca02896 100644 --- a/src/ai_company/security/audit.py +++ b/src/ai_company/security/audit.py @@ -23,7 +23,7 @@ class AuditLog: single event loop. When ``max_entries`` is exceeded, the oldest entries are evicted with a warning. - Future: backed by ``PersistenceBackend`` (see DESIGN_SPEC §7.6). + Future: backed by ``PersistenceBackend`` (see Memory design page). """ def __init__(self, *, max_entries: int = 100_000) -> None: diff --git a/src/ai_company/security/autonomy/models.py b/src/ai_company/security/autonomy/models.py index e74b6ab45b..bf832030ce 100644 --- a/src/ai_company/security/autonomy/models.py +++ b/src/ai_company/security/autonomy/models.py @@ -69,7 +69,7 @@ def _validate_disjoint(self) -> Self: human_approval=(), security_agent=False, ), - # SEMI extends DESIGN_SPEC §12.2 with vcs and db:query auto-approve + # SEMI extends the Operations design page with vcs and db:query auto-approve # (safe read/commit operations) and broader human_approval categories. AutonomyLevel.SEMI: AutonomyPreset( level=AutonomyLevel.SEMI, diff --git a/src/ai_company/security/autonomy/protocol.py b/src/ai_company/security/autonomy/protocol.py index 57356ebf82..a68b11091e 100644 --- a/src/ai_company/security/autonomy/protocol.py +++ b/src/ai_company/security/autonomy/protocol.py @@ -1,4 +1,4 @@ -"""Autonomy change strategy protocol (DESIGN_SPEC §12.2 D7).""" +"""Autonomy change strategy protocol (see Operations design page, D7).""" from typing import Protocol, runtime_checkable diff --git a/src/ai_company/security/timeout/protocol.py b/src/ai_company/security/timeout/protocol.py index bc3568613d..6f9f5691e3 100644 --- a/src/ai_company/security/timeout/protocol.py +++ b/src/ai_company/security/timeout/protocol.py @@ -9,7 +9,7 @@ @runtime_checkable class TimeoutPolicy(Protocol): - """Protocol for approval timeout policies (DESIGN_SPEC §12.4). + """Protocol for approval timeout policies (see Operations design page). Implementations determine what happens when a human does not respond to an approval request within a configured timeframe.