Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 12 additions & 3 deletions DESIGN_SPEC.md
Original file line number Diff line number Diff line change
Expand Up @@ -813,7 +813,7 @@ Tasks can be assigned through multiple strategies:

The agent execution loop defines how an agent processes a task from start to finish. The framework provides multiple configurable loop architectures behind an `ExecutionLoop` protocol, making the system extensible. The default can vary by task complexity, and is configurable per agent or role.

> **MVP: ReAct only (Loop 1).** Plan-and-Execute and Hybrid are M4+. Auto-selection is M4+.
> **Current state (M3):** ReAct (Loop 1) and Plan-and-Execute (Loop 2) are implemented. Hybrid loop and auto-selection are M4+.

#### ExecutionLoop Protocol

Expand Down Expand Up @@ -1543,7 +1543,7 @@ budget:

### 10.5 LLM Call Analytics

> **MVP: Proxy metrics only (M3).** Call categorization is M4. Full analytics layer is M5+.
> **Current state:** Proxy metrics (M3) and call categorization + coordination metric data models (M4 models, brought forward) are implemented. Runtime collection pipeline and full analytics layer are M5+.

Every LLM provider call is tracked with comprehensive metadata for financial reporting, debugging, and orchestration overhead analysis. The analytics system builds incrementally across milestones.

Expand All @@ -1562,6 +1562,8 @@ These metrics are captured in `TaskCompletionMetrics` (in `engine/metrics.py`),

#### M4: Call Categorization + Orchestration Ratio

> **Current state:** Data models (`LLMCallCategory`, `CategoryBreakdown`, `OrchestrationRatio`, `CostRecord.call_category`) and query methods (`CostTracker.get_category_breakdown`, `get_orchestration_ratio`) are implemented. Runtime categorization logic (automatic tagging of calls during multi-agent execution) is deferred to M4 runtime integration.

When multi-agent coordination exists, each `CostRecord` is tagged with a **call category**:

| Category | Description | Examples |
Expand Down Expand Up @@ -2308,6 +2310,9 @@ ai-company/
│ │ ├── loop_protocol.py # ExecutionLoop protocol + result models
│ │ ├── metrics.py # TaskCompletionMetrics proxy overhead model
│ │ ├── react_loop.py # ReAct loop implementation
│ │ ├── plan_models.py # Plan step, plan, and plan-execute config models
│ │ ├── plan_execute_loop.py # Plan-and-Execute loop implementation
│ │ ├── loop_helpers.py # Shared stateless helpers for all loop implementations
│ │ ├── recovery.py # Crash recovery strategies (RecoveryStrategy protocol)
│ │ ├── cost_recording.py # Per-turn cost recording helpers
│ │ ├── run_result.py # AgentRunResult outcome model
Expand Down Expand Up @@ -2412,6 +2417,10 @@ ai-company/
│ ├── budget/ # Cost management
│ │ ├── config.py # Budget configuration models
│ │ ├── cost_record.py # CostRecord model (frozen)
│ │ ├── call_category.py # LLM call category enums (productive, coordination, system)
│ │ ├── category_analytics.py # Per-category cost breakdown + orchestration ratio
│ │ ├── coordination_config.py # Coordination metrics config models
│ │ ├── coordination_metrics.py # Five coordination metric models + computation
│ │ ├── tracker.py # CostTracker service (records + queries)
│ │ ├── spending_summary.py # _SpendingTotals base + spending summary models
│ │ ├── hierarchy.py # BudgetHierarchy, BudgetConfig
Expand Down Expand Up @@ -2485,7 +2494,7 @@ These conventions were established during the M0–M2+ review cycle. **Adopted**
| **Tool sandboxing** | Adopted (M3, incremental) | File system tools use in-process `PathValidator` for workspace-scoped path validation (symlink resolution + containment check). `BaseFileSystemTool` ABC provides shared `ToolCategory.FILE_SYSTEM` and `PathValidator` integration — all file system tools extend this base. `SandboxBackend` protocol with `SubprocessSandbox` implemented — git tools accept optional `SandboxBackend` injection and delegate subprocess management to it (env filtering, workspace enforcement, timeout + process-group kill). `DockerSandbox` planned for code_runner, terminal, web, and database tools. `K8sSandbox` planned for future container deployments. Config-driven per-category backend selection planned for engine wiring. | File system tools use defence-in-depth path validation; subprocess sandbox provides lightweight isolation for git tools; heavier Docker/K8s isolation reserved for higher-risk tool categories (code execution, network). See §11.1.2. |
| **Crash recovery** | Adopted (M3) | Pluggable `RecoveryStrategy` protocol. M3: `FailAndReassignStrategy` (catch at engine boundary, log snapshot, mark FAILED / eligible for reassignment). M4/M5: `CheckpointStrategy` (persist `AgentContext` per turn, resume from last checkpoint). | Immutable `model_copy` pattern makes checkpoint serialization trivial to add later. Fail-and-reassign is sufficient for short MVP tasks. See §6.6. |
| **Agent behavior testing** | Planned (M3) | Scripted `FakeProvider` for unit tests (deterministic turn sequences); behavioral outcome assertions for integration tests (task completed, tools called, cost within budget). | Leverages existing `FakeProvider` and `CompletionResponseFactory` fixtures. Precise engine testing without brittle response-matching at integration level. |
| **LLM call analytics** | Planned (incremental) | M3: proxy metrics (`turns_per_task`, `tokens_per_task`). M4: call categorization (`productive`, `coordination`, `system`) + orchestration ratio. M5+: full analytics (retry tracking, latency, cache hits, per-provider comparison). | Append-only, never blocks execution. Builds on existing `CostRecord` infrastructure. Detects orchestration overhead early. See §10.5. |
| **LLM call analytics** | Adopted (incremental) | M3: proxy metrics (`turns_per_task`, `tokens_per_task`) — adopted. M4 data models: call categorization (`productive`, `coordination`, `system`), category analytics, coordination metrics, orchestration ratio — adopted. M4 runtime collection pipeline and M5+ full analytics: planned. | Append-only, never blocks execution. Builds on existing `CostRecord` infrastructure. Detects orchestration overhead early. See §10.5. |
| **State coordination** | Planned (M4) | Centralized single-writer: `TaskEngine` owns all task/project mutations via `asyncio.Queue`. Agents submit requests, engine applies `model_copy(update=...)` sequentially and publishes snapshots. `version: int` field on state models for future optimistic concurrency if multi-process scaling is needed. | Prevents lost updates by design. Trivial in single-threaded asyncio (no locks). Perfect audit trail. Industry consensus: MetaGPT, CrewAI, AutoGen all use prevention-by-design, not conflict resolution. See §6.8 State Coordination table. |
| **Workspace isolation** | Planned (M4) | Pluggable `WorkspaceIsolationStrategy` protocol. Default: planner + git worktrees. Each agent works in an isolated worktree; sequential merge on completion. Textual conflicts detected by git; semantic conflicts reviewed by agent or human. | Industry standard (Codex, Cursor, Claude Code, VS Code). Maximum parallelism. Leverages mature git infrastructure. See §6.8. |
| **Graceful shutdown** | Adopted (M3) | Pluggable `ShutdownStrategy` protocol. Default: cooperative with 30s timeout. Agents check shutdown event at turn boundaries. Force-cancel after timeout. `INTERRUPTED` status for force-cancelled tasks. M4/M5: upgrade to checkpoint-and-stop. | Cross-platform (Windows `signal.signal()` fallback). Bounded shutdown time. Mirrors cooperative shutdown in §6.7. |
Expand Down
32 changes: 32 additions & 0 deletions src/ai_company/budget/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,28 @@
DESIGN_SPEC Section 10.
"""

from ai_company.budget.call_category import LLMCallCategory, OrchestrationAlertLevel
from ai_company.budget.category_analytics import CategoryBreakdown, OrchestrationRatio
from ai_company.budget.config import (
AutoDowngradeConfig,
BudgetAlertConfig,
BudgetConfig,
)
from ai_company.budget.coordination_config import (
CoordinationMetricName,
CoordinationMetricsConfig,
ErrorCategory,
ErrorTaxonomyConfig,
OrchestrationAlertThresholds,
)
from ai_company.budget.coordination_metrics import (
CoordinationEfficiency,
CoordinationMetrics,
CoordinationOverhead,
ErrorAmplification,
MessageDensity,
RedundancyRate,
)
from ai_company.budget.cost_record import CostRecord
from ai_company.budget.enums import BudgetAlertLevel
from ai_company.budget.hierarchy import (
Expand All @@ -32,11 +49,26 @@
"BudgetAlertLevel",
"BudgetConfig",
"BudgetHierarchy",
"CategoryBreakdown",
"CoordinationEfficiency",
"CoordinationMetricName",
"CoordinationMetrics",
"CoordinationMetricsConfig",
"CoordinationOverhead",
"CostRecord",
"CostTracker",
"DepartmentBudget",
"DepartmentSpending",
"ErrorAmplification",
"ErrorCategory",
"ErrorTaxonomyConfig",
"LLMCallCategory",
"MessageDensity",
"OrchestrationAlertLevel",
"OrchestrationAlertThresholds",
"OrchestrationRatio",
"PeriodSpending",
"RedundancyRate",
"SpendingSummary",
"TeamBudget",
]
45 changes: 45 additions & 0 deletions src/ai_company/budget/call_category.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
"""LLM call categorization enums.

Categorizes each LLM API call by its purpose (productive task work,
inter-agent coordination, or framework overhead) and defines alert
levels for orchestration overhead ratio monitoring.
"""

from enum import StrEnum


class LLMCallCategory(StrEnum):
"""Purpose category for an LLM API call.

Used to distinguish direct task work from coordination overhead,
enabling data-driven tuning of multi-agent orchestration.
"""

PRODUCTIVE = "productive"
"""Direct task work — reasoning, code generation, analysis."""

COORDINATION = "coordination"
"""Inter-agent communication — delegation, status updates, handoffs."""

SYSTEM = "system"
"""Framework overhead — planning, re-planning, self-evaluation."""


class OrchestrationAlertLevel(StrEnum):
"""Alert levels for orchestration overhead ratio.

Separate from :class:`~ai_company.budget.enums.BudgetAlertLevel`
because the metric and thresholds are fundamentally different.
"""

NORMAL = "normal"
"""Below the info threshold."""

INFO = "info"
"""At or above 30% orchestration ratio (default threshold)."""

WARNING = "warning"
"""At or above 50% orchestration ratio (default threshold)."""

CRITICAL = "critical"
"""At or above 70% orchestration ratio (default threshold)."""
Loading