2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -48,7 +48,7 @@ src/ai_company/
communication/ # Inter-agent message bus and channels
config/ # YAML company config loading and validation
core/ # Shared domain models and base classes
engine/ # Agent orchestration, execution loops, and task lifecycle
engine/ # Agent orchestration, execution loops, task lifecycle, recovery, and shutdown
memory/ # Persistent agent memory (memory layer TBD)
observability/ # Structured logging, correlation tracking, log sinks
providers/ # LLM provider abstraction (LiteLLM adapter)
20 changes: 14 additions & 6 deletions DESIGN_SPEC.md
@@ -710,17 +710,24 @@ structured_phases:
│ │ COMPLETED │
│ └────────────┘
│ blocked cancelled
│ blocked cancelled (from ASSIGNED or IN_PROGRESS)
┌─────▼─────┐ ┌────────────┐
│ BLOCKED │ │ CANCELLED │
│ BLOCKED │ │ CANCELLED │ ◀── ASSIGNED / IN_PROGRESS
└─────┬─────┘ └────────────┘
│ unblocked (terminal)
└──▶ ASSIGNED

shutdown signal:
┌─────────────┐
│ INTERRUPTED │──── reassign on restart ──▶ ASSIGNED
└─────────────┘
```

> **Non-terminal states:** BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and CANCELLED are terminal states with no outgoing transitions.
> **Non-terminal states:** BLOCKED, FAILED, and INTERRUPTED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6), INTERRUPTED returns to ASSIGNED on restart (see §6.7). COMPLETED and CANCELLED are terminal states with no outgoing transitions.
>
> **Transitions into FAILED:** Both `ASSIGNED → FAILED` (early setup failures) and `IN_PROGRESS → FAILED` (runtime crashes) are valid. `FAILED → ASSIGNED` enables reassignment when `retry_count < max_retries`.
>
> **Transitions into INTERRUPTED:** Both `ASSIGNED → INTERRUPTED` and `IN_PROGRESS → INTERRUPTED` are valid (graceful shutdown can occur at any active phase). `INTERRUPTED → ASSIGNED` enables reassignment on restart.

> **Runtime wrapper (M3):** During execution, `Task` is wrapped by `TaskExecution` (in `engine/task_execution.py`). `TaskExecution` is a frozen Pydantic model that tracks status transitions via `model_copy(update=...)`, accumulates `TokenUsage` cost, and records a `StatusTransition` audit trail. The original `Task` is preserved unchanged; `to_task_snapshot()` produces a `Task` copy with the current execution status for persistence.
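
The copy-on-update pattern in the runtime-wrapper note can be sketched with the standard library alone. This is a stand-in, not the project's `TaskExecution`: it uses a frozen `dataclass` with `dataclasses.replace` instead of a frozen Pydantic model with `model_copy(update=...)`, and the names `Status`, `Transition`, `ExecutionSketch`, and `with_status` are hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Status(Enum):
    """Trimmed set of statuses, redeclared so the sketch is self-contained."""
    ASSIGNED = auto()
    IN_PROGRESS = auto()
    INTERRUPTED = auto()


@dataclass(frozen=True)
class Transition:
    """One audit-trail entry (hypothetical stand-in for StatusTransition)."""
    from_status: Status
    to_status: Status
    reason: str


@dataclass(frozen=True)
class ExecutionSketch:
    """Immutable execution state; every change yields a new instance."""
    status: Status
    transitions: tuple[Transition, ...] = ()

    def with_status(self, target: Status, reason: str) -> "ExecutionSketch":
        entry = Transition(self.status, target, reason)
        # replace() copies the frozen instance with the given fields swapped,
        # analogous to model_copy(update=...) on a frozen Pydantic model.
        return replace(self, status=target, transitions=(*self.transitions, entry))


before = ExecutionSketch(status=Status.ASSIGNED)
running = before.with_status(Status.IN_PROGRESS, "agent started")
stopped = running.with_status(Status.INTERRUPTED, "shutdown signal")
```

Because each step returns a new object, `before` still reports `ASSIGNED` after the later transitions, which is the property that lets the original `Task` stay unchanged while the wrapper accumulates history.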

@@ -1026,7 +1033,7 @@ When the process receives SIGTERM/SIGINT (user Ctrl+C, Docker stop, systemd shut

#### Strategy 1: Cooperative with Timeout (Default / MVP)

The engine sets a shutdown event, stops accepting new tasks, and gives in-flight agents a grace period to finish their current turn. Agents check the shutdown event at turn boundaries (between LLM calls, before tool invocations) and exit cooperatively. After the grace period, remaining agents are force-cancelled and their tasks marked `INTERRUPTED`.
The engine sets a shutdown event, stops accepting new tasks, and gives in-flight agents a grace period to finish their current turn. Agents check the shutdown event at turn boundaries (between LLM calls, before tool invocations) and exit cooperatively. After the grace period, remaining agents are force-cancelled. **All tasks terminated by shutdown — whether they exited cooperatively or were force-cancelled — are marked `INTERRUPTED`** by the engine layer.

```yaml
graceful_shutdown:
@@ -1043,7 +1050,7 @@ On shutdown signal:
4. Force-cancel remaining agents (`task.cancel()`) — tasks transition to `INTERRUPTED`
5. Cleanup phase (`cleanup_seconds`): persist cost records, close provider connections, flush logs
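
The cooperative-with-timeout sequence can be sketched in plain `asyncio`. This is a minimal illustration under stated assumptions, not the contents of `engine/shutdown.py`: the functions `agent` and `shut_down` and the very short timings are hypothetical, and the cleanup phase is omitted. Per the spec, both outcomes would be recorded as `INTERRUPTED`; the sketch only reports how each task stopped.

```python
import asyncio


async def agent(shutdown: asyncio.Event, turns: int, turn_seconds: float) -> None:
    """Run up to `turns` turns, checking the shutdown event at each turn boundary."""
    for _ in range(turns):
        if shutdown.is_set():
            return  # cooperative exit between turns
        await asyncio.sleep(turn_seconds)  # stands in for an LLM call or tool invocation


async def shut_down(
    tasks: dict[str, asyncio.Task],
    shutdown: asyncio.Event,
    grace_seconds: float,
) -> dict[str, str]:
    """Signal shutdown, wait out the grace period, then force-cancel stragglers."""
    shutdown.set()
    _done, pending = await asyncio.wait(list(tasks.values()), timeout=grace_seconds)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind; return_exceptions keeps CancelledError contained.
    await asyncio.gather(*pending, return_exceptions=True)
    return {
        name: "force-cancelled" if task in pending else "cooperative"
        for name, task in tasks.items()
    }


async def main() -> dict[str, str]:
    shutdown = asyncio.Event()
    tasks = {
        # Short turns: reaches a turn boundary well inside the grace period.
        "fast": asyncio.create_task(agent(shutdown, turns=1000, turn_seconds=0.01)),
        # One long turn: still busy when the grace period ends.
        "slow": asyncio.create_task(agent(shutdown, turns=1, turn_seconds=10.0)),
    }
    await asyncio.sleep(0.02)  # let both agents start a turn
    return await shut_down(tasks, shutdown, grace_seconds=0.5)


outcome = asyncio.run(main())
```

The grace period bounds total shutdown time: an agent stuck in a long call cannot delay exit beyond `grace_seconds` plus whatever cleanup budget is configured.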

> **Planned non-terminal status:** `INTERRUPTED` will be introduced as a new `TaskStatus` variant (and the task status transition map updated) when graceful shutdown is implemented. Unlike `FAILED` (eligible for automatic reassignment) or `CANCELLED` (terminal), `INTERRUPTED` indicates the task was stopped due to process shutdown and is eligible for manual or automatic reassignment on restart.
> **Non-terminal status (implemented in M3):** `INTERRUPTED` is a `TaskStatus` variant. Unlike `FAILED` (eligible for automatic reassignment) or `CANCELLED` (terminal), `INTERRUPTED` indicates the task was stopped due to process shutdown — regardless of whether the agent exited cooperatively or was force-cancelled — and is eligible for manual or automatic reassignment on restart. Valid transitions: `ASSIGNED → INTERRUPTED`, `IN_PROGRESS → INTERRUPTED`, `INTERRUPTED → ASSIGNED` (reassignment on restart). See the updated §6.1 lifecycle diagram.
>
> **Windows compatibility:** `loop.add_signal_handler()` is not supported on Windows. The implementation uses `signal.signal()` as a fallback. SIGINT (Ctrl+C) works cross-platform; SIGTERM on Windows requires `os.kill()`.
>
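
The Windows fallback described in the note can be sketched as follows. This is an assumption-laden illustration rather than the project's implementation: `install_shutdown_handlers` is a hypothetical name, and the sketch must run on the main thread because both registration mechanisms require it.

```python
import asyncio
import signal
from typing import Callable


def install_shutdown_handlers(
    loop: asyncio.AbstractEventLoop,
    on_signal: Callable[[], None],
) -> str:
    """Register on_signal for SIGINT and SIGTERM; return the mechanism used."""
    signals = (signal.SIGINT, signal.SIGTERM)
    try:
        for sig in signals:
            loop.add_signal_handler(sig, on_signal)
        return "loop.add_signal_handler"
    except NotImplementedError:
        # Event loops without Unix signal support (e.g. on Windows) land here.
        def handler(signum: int, frame: object) -> None:
            # Hand the callback to the loop instead of running it in the handler.
            loop.call_soon_threadsafe(on_signal)

        for sig in signals:
            signal.signal(sig, handler)
        return "signal.signal"


async def main() -> tuple[str, bool]:
    loop = asyncio.get_running_loop()
    shutdown = asyncio.Event()
    mechanism = install_shutdown_handlers(loop, shutdown.set)
    signal.raise_signal(signal.SIGINT)  # simulate Ctrl+C
    await asyncio.wait_for(shutdown.wait(), timeout=1.0)
    return mechanism, shutdown.is_set()


mechanism, signalled = asyncio.run(main())
```

Routing the fallback through `call_soon_threadsafe` keeps the handler itself trivial and lets the event be set from inside the loop, the same place the `add_signal_handler` path would set it.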
@@ -2304,6 +2311,7 @@ ai-company/
│ │ ├── cost_recording.py # Per-turn cost recording helpers
│ │ ├── run_result.py # AgentRunResult outcome model
│ │ ├── agent_engine.py # Agent execution engine
│ │ ├── shutdown.py # Graceful shutdown strategy & manager
│ │ ├── task_engine.py # Task routing & scheduling (M3-M4)
│ │ ├── workflow_engine.py # Workflow orchestration (M4)
│ │ ├── meeting_engine.py # Meeting coordination (M4)
@@ -2474,7 +2482,7 @@ These conventions were established during the M0–M2+ review cycle. **Adopted**
| **LLM call analytics** | Planned (incremental) | M3: proxy metrics (`turns_per_task`, `tokens_per_task`). M4: call categorization (`productive`, `coordination`, `system`) + orchestration ratio. M5+: full analytics (retry tracking, latency, cache hits, per-provider comparison). | Append-only, never blocks execution. Builds on existing `CostRecord` infrastructure. Detects orchestration overhead early. See §10.5. |
| **State coordination** | Planned (M4) | Centralized single-writer: `TaskEngine` owns all task/project mutations via `asyncio.Queue`. Agents submit requests, engine applies `model_copy(update=...)` sequentially and publishes snapshots. `version: int` field on state models for future optimistic concurrency if multi-process scaling is needed. | Prevents lost updates by design. Trivial in single-threaded asyncio (no locks). Perfect audit trail. Industry consensus: MetaGPT, CrewAI, AutoGen all use prevention-by-design, not conflict resolution. See §6.8 State Coordination table. |
| **Workspace isolation** | Planned (M4) | Pluggable `WorkspaceIsolationStrategy` protocol. Default: planner + git worktrees. Each agent works in an isolated worktree; sequential merge on completion. Textual conflicts detected by git; semantic conflicts reviewed by agent or human. | Industry standard (Codex, Cursor, Claude Code, VS Code). Maximum parallelism. Leverages mature git infrastructure. See §6.8. |
| **Graceful shutdown** | Planned (M3) | Pluggable `ShutdownStrategy` protocol. Default: cooperative with 30s timeout. Agents check shutdown event at turn boundaries. Force-cancel after timeout. `INTERRUPTED` status for force-cancelled tasks. M4/M5: upgrade to checkpoint-and-stop. | Cross-platform (Windows `signal.signal()` fallback). Bounded shutdown time. Mirrors cooperative shutdown in §6.7. |
| **Graceful shutdown** | Adopted (M3) | Pluggable `ShutdownStrategy` protocol. Default: cooperative with 30s timeout. Agents check shutdown event at turn boundaries. Force-cancel after timeout. `INTERRUPTED` status for all tasks stopped by shutdown (cooperative exit or force-cancel). M4/M5: upgrade to checkpoint-and-stop. | Cross-platform (Windows `signal.signal()` fallback). Bounded shutdown time. Mirrors cooperative shutdown in §6.7. |

---

3 changes: 3 additions & 0 deletions src/ai_company/config/__init__.py
@@ -10,6 +10,7 @@
default_config_dict
RootConfig
AgentConfig
GracefulShutdownConfig
ProviderConfig
ProviderModelConfig
RoutingConfig
@@ -37,6 +38,7 @@
)
from ai_company.config.schema import (
AgentConfig,
GracefulShutdownConfig,
ProviderConfig,
ProviderModelConfig,
RootConfig,
@@ -51,6 +53,7 @@
"ConfigLocation",
"ConfigParseError",
"ConfigValidationError",
"GracefulShutdownConfig",
"ProviderConfig",
"ProviderModelConfig",
"RootConfig",
1 change: 1 addition & 0 deletions src/ai_company/config/defaults.py
@@ -25,4 +25,5 @@ def default_config_dict() -> dict[str, Any]:
"providers": {},
"routing": {},
"logging": None,
"graceful_shutdown": {},
}
36 changes: 36 additions & 0 deletions src/ai_company/config/schema.py
@@ -321,6 +321,37 @@ class AgentConfig(BaseModel):
)


class GracefulShutdownConfig(BaseModel):
"""Configuration for graceful shutdown behaviour.

Attributes:
strategy: Shutdown strategy name (e.g. ``"cooperative_timeout"``).
grace_seconds: Seconds to wait for cooperative agent exit
before force-cancelling.
cleanup_seconds: Seconds allowed for cleanup callbacks
(persist costs, close connections, flush logs).
"""

model_config = ConfigDict(frozen=True, allow_inf_nan=False)

strategy: NotBlankStr = Field(
default="cooperative_timeout",
description="Shutdown strategy name",
)
grace_seconds: float = Field(
default=30.0,
gt=0,
le=300,
description="Seconds to wait for cooperative agent exit",
)
cleanup_seconds: float = Field(
default=5.0,
gt=0,
le=60,
description="Seconds allowed for cleanup callbacks",
)


class RootConfig(BaseModel):
"""Root company configuration — the top-level validation target.

@@ -339,6 +370,7 @@ class RootConfig(BaseModel):
providers: LLM provider configurations keyed by provider name.
routing: Model routing configuration.
logging: Logging configuration (``None`` to use platform defaults).
graceful_shutdown: Graceful shutdown configuration.
"""

model_config = ConfigDict(frozen=True)
@@ -386,6 +418,10 @@ class RootConfig(BaseModel):
default=None,
description="Logging configuration",
)
graceful_shutdown: GracefulShutdownConfig = Field(
default_factory=GracefulShutdownConfig,
description="Graceful shutdown configuration",
)

@model_validator(mode="after")
def _validate_unique_agent_names(self) -> Self:
8 changes: 5 additions & 3 deletions src/ai_company/core/enums.py
@@ -127,13 +127,14 @@ class TaskStatus(StrEnum):
Summary for quick reference:

CREATED -> ASSIGNED
ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED | FAILED
IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED | FAILED
ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED | FAILED | INTERRUPTED
IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED | FAILED | INTERRUPTED
IN_REVIEW -> COMPLETED | IN_PROGRESS (rework) | BLOCKED | CANCELLED
BLOCKED -> ASSIGNED (unblocked)
FAILED -> ASSIGNED (reassignment for retry)
INTERRUPTED -> ASSIGNED (reassignment on restart)
COMPLETED and CANCELLED are terminal states.
FAILED is non-terminal (can be reassigned).
FAILED and INTERRUPTED are non-terminal (can be reassigned).
"""

CREATED = "created"
@@ -143,6 +144,7 @@ class TaskStatus(StrEnum):
COMPLETED = "completed"
BLOCKED = "blocked"
FAILED = "failed"
INTERRUPTED = "interrupted"
CANCELLED = "cancelled"


14 changes: 9 additions & 5 deletions src/ai_company/core/task_transitions.py
@@ -1,18 +1,19 @@
"""Task lifecycle state machine transitions.

Defines the valid state transitions for the task lifecycle, based on
DESIGN_SPEC Sections 6.1 and 6.6, extended with BLOCKED, CANCELLED, and
FAILED transitions for completeness::
DESIGN_SPEC Sections 6.1 and 6.6, extended with BLOCKED, CANCELLED,
FAILED, and INTERRUPTED transitions for completeness::

CREATED -> ASSIGNED
ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED | FAILED
IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED | FAILED
ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED | FAILED | INTERRUPTED
IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED | FAILED | INTERRUPTED
IN_REVIEW -> COMPLETED | IN_PROGRESS (rework) | BLOCKED | CANCELLED
BLOCKED -> ASSIGNED (unblocked)
FAILED -> ASSIGNED (reassignment for retry)
INTERRUPTED -> ASSIGNED (reassignment on restart)

COMPLETED and CANCELLED are terminal states with no outgoing
transitions. FAILED is non-terminal (can be reassigned).
transitions. FAILED and INTERRUPTED are non-terminal (can be reassigned).
"""

from ai_company.core.enums import TaskStatus
@@ -32,6 +33,7 @@
TaskStatus.BLOCKED,
TaskStatus.CANCELLED,
TaskStatus.FAILED,
TaskStatus.INTERRUPTED,
}
),
TaskStatus.IN_PROGRESS: frozenset(
@@ -40,6 +42,7 @@
TaskStatus.BLOCKED,
TaskStatus.CANCELLED,
TaskStatus.FAILED,
TaskStatus.INTERRUPTED,
}
),
TaskStatus.IN_REVIEW: frozenset(
@@ -52,6 +55,7 @@
),
TaskStatus.BLOCKED: frozenset({TaskStatus.ASSIGNED}),
TaskStatus.FAILED: frozenset({TaskStatus.ASSIGNED}), # reassignment
TaskStatus.INTERRUPTED: frozenset({TaskStatus.ASSIGNED}), # reassignment on restart
TaskStatus.COMPLETED: frozenset(), # terminal
TaskStatus.CANCELLED: frozenset(), # terminal
}
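
A transition map of this shape is typically consulted with a small membership check. The sketch below is an illustration only: it redeclares a trimmed enum and a partial map (omitting `BLOCKED`, `FAILED`, and the review states) so it runs on its own, and the names `TRANSITIONS_SKETCH` and `can_transition` are hypothetical rather than taken from `task_transitions.py`.

```python
from enum import Enum, auto


class TaskStatus(Enum):
    """Trimmed stand-in for the project's TaskStatus enum."""
    ASSIGNED = auto()
    IN_PROGRESS = auto()
    INTERRUPTED = auto()
    COMPLETED = auto()
    CANCELLED = auto()


# Partial map: only the edges needed to show the INTERRUPTED round trip.
TRANSITIONS_SKETCH: dict[TaskStatus, frozenset[TaskStatus]] = {
    TaskStatus.ASSIGNED: frozenset(
        {TaskStatus.IN_PROGRESS, TaskStatus.CANCELLED, TaskStatus.INTERRUPTED}
    ),
    TaskStatus.IN_PROGRESS: frozenset(
        {TaskStatus.CANCELLED, TaskStatus.INTERRUPTED}
    ),
    TaskStatus.INTERRUPTED: frozenset({TaskStatus.ASSIGNED}),  # reassignment on restart
    TaskStatus.COMPLETED: frozenset(),  # terminal
    TaskStatus.CANCELLED: frozenset(),  # terminal
}


def can_transition(current: TaskStatus, target: TaskStatus) -> bool:
    """Return True if the map allows moving from `current` to `target`."""
    return target in TRANSITIONS_SKETCH.get(current, frozenset())


interrupted_ok = can_transition(TaskStatus.IN_PROGRESS, TaskStatus.INTERRUPTED)
restart_ok = can_transition(TaskStatus.INTERRUPTED, TaskStatus.ASSIGNED)
resume_directly = can_transition(TaskStatus.INTERRUPTED, TaskStatus.IN_PROGRESS)
revive_cancelled = can_transition(TaskStatus.CANCELLED, TaskStatus.ASSIGNED)
```

An interrupted task has exactly one way out, back through `ASSIGNED`, so it cannot resume mid-execution without being reassigned first.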
14 changes: 14 additions & 0 deletions src/ai_company/engine/__init__.py
@@ -23,6 +23,7 @@
BudgetChecker,
ExecutionLoop,
ExecutionResult,
ShutdownChecker,
TerminationReason,
TurnRecord,
)
@@ -40,6 +41,13 @@
RecoveryStrategy,
)
from ai_company.engine.run_result import AgentRunResult
from ai_company.engine.shutdown import (
CleanupCallback,
CooperativeTimeoutStrategy,
ShutdownManager,
ShutdownResult,
ShutdownStrategy,
)
from ai_company.engine.task_execution import StatusTransition, TaskExecution
from ai_company.providers.models import ZERO_TOKEN_USAGE, add_token_usage

@@ -52,6 +60,8 @@
"AgentRunResult",
"BudgetChecker",
"BudgetExhaustedError",
"CleanupCallback",
"CooperativeTimeoutStrategy",
"DefaultTokenEstimator",
"EngineError",
"ExecutionLoop",
@@ -65,6 +75,10 @@
"ReactLoop",
"RecoveryResult",
"RecoveryStrategy",
"ShutdownChecker",
"ShutdownManager",
"ShutdownResult",
"ShutdownStrategy",
"StatusTransition",
"SystemPrompt",
"TaskCompletionMetrics",