Aureliolo · Mar 7, 2026
@@ -693,29 +693,35 @@ structured_phases:
                  │ CREATED   │
                  └─────┬─────┘
                        │ assignment
-                 ┌─────▼─────┐
-          ┌──────│ ASSIGNED   │
-          │      └─────┬─────┘
-          │            │ agent starts
+                 ┌─────▼─────┐           ┌──────────┐
+          ┌──────│ ASSIGNED   │──────────▶│  FAILED   │
+          │      └─────┬─────┘◀───┐      └────┬─────┘
+          │            │ starts    │ reassign  │
+          │      ┌─────▼─────┐    │      ┌────▼─────┐
+          │      │IN_PROGRESS │───┼─────▶│  (retry)  │
+          │      └─────┬─────┘   │       └──────────┘
+          │            │  ◀── (rework)
+          │            │ agent done
           │      ┌─────▼─────┐
-          │      │IN_PROGRESS │◀──── (rework)
-          │      └─────┬─────┘        │
-          │            │ agent done    │
-          │      ┌─────▼─────┐        │
-          │      │ IN_REVIEW  │───────┘
+          │      │ IN_REVIEW  │
           │      └─────┬─────┘
           │            │ approved
           │      ┌─────▼─────┐
           │      │ COMPLETED  │
           │      └────────────┘
           │
-          │ blocked / cancelled
-    ┌─────▼─────┐
-    │ BLOCKED /  │
-    │ CANCELLED  │
-    └────────────┘
+          │ blocked          cancelled
+    ┌─────▼─────┐      ┌────────────┐
+    │  BLOCKED   │      │ CANCELLED   │
+    └─────┬─────┘      └────────────┘
+          │ unblocked        (terminal)
+          └──▶ ASSIGNED
 ```
 
+> **Non-terminal states:** BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and CANCELLED are terminal states with no outgoing transitions.
+>
+> **Transitions into FAILED:** Both `ASSIGNED → FAILED` (early setup failures) and `IN_PROGRESS → FAILED` (runtime crashes) are valid. `FAILED → ASSIGNED` enables reassignment when `retry_count < max_retries`.
+
 > **Runtime wrapper (M3):** During execution, `Task` is wrapped by `TaskExecution` (in `engine/task_execution.py`). `TaskExecution` is a frozen Pydantic model that tracks status transitions via `model_copy(update=...)`, accumulates `TokenUsage` cost, and records a `StatusTransition` audit trail. The original `Task` is preserved unchanged; `to_task_snapshot()` produces a `Task` copy with the current execution status for persistence.
 
 ### 6.2 Task Definition
@@ -748,6 +754,7 @@ task:
   task_structure: "parallel"      # sequential, parallel, mixed (M4 — see §6.9)
   budget_limit: 2.00             # max USD for this task
   deadline: null
+  max_retries: 1                 # max reassignment attempts after failure (0 = no retry)
   status: "assigned"
 ```
 
@@ -952,11 +959,28 @@ When an agent execution fails unexpectedly (unhandled exception, OOM, process ki
 
 > **MVP: Fail-and-Reassign only (Strategy 1).** Checkpoint Recovery is M4/M5.
 
+**`RecoveryStrategy` protocol:**
+
+| Method | Signature | Description |
+|--------|-----------|-------------|
+| `recover` | `async def recover(*, task_execution: TaskExecution, error_message: str, context: AgentContext) -> RecoveryResult` | Apply recovery to a failed task execution |
+| `get_strategy_type` | `def get_strategy_type() -> str` | Return strategy type identifier (must not be empty) |
+
+**`RecoveryResult` model (frozen):**
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `task_execution` | `TaskExecution` | Updated execution after recovery (typically `FAILED`) |
+| `strategy_type` | `NotBlankStr` | Strategy identifier |
+| `context_snapshot` | `AgentContextSnapshot` | Redacted snapshot (turn count, accumulated cost, message count, max turns — no message contents) |
+| `error_message` | `NotBlankStr` | Error that triggered recovery |
+| `can_reassign` | `bool` (computed) | `retry_count < task.max_retries` |
+
 #### Strategy 1: Fail-and-Reassign (Default / MVP)
 
 The engine catches the failure at its outermost boundary, logs a redacted `AgentContext` snapshot (turn count, accumulated cost — excluding message contents to avoid leaking sensitive prompts/tool outputs), transitions the task to `FAILED`, and makes it available for reassignment (manual or automatic via the task router).
 
-> **New non-terminal state:** `FAILED` is a new `TaskStatus` variant to be added alongside `CANCELLED`. The §6.1 lifecycle diagram and `TaskStatus` enum will be updated when crash recovery is implemented in M3. `FAILED` differs from `CANCELLED` (which is terminal) in that failed tasks are eligible for automatic reassignment.
+> **Non-terminal state (implemented in M3):** `FAILED` is a `TaskStatus` variant alongside `CANCELLED`. `FAILED` differs from `CANCELLED` (which is terminal) in that failed tasks are eligible for automatic reassignment. Valid transitions: `IN_PROGRESS → FAILED`, `ASSIGNED → FAILED` (early setup failures), `FAILED → ASSIGNED` (reassignment). See the updated §6.1 lifecycle diagram.
 
 ```yaml
 crash_recovery:
@@ -967,10 +991,12 @@ crash_recovery:
 - All progress is lost on crash — acceptable for short single-agent tasks in the MVP
 
 On crash:
-1. Catch exception at the engine boundary (outermost `try/except` in the execution loop)
-2. Log at ERROR with redacted `AgentContext` snapshot (turn count, accumulated cost, tool call history — message contents excluded)
+1. Catch exception at the `AgentEngine` boundary (outermost `try/except` in `AgentEngine.run()`)
+2. Log at ERROR with redacted `AgentContextSnapshot` (turn count, accumulated cost, message count, max turns — message contents excluded)
 3. Transition `TaskExecution` → `FAILED` with the exception as the failure reason
-4. Task becomes available for reassignment via the task router
+4. `RecoveryResult.can_reassign` reports whether `retry_count < max_retries`
+
+> **M3 limitation:** `can_reassign` is computed and returned in `RecoveryResult`, but automated reassignment is not yet implemented; the task router (§6.4) will consume the flag in a later milestone. When it does, the router, as the caller, is responsible for incrementing `retry_count` when it creates the next `TaskExecution`.
 
 #### Strategy 2: Checkpoint Recovery (Planned — M4/M5)
 
@@ -2272,6 +2298,8 @@ ai-company/
 │       │   ├── loop_protocol.py    # ExecutionLoop protocol + result models
 │       │   ├── metrics.py          # TaskCompletionMetrics proxy overhead model
 │       │   ├── react_loop.py       # ReAct loop implementation
+│       │   ├── recovery.py         # Crash recovery strategies (RecoveryStrategy protocol)
+│       │   ├── cost_recording.py   # Per-turn cost recording helpers
 │       │   ├── run_result.py       # AgentRunResult outcome model
 │       │   ├── agent_engine.py     # Agent execution engine
 │       │   ├── task_engine.py      # Task routing & scheduling (M3-M4)
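
The `RecoveryStrategy` / `RecoveryResult` contract and the four "On crash" steps in §6.6 can be sketched roughly as follows. This is a minimal illustration, not the contents of `engine/recovery.py`: it uses frozen dataclasses as stand-ins for the real frozen Pydantic models, and the field layout of `TaskExecution` and `AgentContext`, the `"fail_and_reassign"` identifier, the `run_with_recovery` helper, and `crashing_agent` are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional, Protocol


@dataclass(frozen=True)
class TaskExecution:
    """Stand-in for the frozen Pydantic model; these fields are assumptions."""
    task_id: str
    status: str = "in_progress"
    retry_count: int = 0
    max_retries: int = 1


@dataclass(frozen=True)
class AgentContext:
    """Stand-in for the live agent context, including message contents."""
    messages: tuple[str, ...]
    turn_count: int
    accumulated_cost: float
    max_turns: int


@dataclass(frozen=True)
class AgentContextSnapshot:
    """Redacted view: counters only, never message contents."""
    turn_count: int
    accumulated_cost: float
    message_count: int
    max_turns: int


@dataclass(frozen=True)
class RecoveryResult:
    task_execution: TaskExecution
    strategy_type: str
    context_snapshot: AgentContextSnapshot
    error_message: str

    @property
    def can_reassign(self) -> bool:
        # Computed flag: retry_count < max_retries.
        return self.task_execution.retry_count < self.task_execution.max_retries


class RecoveryStrategy(Protocol):
    async def recover(self, *, task_execution: TaskExecution,
                      error_message: str, context: AgentContext) -> RecoveryResult: ...

    def get_strategy_type(self) -> str: ...


class FailAndReassignStrategy:
    def get_strategy_type(self) -> str:
        return "fail_and_reassign"  # hypothetical identifier

    async def recover(self, *, task_execution: TaskExecution,
                      error_message: str, context: AgentContext) -> RecoveryResult:
        snapshot = AgentContextSnapshot(
            turn_count=context.turn_count,
            accumulated_cost=context.accumulated_cost,
            message_count=len(context.messages),  # count only, contents dropped
            max_turns=context.max_turns,
        )
        return RecoveryResult(
            task_execution=replace(task_execution, status="failed"),
            strategy_type=self.get_strategy_type(),
            context_snapshot=snapshot,
            error_message=error_message,
        )


async def run_with_recovery(agent, execution: TaskExecution, context: AgentContext,
                            strategy: RecoveryStrategy) -> Optional[RecoveryResult]:
    """Outermost try/except: any agent failure is routed to the strategy."""
    try:
        await agent(execution, context)
    except Exception as exc:
        return await strategy.recover(
            task_execution=execution,
            error_message=str(exc) or type(exc).__name__,  # never blank
            context=context,
        )
    return None


async def crashing_agent(execution: TaskExecution, context: AgentContext) -> None:
    raise RuntimeError("tool process killed")


result = asyncio.run(run_with_recovery(
    crashing_agent,
    TaskExecution(task_id="task-1"),
    AgentContext(messages=("secret prompt",), turn_count=3,
                 accumulated_cost=0.42, max_turns=20),
    FailAndReassignStrategy(),
))
```

In this sketch the strategy never mutates its input: `replace()` plays the role that `model_copy(update=...)` plays for the frozen Pydantic model, so the caller keeps the pre-failure execution for its audit trail.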

@@ -127,11 +127,13 @@ class TaskStatus(StrEnum):
     Summary for quick reference:
 
         CREATED -> ASSIGNED
-        ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED
-        IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED
+        ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED | FAILED
+        IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED | FAILED
         IN_REVIEW -> COMPLETED | IN_PROGRESS (rework) | BLOCKED | CANCELLED
         BLOCKED -> ASSIGNED (unblocked)
+        FAILED -> ASSIGNED (reassignment for retry)
         COMPLETED and CANCELLED are terminal states.
+        FAILED is non-terminal (can be reassigned).
     """
 
     CREATED = "created"
@@ -140,6 +142,7 @@ class TaskStatus(StrEnum):
     IN_REVIEW = "in_review"
     COMPLETED = "completed"
     BLOCKED = "blocked"
+    FAILED = "failed"
     CANCELLED = "cancelled"
 
 

@@ -58,6 +58,7 @@ class Task(BaseModel):
         estimated_complexity: Task complexity estimate.
         budget_limit: Maximum USD spend for this task.
         deadline: Optional deadline (ISO 8601 string or ``None``).
+        max_retries: Max reassignment attempts after failure (default 1; 0 disables retry).
         status: Current lifecycle status.
     """
 
@@ -112,6 +113,11 @@ class Task(BaseModel):
         default=None,
         description="Optional deadline (ISO 8601 string)",
     )
+    max_retries: int = Field(
+        default=1,
+        ge=0,
+        description="Max reassignment attempts after failure",
+    )
     status: TaskStatus = Field(
         default=TaskStatus.CREATED,
         description="Current lifecycle status",
@@ -153,8 +159,8 @@ def _validate_assignment_consistency(self) -> Self:
 
         ``CREATED`` status must have ``assigned_to=None``.  Statuses beyond
         ``CREATED`` (``ASSIGNED``, ``IN_PROGRESS``, ``IN_REVIEW``,
-        ``COMPLETED``) require ``assigned_to`` to be set.  ``BLOCKED``
-        and ``CANCELLED`` may or may not have an assignee.
+        ``COMPLETED``) require ``assigned_to`` to be set.  ``BLOCKED``,
+        ``FAILED``, and ``CANCELLED`` may or may not have an assignee.
         """
         requires_assignee = {
             TaskStatus.ASSIGNED,
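
To make the `max_retries` semantics concrete, the loop below counts how many executions a task gets when every attempt fails. It is a worked illustration only: `count_executions` is a hypothetical helper, and it assumes the first `TaskExecution` starts with `retry_count = 0` and that the caller increments the counter once per reassignment, as the M3 limitation note in §6.6 describes.

```python
def count_executions(max_retries: int) -> int:
    """Return the number of executions if every attempt ends in FAILED."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")  # mirrors Field(ge=0)
    retry_count = 0  # assumed starting value for the first execution
    executions = 1   # the initial attempt always runs
    while retry_count < max_retries:  # same test as RecoveryResult.can_reassign
        retry_count += 1  # done by the caller when it reassigns the task
        executions += 1
    return executions


attempts = {n: count_executions(n) for n in (0, 1, 3)}
```

Under these assumptions the default `max_retries=1` allows two executions in total (the original plus one retry), and `max_retries=0` allows exactly one.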

@@ -1,17 +1,18 @@
 """Task lifecycle state machine transitions.
 
 Defines the valid state transitions for the task lifecycle, based on
-DESIGN_SPEC Section 6.1 and extended with BLOCKED and CANCELLED
-transitions from IN_PROGRESS and IN_REVIEW for completeness::
+DESIGN_SPEC Sections 6.1 and 6.6, extended with BLOCKED, CANCELLED, and
+FAILED transitions for completeness::
 
     CREATED -> ASSIGNED
-    ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED
-    IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED
+    ASSIGNED -> IN_PROGRESS | BLOCKED | CANCELLED | FAILED
+    IN_PROGRESS -> IN_REVIEW | BLOCKED | CANCELLED | FAILED
     IN_REVIEW -> COMPLETED | IN_PROGRESS (rework) | BLOCKED | CANCELLED
     BLOCKED -> ASSIGNED (unblocked)
+    FAILED -> ASSIGNED (reassignment for retry)
 
 COMPLETED and CANCELLED are terminal states with no outgoing
-transitions.
+transitions.  FAILED is non-terminal (can be reassigned).
 """
 
 from ai_company.core.enums import TaskStatus
@@ -30,13 +31,15 @@
             TaskStatus.IN_PROGRESS,
             TaskStatus.BLOCKED,
             TaskStatus.CANCELLED,
+            TaskStatus.FAILED,
         }
     ),
     TaskStatus.IN_PROGRESS: frozenset(
         {
             TaskStatus.IN_REVIEW,
             TaskStatus.BLOCKED,
             TaskStatus.CANCELLED,
+            TaskStatus.FAILED,
         }
     ),
     TaskStatus.IN_REVIEW: frozenset(
@@ -48,6 +51,7 @@
         }
     ),
     TaskStatus.BLOCKED: frozenset({TaskStatus.ASSIGNED}),
+    TaskStatus.FAILED: frozenset({TaskStatus.ASSIGNED}),  # reassignment
     TaskStatus.COMPLETED: frozenset(),  # terminal
     TaskStatus.CANCELLED: frozenset(),  # terminal
 }
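
For reviewers who want to sanity-check the new edges, the updated transition table can be exercised with a small standalone script. This is a sketch rather than the module itself: the enum is a local stand-in (its `"in_progress"` value is assumed), and the names `VALID_TRANSITIONS`, `is_valid_transition`, and `is_terminal` are hypothetical, chosen for the example.

```python
from enum import Enum


class TaskStatus(str, Enum):
    """Local stand-in for ai_company.core.enums.TaskStatus."""
    CREATED = "created"
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    IN_REVIEW = "in_review"
    COMPLETED = "completed"
    BLOCKED = "blocked"
    FAILED = "failed"
    CANCELLED = "cancelled"


S = TaskStatus

# Mirrors the transition summary in the module docstring.
VALID_TRANSITIONS: dict[TaskStatus, frozenset[TaskStatus]] = {
    S.CREATED: frozenset({S.ASSIGNED}),
    S.ASSIGNED: frozenset({S.IN_PROGRESS, S.BLOCKED, S.CANCELLED, S.FAILED}),
    S.IN_PROGRESS: frozenset({S.IN_REVIEW, S.BLOCKED, S.CANCELLED, S.FAILED}),
    S.IN_REVIEW: frozenset({S.COMPLETED, S.IN_PROGRESS, S.BLOCKED, S.CANCELLED}),
    S.BLOCKED: frozenset({S.ASSIGNED}),
    S.FAILED: frozenset({S.ASSIGNED}),  # reassignment
    S.COMPLETED: frozenset(),  # terminal
    S.CANCELLED: frozenset(),  # terminal
}


def is_valid_transition(current: TaskStatus, target: TaskStatus) -> bool:
    """True if the lifecycle allows moving from `current` to `target`."""
    return target in VALID_TRANSITIONS[current]


def is_terminal(status: TaskStatus) -> bool:
    """A status is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[status]


terminal_states = {s for s in TaskStatus if is_terminal(s)}
```

Note that `FAILED` can only be left through `ASSIGNED`: a failed task cannot resume `IN_PROGRESS` directly, and `IN_REVIEW` has no edge into `FAILED`.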

@@ -34,6 +34,11 @@
     build_system_prompt,
 )
 from ai_company.engine.react_loop import ReactLoop
+from ai_company.engine.recovery import (
+    FailAndReassignStrategy,
+    RecoveryResult,
+    RecoveryStrategy,
+)
 from ai_company.engine.run_result import AgentRunResult
 from ai_company.engine.task_execution import StatusTransition, TaskExecution
 from ai_company.providers.models import ZERO_TOKEN_USAGE, add_token_usage
@@ -52,11 +57,14 @@
     "ExecutionLoop",
     "ExecutionResult",
     "ExecutionStateError",
+    "FailAndReassignStrategy",
     "LoopExecutionError",
     "MaxTurnsExceededError",
     "PromptBuildError",
     "PromptTokenEstimator",
     "ReactLoop",
+    "RecoveryResult",
+    "RecoveryStrategy",
     "StatusTransition",
     "SystemPrompt",
     "TaskCompletionMetrics",