feat: implement crash recovery with fail-and-reassign strategy #149
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -693,29 +693,35 @@ structured_phases: | |
| │ CREATED │ | ||
| └─────┬─────┘ | ||
| │ assignment | ||
| ┌─────▼─────┐ | ||
| ┌──────│ ASSIGNED │ | ||
| │ └─────┬─────┘ | ||
| │ │ agent starts | ||
| ┌─────▼─────┐ ┌──────────┐ | ||
| ┌──────│ ASSIGNED │──────────▶│ FAILED │ | ||
| │ └─────┬─────┘◀───┐ └────┬─────┘ | ||
| │ │ starts │ reassign │ | ||
| │ ┌─────▼─────┐ │ ┌────▼─────┐ | ||
| │ │IN_PROGRESS │───┼─────▶│ (retry) │ | ||
| │ └─────┬─────┘ │ └──────────┘ | ||
| │ │ ◀── (rework) | ||
| │ │ agent done | ||
| │ ┌─────▼─────┐ | ||
| │ │IN_PROGRESS │◀──── (rework) | ||
| │ └─────┬─────┘ │ | ||
| │ │ agent done │ | ||
| │ ┌─────▼─────┐ │ | ||
| │ │ IN_REVIEW │───────┘ | ||
| │ │ IN_REVIEW │ | ||
| │ └─────┬─────┘ | ||
| │ │ approved | ||
| │ ┌─────▼─────┐ | ||
| │ │ COMPLETED │ | ||
| │ └────────────┘ | ||
| │ | ||
| │ blocked / cancelled | ||
| ┌─────▼─────┐ | ||
| │ BLOCKED / │ | ||
| │ CANCELLED │ | ||
| └────────────┘ | ||
| │ blocked cancelled | ||
| ┌─────▼─────┐ ┌────────────┐ | ||
| │ BLOCKED │ │ CANCELLED │ | ||
| └─────┬─────┘ └────────────┘ | ||
| │ unblocked (terminal) | ||
| └──▶ ASSIGNED | ||
| ``` | ||
|
|
||
| > **Non-terminal states:** BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and CANCELLED are terminal states with no outgoing transitions. | ||
| > | ||
| > **Transitions into FAILED:** Both `ASSIGNED → FAILED` (early setup failures) and `IN_PROGRESS → FAILED` (runtime crashes) are valid. `FAILED → ASSIGNED` enables reassignment when `retry_count < max_retries`. | ||
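The transition rules above can be sketched as a lookup table. This is a minimal illustration, not the project's actual enum or validator: the `VALID_TRANSITIONS` table, `can_transition`, and `is_terminal` are hypothetical names, the string values are assumed, and the edges other than those into and out of `FAILED` and `BLOCKED` are one reading of the diagram and may be incomplete.

```python
from enum import Enum


class TaskStatus(Enum):
    CREATED = "created"
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    IN_REVIEW = "in_review"
    COMPLETED = "completed"
    BLOCKED = "blocked"
    FAILED = "failed"
    CANCELLED = "cancelled"


S = TaskStatus

# Hypothetical table. The FAILED and BLOCKED edges come from the notes above;
# the remaining edges are read off the lifecycle diagram (assumption).
VALID_TRANSITIONS = {
    S.CREATED: frozenset({S.ASSIGNED}),
    S.ASSIGNED: frozenset({S.IN_PROGRESS, S.FAILED, S.BLOCKED, S.CANCELLED}),
    S.IN_PROGRESS: frozenset({S.IN_REVIEW, S.FAILED}),
    S.IN_REVIEW: frozenset({S.COMPLETED, S.IN_PROGRESS}),  # IN_PROGRESS = rework
    S.BLOCKED: frozenset({S.ASSIGNED}),
    S.FAILED: frozenset({S.ASSIGNED}),
    S.COMPLETED: frozenset(),   # terminal
    S.CANCELLED: frozenset(),   # terminal
}


def can_transition(src, dst, *, retry_count=0, max_retries=0):
    """Return True if src -> dst is allowed; FAILED -> ASSIGNED also needs retries left."""
    if dst not in VALID_TRANSITIONS[src]:
        return False
    if src is S.FAILED:
        return retry_count < max_retries
    return True


def is_terminal(status):
    """A status is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[status]
```

With `max_retries=1`, a first failure (`retry_count=0`) may return to `ASSIGNED`, while a second failure (`retry_count=1`) may not.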
|
|
||
| > **Runtime wrapper (M3):** During execution, `Task` is wrapped by `TaskExecution` (in `engine/task_execution.py`). `TaskExecution` is a frozen Pydantic model that tracks status transitions via `model_copy(update=...)`, accumulates `TokenUsage` cost, and records a `StatusTransition` audit trail. The original `Task` is preserved unchanged; `to_task_snapshot()` produces a `Task` copy with the current execution status for persistence. | ||
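To illustrate the immutable-update pattern described above, here is a rough sketch that uses stdlib frozen dataclasses as a stand-in for the frozen Pydantic model, with `dataclasses.replace` playing the role of `model_copy(update=...)`. The field names and the `with_transition` helper are assumptions made for the example; the real `TaskExecution` in `engine/task_execution.py` may look different.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class StatusTransition:
    # Hypothetical audit-trail entry; the real field names may differ.
    from_status: str
    to_status: str
    reason: str
    at: datetime


@dataclass(frozen=True)
class TaskExecution:
    # Stand-in for the frozen Pydantic model; fields are illustrative.
    task_id: str
    status: str
    transitions: Tuple[StatusTransition, ...] = ()

    def with_transition(self, to_status: str, reason: str) -> "TaskExecution":
        """Return a new instance with the status changed and the change recorded."""
        entry = StatusTransition(
            from_status=self.status,
            to_status=to_status,
            reason=reason,
            at=datetime.now(timezone.utc),
        )
        # replace() copies the frozen instance, analogous to model_copy(update=...).
        return replace(self, status=to_status, transitions=self.transitions + (entry,))


running = TaskExecution(task_id="task-1", status="in_progress")
failed = running.with_transition("failed", "worker process killed")
```

Because every transition returns a new object, `running` still reports `in_progress` after `failed` is created, which is what makes the audit trail safe to share.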
|
|
||
| ### 6.2 Task Definition | ||
|
|
@@ -748,6 +754,7 @@ task: | |
| task_structure: "parallel" # sequential, parallel, mixed (M4 — see §6.9) | ||
| budget_limit: 2.00 # max USD for this task | ||
| deadline: null | ||
| max_retries: 1 # max reassignment attempts after failure (0 = no retry) | ||
| status: "assigned" | ||
| ``` | ||
|
|
||
|
|
@@ -952,11 +959,28 @@ When an agent execution fails unexpectedly (unhandled exception, OOM, process ki | |
|
|
||
| > **MVP: Fail-and-Reassign only (Strategy 1).** Checkpoint Recovery is M4/M5. | ||
|
|
||
| **`RecoveryStrategy` protocol:** | ||
|
|
||
| | Method | Signature | Description | | ||
| |--------|-----------|-------------| | ||
| | `recover` | `async def recover(*, task_execution: TaskExecution, error_message: str, context: AgentContext) -> RecoveryResult` | Apply recovery to a failed task execution | | ||
| | `get_strategy_type` | `def get_strategy_type() -> str` | Return strategy type identifier (must not be empty) | | ||
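One way to express the two-method protocol above is with `typing.Protocol`. The sketch below is illustrative only: `TaskExecution`, `AgentContext`, and `RecoveryResult` are reduced stand-ins, and the `FailAndReassign` class and its `"fail_reassign"` identifier are hypothetical names chosen for the example.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class TaskExecution:  # reduced stand-in
    task_id: str
    status: str


@dataclass(frozen=True)
class AgentContext:  # reduced stand-in
    turn_count: int


@dataclass(frozen=True)
class RecoveryResult:  # reduced stand-in; see the field table for the full model
    task_execution: TaskExecution
    strategy_type: str
    error_message: str


@runtime_checkable
class RecoveryStrategy(Protocol):
    async def recover(
        self, *, task_execution: TaskExecution, error_message: str, context: AgentContext
    ) -> RecoveryResult: ...

    def get_strategy_type(self) -> str: ...


class FailAndReassign:
    """Hypothetical strategy: mark the execution failed and report the error."""

    async def recover(self, *, task_execution, error_message, context):
        failed = replace(task_execution, status="failed")
        return RecoveryResult(failed, self.get_strategy_type(), error_message)

    def get_strategy_type(self) -> str:
        return "fail_reassign"  # assumed identifier, must not be empty


strategy = FailAndReassign()
result = asyncio.run(
    strategy.recover(
        task_execution=TaskExecution("task-1", "in_progress"),
        error_message="boom",
        context=AgentContext(turn_count=3),
    )
)
```

`FailAndReassign` never inherits from `RecoveryStrategy`; it satisfies the protocol structurally. Note that `isinstance` with `runtime_checkable` only checks that the methods exist, not that their signatures match.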
|
|
||
| **`RecoveryResult` model (frozen):** | ||
|
|
||
| | Field | Type | Description | | ||
| |-------|------|-------------| | ||
| | `task_execution` | `TaskExecution` | Updated execution after recovery (typically `FAILED`) | | ||
| | `strategy_type` | `NotBlankStr` | Strategy identifier | | ||
| | `context_snapshot` | `AgentContextSnapshot` | Redacted snapshot (turn count, accumulated cost, message count, max turns — no message contents) | | ||
| | `error_message` | `NotBlankStr` | Error that triggered recovery | | ||
| | `can_reassign` | `bool` (computed) | `retry_count < task.max_retries` | | ||
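The computed `can_reassign` field can be sketched as a read-only property. This example again uses frozen dataclasses in place of Pydantic, and it assumes that `retry_count` lives on `TaskExecution` and `max_retries` on the wrapped `Task`; the blank-string check is a simplified substitute for `NotBlankStr`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:  # reduced stand-in
    id: str
    max_retries: int = 1


@dataclass(frozen=True)
class TaskExecution:  # reduced stand-in
    task: Task
    status: str
    retry_count: int = 0


@dataclass(frozen=True)
class AgentContextSnapshot:
    # Redacted view: counters only, no message contents.
    turn_count: int
    accumulated_cost: float
    message_count: int
    max_turns: int


@dataclass(frozen=True)
class RecoveryResult:
    task_execution: TaskExecution
    strategy_type: str
    context_snapshot: AgentContextSnapshot
    error_message: str

    def __post_init__(self):
        # Simplified stand-in for NotBlankStr validation.
        if not self.strategy_type.strip() or not self.error_message.strip():
            raise ValueError("strategy_type and error_message must not be blank")

    @property
    def can_reassign(self) -> bool:
        # Computed, never stored: retries remain while retry_count < max_retries.
        return self.task_execution.retry_count < self.task_execution.task.max_retries


def make_result(retry_count: int, max_retries: int) -> RecoveryResult:
    """Helper for the example: build a result for a failed execution."""
    execution = TaskExecution(Task("task-1", max_retries), "failed", retry_count)
    snapshot = AgentContextSnapshot(4, 0.12, 9, 20)
    return RecoveryResult(execution, "fail_reassign", snapshot, "boom")
```

With the default `max_retries: 1`, the first failure yields `can_reassign == True` and the second yields `False`; with `max_retries: 0` it is always `False`.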
|
|
||
| #### Strategy 1: Fail-and-Reassign (Default / MVP) | ||
|
|
||
| The engine catches the failure at its outermost boundary, logs a redacted `AgentContext` snapshot (turn count, accumulated cost — excluding message contents to avoid leaking sensitive prompts/tool outputs), transitions the task to `FAILED`, and makes it available for reassignment (manual or automatic via the task router). | ||
|
|
||
| > **New non-terminal state:** `FAILED` is a new `TaskStatus` variant to be added alongside `CANCELLED`. The §6.1 lifecycle diagram and `TaskStatus` enum will be updated when crash recovery is implemented in M3. `FAILED` differs from `CANCELLED` (which is terminal) in that failed tasks are eligible for automatic reassignment. | ||
| > **Non-terminal state (implemented in M3):** `FAILED` is a `TaskStatus` variant alongside `CANCELLED`. `FAILED` differs from `CANCELLED` (which is terminal) in that failed tasks are eligible for automatic reassignment. Valid transitions: `IN_PROGRESS → FAILED`, `ASSIGNED → FAILED` (early setup failures), `FAILED → ASSIGNED` (reassignment). See the updated §6.1 lifecycle diagram. | ||
|
|
||
| ```yaml | ||
| crash_recovery: | ||
|
|
@@ -967,10 +991,12 @@ crash_recovery: | |
| - All progress is lost on crash — acceptable for short single-agent tasks in the MVP | ||
|
|
||
| On crash: | ||
| 1. Catch exception at the engine boundary (outermost `try/except` in the execution loop) | ||
| 2. Log at ERROR with redacted `AgentContext` snapshot (turn count, accumulated cost, tool call history — message contents excluded) | ||
| 1. Catch exception at the `AgentEngine` boundary (outermost `try/except` in `AgentEngine.run()`) | ||
| 2. Log at ERROR with redacted `AgentContextSnapshot` (turn count, accumulated cost, message count, max turns — message contents excluded) | ||
| 3. Transition `TaskExecution` → `FAILED` with the exception as the failure reason | ||
| 4. Task becomes available for reassignment via the task router | ||
| 4. `RecoveryResult.can_reassign` reports whether `retry_count < max_retries` | ||
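The four steps above might look roughly like this in code. It is a synchronous, simplified sketch rather than the actual `AgentEngine.run()`: the `run_with_recovery` function, the `to_snapshot` helper, the dict-based snapshot, and the field names are all assumptions for illustration.

```python
import logging
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple

logger = logging.getLogger("engine")


@dataclass(frozen=True)
class AgentContext:  # reduced stand-in
    turn_count: int
    accumulated_cost: float
    max_turns: int
    messages: Tuple[str, ...] = ()

    def to_snapshot(self) -> dict:
        """Redacted view: counters only, message contents are never copied."""
        return {
            "turn_count": self.turn_count,
            "accumulated_cost": self.accumulated_cost,
            "message_count": len(self.messages),
            "max_turns": self.max_turns,
        }


@dataclass(frozen=True)
class TaskExecution:  # reduced stand-in
    task_id: str
    status: str
    max_retries: int = 1
    retry_count: int = 0
    failure_reason: Optional[str] = None


def run_with_recovery(
    execution: TaskExecution,
    context: AgentContext,
    step: Callable[[AgentContext], None],
) -> Tuple[TaskExecution, bool]:
    """Run one execution; on a crash return (failed execution, can_reassign)."""
    try:
        step(context)
    except Exception as exc:  # step 1: outermost boundary
        snapshot = context.to_snapshot()
        logger.error("agent execution failed: %s context=%s", exc, snapshot)  # step 2
        failed = replace(execution, status="failed", failure_reason=str(exc))  # step 3
        return failed, failed.retry_count < failed.max_retries  # step 4
    return replace(execution, status="in_review"), False


def crashing_step(context: AgentContext) -> None:
    raise RuntimeError("tool subprocess exited unexpectedly")


ctx = AgentContext(turn_count=3, accumulated_cost=0.05, max_turns=20, messages=("secret prompt",))
outcome, can_reassign = run_with_recovery(TaskExecution("task-1", "in_progress"), ctx, crashing_step)
```

The snapshot carries only counts, so the logged line never contains `"secret prompt"`. A `try/except Exception` boundary handles unhandled exceptions only; a process that is killed outright never reaches it.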
|
|
||
| > **M3 limitation:** The `can_reassign` flag is computed and returned in `RecoveryResult`, but automated reassignment is not yet implemented — the task router (§6.4) will consume this in a later milestone. The caller (task router) is responsible for incrementing `retry_count` when creating the next `TaskExecution`. | ||
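Since automated reassignment is deferred, the following is only a guess at how a future caller could consume the flag and increment `retry_count`. The `next_attempt` function and the flattened `TaskExecution` fields are hypothetical and are not part of this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskExecution:  # reduced stand-in with max_retries flattened in
    task_id: str
    status: str
    max_retries: int
    retry_count: int = 0


def next_attempt(failed: TaskExecution) -> Optional[TaskExecution]:
    """Create the next execution for a failed task, or None when retries are exhausted."""
    if failed.status != "failed":
        raise ValueError("only FAILED executions can be reassigned")
    if failed.retry_count >= failed.max_retries:
        return None  # can_reassign would be False here
    # The caller, not the recovery strategy, increments retry_count.
    return TaskExecution(
        task_id=failed.task_id,
        status="assigned",
        max_retries=failed.max_retries,
        retry_count=failed.retry_count + 1,
    )


first_failure = TaskExecution("task-1", "failed", max_retries=1)
retry = next_attempt(first_failure)
```

Keeping the increment in the caller means the recovery strategy stays a pure function of the failed execution, which matches the split of responsibilities described in the note above.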
|
|
||
| #### Strategy 2: Checkpoint Recovery (Planned — M4/M5) | ||
|
|
||
|
|
@@ -2272,6 +2298,8 @@ ai-company/ | |
| │ │ ├── loop_protocol.py # ExecutionLoop protocol + result models | ||
| │ │ ├── metrics.py # TaskCompletionMetrics proxy overhead model | ||
| │ │ ├── react_loop.py # ReAct loop implementation | ||
| │ │ ├── recovery.py # Crash recovery strategies (RecoveryStrategy protocol) | ||
| │ │ ├── cost_recording.py # Per-turn cost recording helpers | ||
| │ │ ├── run_result.py # AgentRunResult outcome model | ||
| │ │ ├── agent_engine.py # Agent execution engine | ||
|
Comment on lines +2301 to 2304
Contributor
Update the The new |
||
| │ │ ├── task_engine.py # Task routing & scheduling (M3-M4) | ||
|
|
||
Remove the blank line inside this blockquote.
Line 721 triggers markdownlint MD028 (`no-blanks-blockquote`). Keep the blockquote contiguous to avoid the lint failure.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 721-721: Blank line inside blockquote (MD028, no-blanks-blockquote)