Aureliolo · Aureliolo · Mar 5, 2026 · Mar 5, 2026 · Mar 5, 2026 · Mar 5, 2026
@@ -94,6 +94,38 @@ Based on changed files, launch applicable review agents **in parallel** using th
 | **type-design-analyzer** | Type annotations or classes added/modified | `pr-review-toolkit:type-design-analyzer` |
 | **logging-audit** | Any `.py` file in `src/` changed | `pr-review-toolkit:code-reviewer` |
 | **resilience-audit** | Provider-layer `.py` files changed (`src/ai_company/providers/`) | `pr-review-toolkit:code-reviewer` |
+| **docs-consistency** | **ALWAYS** — runs on every PR regardless of change type | `pr-review-toolkit:code-reviewer` |
+
+The **docs-consistency** agent ensures project documentation never drifts from the codebase. It runs on **every PR** — code changes, config changes, docs-only changes, all of them.
+
+**What to check:**
+
+Read the current `DESIGN_SPEC.md`, `CLAUDE.md`, and `README.md` in full. Then compare them against the PR diff and the actual current state of the codebase. Flag anything that is now inaccurate, incomplete, or missing.
+
+**DESIGN_SPEC.md (CRITICAL — this is the project's source of truth):**
+1. §15.3 Project Structure — does it match the actual files/directories under `src/ai_company/`? Any new modules missing? Any listed files that no longer exist? (CRITICAL)
+2. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
+3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
-2. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
-3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+2. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+3. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
-2. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
-3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+2. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+3. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
+4. §3.1 Agent Identity Card — does the config/runtime split documentation match the actual model code? (MAJOR)
+5. §10.2 Cost Tracking — does the implementation note match the actual `TokenUsage` and spending summary models? (MAJOR)
+6. §11.1.1 Tool Execution Model — does it match actual `ToolInvoker` behavior? (MAJOR)
+7. §15.2 Technology Stack — are versions, libraries, and rationale current? (MEDIUM)
+8. Any other section that describes behavior, structure, or patterns that have changed (MAJOR)
+
+**CLAUDE.md (CRITICAL — this guides all future development):**
+9. Code Conventions — do documented patterns match what's actually in the code? New patterns used but not documented? Documented patterns no longer followed? (CRITICAL)
+10. Logging section — are event import paths, logger patterns, and rules accurate? (CRITICAL)
+11. Resilience section — does it match the actual retry/rate-limit implementation? (MAJOR)
+12. Package Structure — does it match the actual directory layout? (MAJOR)
+13. Testing section — are markers, commands, and conventions current? (MEDIUM)
+14. Any other section that gives instructions that don't match reality (CRITICAL)
+
+**README.md:**
+15. Installation, usage, and getting-started instructions — still accurate? (MAJOR)
+16. Feature descriptions — do they match what's actually built? (MEDIUM)
+17. Links — any dead links or references to things that moved? (MINOR)
+
+**Key principle:** It is better to flag a false positive than to let documentation drift silently. When in doubt, flag it.
 
 The **logging-audit** agent prompt must check for these violations (see CLAUDE.md `## Logging`):
 

@@ -164,6 +164,40 @@ This captures committed-but-unpushed changes AND any uncommitted/untracked work
 | **logging-audit** | Any `src_py` changed | `pr-review-toolkit:code-reviewer` (custom prompt below) |
 | **resilience-audit** | Files in `src/ai_company/providers/` changed | `pr-review-toolkit:code-reviewer` (custom prompt below) |
 | **security-reviewer** | Files in `src/ai_company/api/`, `src/ai_company/security/`, `src/ai_company/tools/`, `src/ai_company/config/` changed, OR diff contains `subprocess`, `eval`, `exec`, `pickle`, `yaml.load`, auth/credential patterns | `everything-claude-code:security-reviewer` |
+| **docs-consistency** | **ALWAYS** — runs on every PR regardless of change type | `pr-review-toolkit:code-reviewer` (custom prompt below) |
+
+### Docs-consistency custom prompt
+
+The docs-consistency agent ensures project documentation never drifts from the codebase. It runs on **every PR** — code changes, config changes, docs-only changes, all of them.
+
+**What to check:**
+
+Read the current `DESIGN_SPEC.md`, `CLAUDE.md`, and `README.md` in full. Then compare them against the PR diff and the actual current state of the codebase. Flag anything that is now inaccurate, incomplete, or missing.
+
+**DESIGN_SPEC.md (CRITICAL — this is the project's source of truth):**
+1. §15.3 Project Structure — does it match the actual files/directories under `src/ai_company/`? Any new modules missing? Any listed files that no longer exist? (CRITICAL)
+2. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
+3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
-2. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
-3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+2. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+3. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
-2. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
-3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+2. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+3. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? (MAJOR)
+4. §3.1 Agent Identity Card — does the config/runtime split documentation match the actual model code? (MAJOR)
+5. §10.2 Cost Tracking — does the implementation note match the actual `TokenUsage` and spending summary models? (MAJOR)
+6. §11.1.1 Tool Execution Model — does it match actual `ToolInvoker` behavior? (MAJOR)
+7. §15.2 Technology Stack — are versions, libraries, and rationale current? (MEDIUM)
+8. Any other section that describes behavior, structure, or patterns that have changed (MAJOR)
+
+**CLAUDE.md (CRITICAL — this guides all future development):**
+9. Code Conventions — do documented patterns match what's actually in the code? New patterns used but not documented? Documented patterns no longer followed? (CRITICAL)
+10. Logging section — are event import paths, logger patterns, and rules accurate? (CRITICAL)
+11. Resilience section — does it match the actual retry/rate-limit implementation? (MAJOR)
+12. Package Structure — does it match the actual directory layout? (MAJOR)
+13. Testing section — are markers, commands, and conventions current? (MEDIUM)
+14. Any other section that gives instructions that don't match reality (CRITICAL)
+
+**README.md:**
+15. Installation, usage, and getting-started instructions — still accurate? (MAJOR)
+16. Feature descriptions — do they match what's actually built? (MEDIUM)
+17. Links — any dead links or references to things that moved? (MINOR)
+
+**Key principle:** It is better to flag a false positive than to let documentation drift silently. When in doubt, flag it.
 
 ### Logging-audit custom prompt
 

@@ -66,8 +66,10 @@ src/ai_company/
 - **PEP 758 except syntax**: use `except A, B:` (no parentheses) — ruff enforces this on Python 3.14
 - **Type hints**: all public functions, mypy strict mode
 - **Docstrings**: Google style, required on public classes/functions (enforced by ruff D rules)
-- **Immutability**: create new objects, never mutate existing ones
-- **Models**: Pydantic v2 (`BaseModel`, `model_validator`, `ConfigDict`)
+- **Immutability**: create new objects, never mutate existing ones. For `dict`/`list` fields in frozen Pydantic models, use `MappingProxyType` wrapping at construction (not `deepcopy` on access). Deep-copy only at system boundaries (e.g. passing data to `tool.execute()`, serializing for persistence).
+- **Config vs runtime state**: frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state that evolves (e.g. agent execution state, task progress). Never mix static config fields with mutable runtime fields in one model.
+- **Models**: Pydantic v2 (`BaseModel`, `model_validator`, `computed_field`, `ConfigDict`). Use `@computed_field` for derived values (e.g. `total_tokens = input + output`) instead of storing + validating redundant fields. Use `NotBlankStr` (from `core.types`) for all non-optional identifier/name fields instead of manual whitespace validators.
+- **Async concurrency**: use `asyncio.TaskGroup` for fan-out/fan-in parallel operations (e.g. multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare `create_task`.
 - **Line length**: 88 characters (ruff)
 - **Functions**: < 50 lines, files < 800 lines
 - **Errors**: handle explicitly, never silently swallow
@@ -78,7 +80,7 @@ src/ai_company/
 - **Every module** with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
 - **Never** use `import logging` / `logging.getLogger()` / `print()` in application code
 - **Variable name**: always `logger` (not `_logger`, not `log`)
-- **Event names**: always use constants from `ai_company.observability.events`
+- **Event names**: always use constants from the appropriate domain module under `ai_company.observability.events/` (e.g. `events.provider`, `events.budget`, `events.tool`, `events.config`, `events.template`, `events.routing`, `events.task`, `events.prompt`)
 - **Structured kwargs**: always `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`
 - **All error paths** must log at WARNING or ERROR with context before raising
 - **All state transitions** must log at INFO