Aureliolo · Mar 5, 2026
@@ -94,6 +94,40 @@ Based on changed files, launch applicable review agents **in parallel** using th
 | **type-design-analyzer** | Type annotations or classes added/modified | `pr-review-toolkit:type-design-analyzer` |
 | **logging-audit** | Any `.py` file in `src/` changed | `pr-review-toolkit:code-reviewer` |
 | **resilience-audit** | Provider-layer `.py` files changed (`src/ai_company/providers/`) | `pr-review-toolkit:code-reviewer` |
+| **docs-consistency** | **ALWAYS** — runs on every PR regardless of change type | `pr-review-toolkit:code-reviewer` |
+
+The **docs-consistency** agent ensures project documentation never drifts from the codebase. It runs on **every PR** — code changes, config changes, docs-only changes, all of them.
+
+**What to check:**
+
+Read the current `DESIGN_SPEC.md`, `CLAUDE.md`, and `README.md` in full. Then compare them against the PR diff and the actual current state of the codebase. Flag anything that is now inaccurate, incomplete, or missing.
+
+**DESIGN_SPEC.md (CRITICAL — this is the project's source of truth):**
+1. §15.3 Project Structure — does it match the actual files/directories under `src/ai_company/`? Any new modules missing? Any listed files that no longer exist? (CRITICAL)
+2. §3.1 Agent Identity Card — does the config/runtime split documentation match the actual model code? (MAJOR)
+3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+4. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? Are "Adopted" vs "Planned" labels still accurate? (MAJOR)
+5. §10.2 Cost Tracking — does the implementation note match the actual `TokenUsage` and spending summary models? (MAJOR)
+6. §11.1.1 Tool Execution Model — does it match actual `ToolInvoker` behavior? (MAJOR)
+7. §15.2 Technology Stack — are versions, libraries, and rationale current? (MEDIUM)
+8. §9.2 Provider Configuration — are model IDs, provider capability examples, and config/runtime mapping still representative? (MEDIUM)
+9. §9.3 LiteLLM Integration — does the integration status match reality? (MEDIUM)
+10. Any other section that describes behavior, structure, or patterns that have changed (MAJOR)
+
+**CLAUDE.md (CRITICAL — this guides all future development):**
+11. Code Conventions — do documented patterns match what's actually in the code? New patterns used but not documented? Documented patterns no longer followed? (CRITICAL)
+12. Logging section — are event import paths, logger patterns, and rules accurate? (CRITICAL)
+13. Resilience section — does it match the actual retry/rate-limit implementation? (MAJOR)
+14. Package Structure — does it match the actual directory layout? (MAJOR)
+15. Testing section — are markers, commands, and conventions current? (MEDIUM)
+16. Any other section that gives instructions that don't match reality (CRITICAL)
+
+**README.md:**
+17. Installation, usage, and getting-started instructions — still accurate? (MAJOR)
+18. Feature descriptions — do they match what's actually built? (MEDIUM)
+19. Links — any dead links or references to things that moved? (MINOR)
+
+**Key principle:** It is better to flag a false positive than to let documentation drift silently. When in doubt, flag it.
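
As an illustration of checklist item 1 (project structure drift), the comparison can be sketched as a set difference between the paths the spec lists and the files on disk. This is a minimal sketch, not part of the agent prompt: the function name `structure_drift` is hypothetical, and it assumes the documented paths have already been extracted from `DESIGN_SPEC.md` into a set of POSIX-style relative paths.

```python
from pathlib import Path


def structure_drift(documented: set[str], root: Path) -> dict[str, list[str]]:
    """Compare documented relative paths with the .py files under root.

    `documented` is assumed to be pre-parsed from the spec (hypothetical
    input); parsing the Markdown itself is out of scope for this sketch.
    """
    # Collect every Python file below root as a POSIX-style relative path.
    actual = {p.relative_to(root).as_posix() for p in root.rglob("*.py")}
    return {
        # Files that exist in the codebase but are not listed in the docs.
        "missing_from_docs": sorted(actual - documented),
        # Files the docs still list but that no longer exist.
        "missing_from_code": sorted(documented - actual),
    }
```

Either non-empty list would be worth flagging under the key principle above, since both directions of drift make the documented structure inaccurate.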
 
 The **logging-audit** agent prompt must check for these violations (see CLAUDE.md `## Logging`):
 

@@ -164,6 +164,42 @@ This captures committed-but-unpushed changes AND any uncommitted/untracked work
 | **logging-audit** | Any `src_py` changed | `pr-review-toolkit:code-reviewer` (custom prompt below) |
 | **resilience-audit** | Files in `src/ai_company/providers/` changed | `pr-review-toolkit:code-reviewer` (custom prompt below) |
 | **security-reviewer** | Files in `src/ai_company/api/`, `src/ai_company/security/`, `src/ai_company/tools/`, `src/ai_company/config/` changed, OR diff contains `subprocess`, `eval`, `exec`, `pickle`, `yaml.load`, auth/credential patterns | `everything-claude-code:security-reviewer` |
+| **docs-consistency** | **ALWAYS** — runs on every PR regardless of change type | `pr-review-toolkit:code-reviewer` (custom prompt below) |
+
+### Docs-consistency custom prompt
+
+The docs-consistency agent ensures project documentation never drifts from the codebase. It runs on **every PR** — code changes, config changes, docs-only changes, all of them.
+
+**What to check:**
+
+Read the current `DESIGN_SPEC.md`, `CLAUDE.md`, and `README.md` in full. Then compare them against the PR diff and the actual current state of the codebase. Flag anything that is now inaccurate, incomplete, or missing.
+
+**DESIGN_SPEC.md (CRITICAL — this is the project's source of truth):**
+1. §15.3 Project Structure — does it match the actual files/directories under `src/ai_company/`? Any new modules missing? Any listed files that no longer exist? (CRITICAL)
+2. §3.1 Agent Identity Card — does the config/runtime split documentation match the actual model code? (MAJOR)
+3. §15.4 Key Design Decisions — are technology choices and rationale still accurate? (MAJOR)
+4. §15.5 Pydantic Model Conventions — do the documented conventions match how models are actually written in code? Are "Adopted" vs "Planned" labels still accurate? (MAJOR)
+5. §10.2 Cost Tracking — does the implementation note match the actual `TokenUsage` and spending summary models? (MAJOR)
+6. §11.1.1 Tool Execution Model — does it match actual `ToolInvoker` behavior? (MAJOR)
+7. §15.2 Technology Stack — are versions, libraries, and rationale current? (MEDIUM)
+8. §9.2 Provider Configuration — are model IDs, provider capability examples, and config/runtime mapping still representative? (MEDIUM)
+9. §9.3 LiteLLM Integration — does the integration status match reality? (MEDIUM)
+10. Any other section that describes behavior, structure, or patterns that have changed (MAJOR)
+
+**CLAUDE.md (CRITICAL — this guides all future development):**
+11. Code Conventions — do documented patterns match what's actually in the code? New patterns used but not documented? Documented patterns no longer followed? (CRITICAL)
+12. Logging section — are event import paths, logger patterns, and rules accurate? (CRITICAL)
+13. Resilience section — does it match the actual retry/rate-limit implementation? (MAJOR)
+14. Package Structure — does it match the actual directory layout? (MAJOR)
+15. Testing section — are markers, commands, and conventions current? (MEDIUM)
+16. Any other section that gives instructions that don't match reality (CRITICAL)
+
+**README.md:**
+17. Installation, usage, and getting-started instructions — still accurate? (MAJOR)
+18. Feature descriptions — do they match what's actually built? (MEDIUM)
+19. Links — any dead links or references to things that moved? (MINOR)
+
+**Key principle:** It is better to flag a false positive than to let documentation drift silently. When in doubt, flag it.
 
 ### Logging-audit custom prompt
 

@@ -66,8 +66,10 @@ src/ai_company/
 - **PEP 758 except syntax**: use `except A, B:` (no parentheses) — ruff enforces this on Python 3.14
 - **Type hints**: all public functions, mypy strict mode
 - **Docstrings**: Google style, required on public classes/functions (enforced by ruff D rules)
-- **Immutability**: create new objects, never mutate existing ones
-- **Models**: Pydantic v2 (`BaseModel`, `model_validator`, `ConfigDict`)
+- **Immutability**: create new objects, never mutate existing ones. For `dict`/`list` fields in frozen Pydantic models, use `MappingProxyType` wrapping at construction (not `deepcopy` on access). Deep-copy only at system boundaries (e.g. passing data to `tool.execute()`, serializing for persistence).
+- **Config vs runtime state**: frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state that evolves (e.g. agent execution state, task progress). Never mix static config fields with mutable runtime fields in one model.
+- **Models**: Pydantic v2 (`BaseModel`, `model_validator`, `ConfigDict`). Planned conventions for new code: use `@computed_field` for derived values instead of storing + validating redundant fields; use `NotBlankStr` (from `core.types`) for non-optional identifier/name fields instead of manual whitespace validators. Existing models are being migrated incrementally.
+- **Async concurrency**: prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare `create_task`. Existing code is being migrated incrementally.
 - **Line length**: 88 characters (ruff)
 - **Functions**: < 50 lines, files < 800 lines
 - **Errors**: handle explicitly, never silently swallow
@@ -78,7 +80,7 @@ src/ai_company/
 - **Every module** with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
 - **Never** use `import logging` / `logging.getLogger()` / `print()` in application code
 - **Variable name**: always `logger` (not `_logger`, not `log`)
-- **Event names**: always use constants from `ai_company.observability.events`
+- **Event names**: always use constants from `ai_company.observability.events` (e.g. `PROVIDER_CALL_START`, `BUDGET_RECORD_ADDED`, `TOOL_INVOKE_START`). Import directly: `from ai_company.observability.events import EVENT_CONSTANT`
 - **Structured kwargs**: always `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`
 - **All error paths** must log at WARNING or ERROR with context before raising
 - **All state transitions** must log at INFO
@@ -131,5 +133,5 @@ src/ai_company/
 ## Dependencies
 
 - **Pinned**: all versions use `==` in `pyproject.toml`
-- **Groups**: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen, pydantic)
+- **Groups**: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen)
 - **Install**: `uv sync` installs everything (dev group is default)