Aureliolo · May 22, 2026
@@ -0,0 +1,134 @@
+---
+title: Research Mode
+description: A real research subsystem for synthetic organisations. A research brief drives query planning, multi-source retrieval (internal knowledge plus web, academic, and code search), source-credibility triage, deduplication, and citation-backed synthesis. Every run is recorded and replayable, and the deliverable's claims resolve to retrievable sources.
+---
+
+# Research Mode
+
+!!! warning "Designed behaviour; runtime in active development"
+    Research mode is wired at boot behind the `research.enabled` setting and a
+    configured provider plus model. Like the rest of the agent capability
+    layer, the per-task tool loader that surfaces the `research` tool to a
+    running agent lands with the broader runtime-wiring programme; the
+    subsystem, its MCP surface, and its eval lane are complete and tested.
+
+Today "an agent does research" is a curl in a sandbox. Research mode replaces
+that with a real pipeline: a research **brief** becomes a synthesised,
+citation-backed **report** whose every claim resolves to a retrievable
+source, produced through a recorded, replayable run.
+
+## Pipeline
+
+`ResearchService.run(brief, *, run_id, created_by)` drives six stages:
+
+1. **Query planning** (`QueryPlanner`): decompose the brief into
+   source-targeted sub-queries. Default `LlmQueryPlanner`.
+2. **Multi-source retrieval** (`RetrievalSource`), fanned out concurrently
+   via `asyncio.TaskGroup`: the internal knowledge substrate plus web,
+   academic, and code search. A single source failing is logged and skipped;
+   the run continues with the remaining candidates.
+3. **Source-credibility triage** (`CredibilityTriage`): score each candidate
+   and drop those below the brief's threshold. Default
+   `HybridCredibilityTriage` (deterministic heuristic prefilter, then LLM
+   triage on the survivors).
+4. **Deduplication** (`Deduplicator`): collapse near-duplicate findings.
+   Default `LexicalDeduplicator` (content-hash plus canonical-URL plus
+   token-shingle Jaccard; deterministic).
+5. **Synthesis** (`Synthesizer`): the LLM writes a report citing sources by
+   stable reference id; `CitationBinder` validates every cited id resolves to
+   a retained item. An unsourced claim raises `ResearchSynthesisError` rather
+   than emitting an unverifiable report.
+6. **Recording**: the run is persisted as a `ResearchRun`.
+
+Every step is pluggable through a protocol, a default strategy, the
+`build_research_service` factory, and a `ResearchConfig` discriminator
+(`settings/definitions/research.py`). Safe defaults ship; web, academic, and
+code retrieval use vendor-agnostic provider protocols with no bundled
+implementation (mirroring `WebSearchProvider`), so a source family fans out
+only once its provider is injected.
+
+## Data model
+
+Frozen Pydantic v2 models (`research/models.py`), all `extra="forbid"`:
+
+- `ResearchBrief` -- the input: question, project scope, source toggles,
+  credibility floor, and cost / wall-clock / sub-query limits.
+- `ResearchQueryPlan` / `SubQuery` -- the planner's decomposition.
+- `RetrievedItem` -- one candidate, carrying a stable `ref_id`, snippet,
+  content hash, relevance, and a `ResearchCitation`.
+- `ResearchCitation` -- resolves a claim to a source: for `knowledge` it
+  embeds the reused knowledge-substrate `Citation`; for `web` / `academic` /
+  `code` it carries a typed locator.
+- `SourceCredibility` -- a triage verdict.
+- `ResearchClaim` -- an assertion backed by at least one citation.
+- `ResearchReport` -- the deliverable: summary plus cited claims plus
+  methodology counts.
+- `ResearchRun` -- the persisted, replayable record: an immutable snapshot of
+  the brief plus the plan, retrieved items, credibility verdicts, and report.
+
+Identifiers are required fields, never random defaults: the agent tool and
+MCP handler derive `(brief_id, run_id)` deterministically from the request,
+so an identical request reproduces the same run id.
+
+## Recording and replay
+
+The run record is the single source of truth for retrieval. Two layers
+compose to make a whole run deterministically replayable:
+
+- **LLM calls** (planning, triage, synthesis) replay through the existing
+  `CassetteCompletionProvider`.
+- **Retrieval results** are persisted on the run as `retrieved_items`;
+  replay swaps each `RetrievalSource` for a `ReplayRetrievalSource` that
+  serves the recorded items by sub-query index. Since the plan comes from the
+  cassetted planner, the same plan reproduces the same routing.
+
+Triage's heuristic component, lexical dedup, and citation binding are
+deterministic, so given the cassette plus the run record the report is
+byte-stable. The default path needs no embedder.
+
+## Persistence
+
+A single `research_runs` table stores each run as a JSON blob with
+denormalised `brief_id` / `project_id` / `status` / `created_at` columns for
+filtering and ordering. `ResearchRunRepository` composes the
+`IdKeyedRepository` and `FilteredQueryRepository` generics; SQLite and
+Postgres implementations are conformance-tested in lockstep.
+
+## Surfaces
+
+- **Agent tool** `research` (`research/tool.py`): runs a brief and returns
+  the cited report. Built per task by `ResearchToolFactory`.
+- **MCP** domain `research:run` / `research:get` / `research:list`
+  (`meta/mcp/domains/research.py`; handlers route through
+  `ResearchService`), returning 503 when the service is not wired.
+
+A REST controller and dashboard surface for operator-driven research are a
+follow-up; the agent tool, MCP surface, and eval lane cover the #1989
+acceptance.
+
+## Security (SEC-1)
+
+All retrieved external content is untrusted. Untrusted text is wrapped via
+`wrap_untrusted(TAG_RESEARCH_SOURCE, ...)` only where it enters a prompt
+(the brief in the planning prompt, source snippets in the triage and
+synthesis prompts), never at storage. The synthesiser and triage system prompts carry the
+untrusted-content directive. The research action is classified `research:run`
+at the MEDIUM risk tier; the underlying egress is separately gated at HIGH
+via `external_data:request`.
+
+## Evaluation
+
+A `kind="research"` eval brief carries a `ResearchBriefSpec` (question,
+expected claims, credibility floor, judged rubric). `grade_research_run`
+(`evals/scoring/research.py`) scores a run deterministically on claim
+coverage, citation resolution (every claim citation resolves to a retrieved
+source -- the acceptance criterion), and cited-source credibility. The lane
+records a run, replays it, asserts the report is byte-identical, and grades
+it.
+
+## Acceptance
+
+Given a research brief, the org produces a synthesised, citation-backed
+report whose claims resolve to retrievable sources, and the run is
+replayable. Validated by the research eval lane and the service-level replay
+test.
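
Step 4 of the pipeline above names token-shingle Jaccard as one of the `LexicalDeduplicator` signals, but the implementation is not part of this diff. The following is a minimal stdlib sketch of that one signal, under stated assumptions: the names `shingles`, `jaccard`, and `dedupe`, the shingle width `k=3`, and the `0.8` threshold are all hypothetical, and the content-hash and canonical-URL checks are left out.

```python
import re

# Same tokenisation idea as the grader's _TOKEN_RE: runs of letters/digits.
_TOKEN_RE = re.compile(r"[^\W_]+")


def shingles(text: str, k: int = 3) -> frozenset[tuple[str, ...]]:
    """Return the set of k-token shingles of ``text`` (case-folded)."""
    tokens = _TOKEN_RE.findall(text.casefold())
    if len(tokens) < k:
        # Short texts collapse to one shingle so they remain comparable.
        return frozenset({tuple(tokens)}) if tokens else frozenset()
    return frozenset(tuple(tokens[i : i + k]) for i in range(len(tokens) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(snippets: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first snippet of each near-duplicate group, in input order."""
    kept: list[tuple[str, frozenset]] = []
    for snippet in snippets:
        candidate = shingles(snippet)
        # Retain the snippet only if it is dissimilar to everything kept so far.
        if all(jaccard(candidate, seen) < threshold for _, seen in kept):
            kept.append((snippet, candidate))
    return [snippet for snippet, _ in kept]
```

Because the comparison uses only set operations on the input text and preserves input order, the result is deterministic, which is the property the replay story relies on.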
@@ -58,10 +58,11 @@
 
 
 class BriefKind(StrEnum):
-    """Discriminator between executable and judged briefs."""
+    """Discriminator between executable, judged, and research briefs."""
 
     EXECUTABLE = "executable"
     JUDGED = "judged"
+    RESEARCH = "research"
 
 
 class BriefPriority(StrEnum):
@@ -189,6 +190,22 @@ def _weights_sum_to_one(self) -> Self:
         return self
 
 
+class ResearchBriefSpec(BaseModel):
+    """Research-lane payload for a ``kind="research"`` brief.
+
+    Carries the research question, the claims a competent run is expected
+    to surface (graded for coverage), the source-credibility floor, and a
+    judged rubric (with its reference answer) used to score report quality.
+    """
+
+    model_config = ConfigDict(frozen=True, extra="forbid", allow_inf_nan=False)
+
+    question: NotBlankStr
+    expected_claims: tuple[NotBlankStr, ...] = Field(min_length=1)
+    min_credibility: float = Field(default=0.5, ge=0.0, le=1.0)
+    rubric: JudgedRubric
+
+
 class Brief(BaseModel):
     """One exam item.
 
@@ -216,36 +233,41 @@ class Brief(BaseModel):
     limits: LimitsSpec
     checks: ExecutableChecks | None = None
     rubric: JudgedRubric | None = None
+    research_spec: ResearchBriefSpec | None = None
+
+    def _require(self, *, present: object, name: str) -> None:
+        """Raise when a required per-kind payload block is missing."""
+        if present is None:
+            msg = (
+                f"Brief {self.brief_id!r}: kind={self.kind.value} "
+                f"requires a {name!r} block"
+            )
+            raise ValueError(msg)
+
+    def _forbid(self, *, present: object, name: str) -> None:
+        """Raise when a per-kind payload block is set but not allowed."""
+        if present is not None:
+            msg = (
+                f"Brief {self.brief_id!r}: kind={self.kind.value} "
+                f"must not carry a {name!r} block"
+            )
+            raise ValueError(msg)
 
     @model_validator(mode="after")
     def _kind_matches_payload(self) -> Self:
-        """Enforce kind / (checks XOR rubric) consistency."""
+        """Enforce that exactly the kind's payload block is populated."""
         if self.kind is BriefKind.EXECUTABLE:
-            if self.checks is None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "requires a 'checks' block"
-                )
-                raise ValueError(msg)
-            if self.rubric is not None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "must not carry a 'rubric' block"
-                )
-                raise ValueError(msg)
+            self._require(present=self.checks, name="checks")
+            self._forbid(present=self.rubric, name="rubric")
+            self._forbid(present=self.research_spec, name="research_spec")
+        elif self.kind is BriefKind.JUDGED:
+            self._require(present=self.rubric, name="rubric")
+            self._forbid(present=self.checks, name="checks")
+            self._forbid(present=self.research_spec, name="research_spec")
         else:
-            if self.rubric is None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "requires a 'rubric' block"
-                )
-                raise ValueError(msg)
-            if self.checks is not None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "must not carry a 'checks' block"
-                )
-                raise ValueError(msg)
+            self._require(present=self.research_spec, name="research_spec")
+            self._forbid(present=self.checks, name="checks")
+            self._forbid(present=self.rubric, name="rubric")
         return self
 
 
@@ -264,6 +286,7 @@ def _kind_matches_payload(self) -> Self:
     "HiddenCheckSpec",
     "JudgedRubric",
     "LimitsSpec",
+    "ResearchBriefSpec",
     "RubricDimension",
     "RubricGradeType",
 ]
@@ -0,0 +1,110 @@
+"""Deterministic grading for ``kind=research`` briefs.
+
+Scores a completed :class:`~synthorg.research.models.ResearchRun` against a
+:class:`~evals.models.brief.ResearchBriefSpec` on three structural axes:
+
+* **claim coverage** -- fraction of the brief's expected claims surfaced by
+  the report (token-overlap match);
+* **citation resolution** -- fraction of claim citations that resolve to a
+  retrieved item in the run (the #1989 acceptance: every claim resolves to
+  a retrievable source);
+* **source credibility** -- fraction of cited sources whose triage score
+  clears the brief's credibility floor.
+
+These are deterministic (no LLM judge), so the lane is replay-stable: the
+same recorded run grades identically every time.
+"""
+
+import re
+from typing import TYPE_CHECKING, Final
+
+from pydantic import BaseModel, ConfigDict, Field
+
+if TYPE_CHECKING:
+    from evals.models.brief import ResearchBriefSpec
+    from synthorg.research.models import ResearchRun
+
+_TOKEN_RE: Final[re.Pattern[str]] = re.compile(r"[^\W_]+", re.UNICODE)
+
+COVERAGE_TOKEN_OVERLAP: Final[float] = 0.5
+"""Fraction of an expected claim's tokens that must appear in a report claim
+for the expected claim to count as covered."""
+
+
+class ResearchScore(BaseModel):
+    """Structural grade for one research run."""
+
+    model_config = ConfigDict(frozen=True, allow_inf_nan=False, extra="forbid")
+
+    claim_coverage: float = Field(ge=0.0, le=1.0)
+    citation_resolution: float = Field(ge=0.0, le=1.0)
+    source_credibility: float = Field(ge=0.0, le=1.0)
+    overall: float = Field(ge=0.0, le=1.0)
+    passed: bool
+
+
+def _tokens(text: str) -> frozenset[str]:
+    return frozenset(_TOKEN_RE.findall(text.casefold()))
+
+
+def _coverage(run: ResearchRun, spec: ResearchBriefSpec) -> float:
+    report = run.report
+    if report is None:
+        return 0.0
+    claim_token_sets = [_tokens(claim.text) for claim in report.claims]
+    covered = 0
+    for expected in spec.expected_claims:
+        wanted = _tokens(expected)
+        if not wanted:
+            continue
+        if any(
+            len(wanted & claim_tokens) / len(wanted) >= COVERAGE_TOKEN_OVERLAP
+            for claim_tokens in claim_token_sets
+        ):
+            covered += 1
+    return covered / len(spec.expected_claims)
+
+
+def _citation_resolution(run: ResearchRun) -> float:
+    report = run.report
+    if report is None:
+        return 0.0
+    retrieved = {item.ref_id for item in run.retrieved_items}
+    refs = [c.ref_id for claim in report.claims for c in claim.citations]
+    if not refs:
+        return 0.0
+    return sum(1 for ref in refs if ref in retrieved) / len(refs)
+
+
+def _source_credibility(run: ResearchRun, spec: ResearchBriefSpec) -> float:
+    report = run.report
+    if report is None:
+        return 0.0
+    cited = {c.ref_id for claim in report.claims for c in claim.citations}
+    if not cited:
+        return 0.0
+    score_by_ref = {v.ref_id: v.score for v in run.credibility}
+    clear = sum(
+        1 for ref in cited if score_by_ref.get(ref, 0.0) >= spec.min_credibility
+    )
+    return clear / len(cited)
+
+
+def grade_research_run(run: ResearchRun, spec: ResearchBriefSpec) -> ResearchScore:
+    """Grade a completed research run against its brief spec.
+
+    The run passes when every claim citation resolves to a retrieved source
+    (the hard acceptance criterion); coverage and credibility are quality
+    signals folded into the overall score.
+    """
+    coverage = _coverage(run, spec)
+    resolution = _citation_resolution(run)
+    credibility = _source_credibility(run, spec)
+    overall = (coverage + resolution + credibility) / 3.0
+    return ResearchScore(
+        claim_coverage=coverage,
+        citation_resolution=resolution,
+        source_credibility=credibility,
+        overall=overall,
+        passed=resolution >= 1.0,
+    )
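
To make the claim-coverage rule in `_coverage` concrete, here is a small worked example that reuses the same regex and threshold from the file above. The helper name `is_covered` and the sample claims are illustrative assumptions; the logic is meant to mirror the per-claim check, not replace it.

```python
import re

_TOKEN_RE = re.compile(r"[^\W_]+", re.UNICODE)
COVERAGE_TOKEN_OVERLAP = 0.5


def tokens(text: str) -> frozenset[str]:
    """Case-folded set of word tokens, as in the grader's ``_tokens``."""
    return frozenset(_TOKEN_RE.findall(text.casefold()))


def overlap(expected: str, claim: str) -> float:
    """Fraction of the expected claim's tokens that appear in the claim."""
    wanted = tokens(expected)
    if not wanted:
        return 0.0
    return len(wanted & tokens(claim)) / len(wanted)


def is_covered(expected: str, report_claims: list[str]) -> bool:
    """True when any report claim reaches the overlap threshold."""
    return any(
        overlap(expected, claim) >= COVERAGE_TOKEN_OVERLAP
        for claim in report_claims
    )


expected = "Rust guarantees memory safety without garbage collection"
weak = "Rust achieves memory safety at compile time"
strong = "Rust provides memory safety without a garbage collector"

# The expected claim has 7 distinct tokens.
# weak shares 3 of them (rust, memory, safety): 3/7 is below 0.5.
# strong shares 5 of them (adds without, garbage): 5/7 clears 0.5.
print(is_covered(expected, [weak]))
print(is_covered(expected, [weak, strong]))
```

The overlap is measured against the expected claim only, so a long report claim is not penalised for extra tokens. Note also that `passed` in `grade_research_run` depends solely on citation resolution: a run with full resolution but zero coverage and zero credibility has `overall` of one third and still passes.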
@@ -81,6 +81,18 @@ ENFORCED KnowledgeRetriever #1988 -- constructed inside knowledge/factory.py::bu
 ENFORCED build_knowledge_tool_factory #1988 -- called in api/app.py::_wire_knowledge_engine; per-task source of SearchKnowledgeTool + IngestKnowledgeTool
 ENFORCED SearchKnowledgeTool #1988 -- constructed by knowledge/tool_factory.py::KnowledgeToolFactory.build_tools per task with the agent's project binding
 ENFORCED IngestKnowledgeTool #1988 -- constructed by knowledge/tool_factory.py::KnowledgeToolFactory.build_tools per task with the agent's project binding
+ENFORCED build_research_service #1989 -- called in api/app.py::_wire_research_engine; assembles planner + sources + triage + dedup + synthesiser into the ResearchService
+ENFORCED ResearchService #1989 -- constructed by research/factory.py::build_research_service; attached to AppState by _wire_research_engine
+ENFORCED LlmQueryPlanner #1989 -- constructed inside research/factory.py::build_research_service; decomposes a brief into source-targeted sub-queries
+ENFORCED HybridCredibilityTriage #1989 -- constructed inside research/factory.py::_build_triage; heuristic prefilter then LLM triage on survivors
+ENFORCED LexicalDeduplicator #1989 -- constructed inside research/factory.py::_build_deduplicator; deterministic hash/url/shingle dedup
+ENFORCED LlmSynthesizer #1989 -- constructed inside research/factory.py::build_research_service; produces the cited report, the binder validates refs
+ENFORCED KnowledgeRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources; wraps KnowledgeService for the internal source
+ENFORCED WebRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources when a WebSearchProvider is injected
+ENFORCED AcademicRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources when an AcademicSearchProvider is injected
+ENFORCED CodeRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources when a CodeSearchProvider is injected
+ENFORCED build_research_tool_factory #1989 -- called in api/app.py::_wire_research_engine; per-task source of the ResearchTool
+ENFORCED ResearchTool #1989 -- constructed by research/tool_factory.py::ResearchToolFactory.build_tools per task with the agent's project + identity binding
 ENFORCED build_toolsmith #1995 -- called by api/app.py::_build_toolsmith_runtime behind tool_creation_enabled + provider switch; wires the self-extending toolkit runtime
 ENFORCED ToolsmithService #1995 -- constructed by meta/toolsmith/factory.py::build_toolsmith; orchestrates gap detection -> author -> guard -> apply at the TOOL_CREATION altitude
 ENFORCED RingBufferCapabilityGapStore #1995 -- constructed by meta/toolsmith/factory.py::build_toolsmith; ring-buffered capability-gap sink + recurrence detector

@@ -83,6 +83,11 @@
     # factory + tool-factory construction lets the manifest track the
     # knowledge substrate's wiring (#1988).
     "src/synthorg/knowledge/",
+    # research/ is reached at boot via api/app.py::_wire_research_engine
+    # (build_research_service / build_research_tool_factory); counting its
+    # factory + strategy + tool-factory construction lets the manifest track
+    # the research subsystem's wiring (#1989).
+    "src/synthorg/research/",
 )