134 changes: 134 additions & 0 deletions docs/design/research-mode.md
@@ -0,0 +1,134 @@
---
title: Research Mode
description: A real research subsystem for synthetic organisations. A research brief drives query planning, multi-source retrieval (internal knowledge plus web, academic, and code search), source-credibility triage, deduplication, and citation-backed synthesis. Every run is recorded and replayable, and the deliverable's claims resolve to retrievable sources.
---

# Research Mode

!!! warning "Designed behaviour; runtime in active development"
    Research mode is wired at boot behind the `research.enabled` setting and a
    configured provider plus model. Like the rest of the agent capability
    layer, the per-task tool loader that surfaces the `research` tool to a
    running agent lands with the broader runtime-wiring programme; the
    subsystem, its MCP surface, and its eval lane are complete and tested.

Today "an agent does research" is a curl in a sandbox. Research mode replaces
that with a real pipeline: a research **brief** becomes a synthesised,
citation-backed **report** whose every claim resolves to a retrievable
source, produced through a recorded, replayable run.

## Pipeline

`ResearchService.run(brief, *, run_id, created_by)` drives six stages:

1. **Query planning** (`QueryPlanner`): decompose the brief into
source-targeted sub-queries. Default `LlmQueryPlanner`.
2. **Multi-source retrieval** (`RetrievalSource`), fanned out concurrently
via `asyncio.TaskGroup`: the internal knowledge substrate plus web,
academic, and code search. A single source failing is logged and skipped;
the run continues with the remaining candidates.
3. **Source-credibility triage** (`CredibilityTriage`): score each candidate
and drop those below the brief's threshold. Default
`HybridCredibilityTriage` (deterministic heuristic prefilter, then LLM
triage on the survivors).
4. **Deduplication** (`Deduplicator`): collapse near-duplicate findings.
Default `LexicalDeduplicator` (content-hash plus canonical-URL plus
token-shingle Jaccard; deterministic).
5. **Synthesis** (`Synthesizer`): the LLM writes a report citing sources by
stable reference id; `CitationBinder` validates every cited id resolves to
a retained item. An unsourced claim raises `ResearchSynthesisError` rather
than emitting an unverifiable report.
6. **Recording**: the run is persisted as a `ResearchRun`.

Every step is pluggable through a protocol, a default strategy, the
`build_research_service` factory, and a `ResearchConfig` discriminator
(`settings/definitions/research.py`). Safe defaults ship; web, academic, and
code retrieval use vendor-agnostic provider protocols with no bundled
implementation (mirroring `WebSearchProvider`), so a family fans out only
once a provider is injected.
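
The protocol / default strategy / factory / discriminator pattern described above might look like this in miniature. Everything here is illustrative: `DedupConfig`, its `strategy` values, `ExactDeduplicator`, `NoopDeduplicator`, and `build_deduplicator` are hypothetical names, and the real `ResearchConfig` and `build_research_service` may be shaped differently.

```python
from dataclasses import dataclass
from typing import Literal, Protocol


class Deduplicator(Protocol):
    """Hypothetical one-method protocol for the dedup stage."""

    def dedupe(self, snippets: list[str]) -> list[str]: ...


class ExactDeduplicator:
    """Illustrative default: drops exact repeats, keeps first-seen order."""

    def dedupe(self, snippets: list[str]) -> list[str]:
        return list(dict.fromkeys(snippets))


class NoopDeduplicator:
    """Illustrative alternative: keeps every snippet."""

    def dedupe(self, snippets: list[str]) -> list[str]:
        return list(snippets)


@dataclass(frozen=True)
class DedupConfig:
    # The discriminator: a config value selects the strategy.
    strategy: Literal["exact", "noop"] = "exact"


_STRATEGIES: dict[str, type] = {
    "exact": ExactDeduplicator,
    "noop": NoopDeduplicator,
}


def build_deduplicator(config: DedupConfig) -> Deduplicator:
    # Unknown discriminator values fail loudly instead of falling back.
    try:
        return _STRATEGIES[config.strategy]()
    except KeyError:
        raise ValueError(f"unknown dedup strategy: {config.strategy!r}") from None
```

Callers depend only on the protocol, so swapping a strategy is a configuration change rather than a code change.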

## Data model

Frozen Pydantic v2 models (`research/models.py`), all `extra="forbid"`:

- `ResearchBrief` -- the input: question, project scope, source toggles,
credibility floor, and cost / wall-clock / sub-query limits.
- `ResearchQueryPlan` / `SubQuery` -- the planner's decomposition.
- `RetrievedItem` -- one candidate, carrying a stable `ref_id`, snippet,
content hash, relevance, and a `ResearchCitation`.
- `ResearchCitation` -- resolves a claim to a source: for `knowledge` it
embeds the reused knowledge-substrate `Citation`; for `web` / `academic` /
`code` it carries a typed locator.
- `SourceCredibility` -- a triage verdict.
- `ResearchClaim` -- an assertion backed by at least one citation.
- `ResearchReport` -- the deliverable: summary plus cited claims plus
methodology counts.
- `ResearchRun` -- the persisted, replayable record: an immutable snapshot of
the brief plus the plan, retrieved items, credibility verdicts, and report.

Identifiers are required fields, never random defaults: the agent tool and
MCP handler derive `(brief_id, run_id)` deterministically from the request,
so an identical request reproduces the same run id.
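
One way to derive identifiers deterministically is to hash a canonical serialisation of the request. The sketch below assumes a JSON-serialisable request dict and a SHA-256 digest; the `derive_ids` helper, the id prefixes, and the digest slicing are invented for illustration and are not taken from the tool or the MCP handler.

```python
import hashlib
import json


def derive_ids(request: dict[str, object]) -> tuple[str, str]:
    """Return (brief_id, run_id) derived only from the request content."""
    # sort_keys plus fixed separators give one canonical byte string per
    # logical request, independent of key order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Hypothetical layout: two disjoint slices of the same digest.
    return f"brief-{digest[:16]}", f"run-{digest[16:32]}"
```

Because nothing random or time-dependent enters the hash, replaying the same request lands on the same run record.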

## Recording and replay

The run record is the single source of truth for retrieval. Two layers
compose to make a whole run deterministically replayable:

- **LLM calls** (planning, triage, synthesis) replay through the existing
`CassetteCompletionProvider`.
- **Retrieval results** are persisted on the run as `retrieved_items`;
replay swaps each `RetrievalSource` for a `ReplayRetrievalSource` that
serves the recorded items by sub-query index. Since the plan comes from the
cassetted planner, the same plan reproduces the same routing.
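
A replay source of the kind described in the second bullet could be as small as the following sketch. It assumes each recorded item remembers the index of the sub-query that produced it; `RecordedItem`, its fields, and `ReplaySource` are stand-ins invented here, not the actual `RetrievedItem` or `ReplayRetrievalSource`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedItem:
    """Hypothetical stand-in for a persisted retrieved item."""

    sub_query_index: int
    ref_id: str
    snippet: str


class ReplaySource:
    """Serves recorded items by sub-query index instead of hitting a backend."""

    def __init__(self, recorded: list[RecordedItem]) -> None:
        self._by_index: dict[int, list[RecordedItem]] = {}
        for item in recorded:
            # Preserve the recorded order within each sub-query.
            self._by_index.setdefault(item.sub_query_index, []).append(item)

    def retrieve(self, sub_query_index: int) -> list[RecordedItem]:
        # Return a copy so callers cannot mutate the recording.
        return list(self._by_index.get(sub_query_index, []))
```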

Triage's heuristic component, lexical dedup, and citation binding are
deterministic, so given the cassette plus the run record the report is
byte-stable. The default path needs no embedder.
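
To see why lexical dedup needs no embedder and stays deterministic, consider a token-shingle Jaccard comparison. The sketch below covers only the shingle part (not the content-hash or canonical-URL checks); the shingle size of 3, the 0.8 threshold, and the function names are assumptions chosen for the example, not values taken from `LexicalDeduplicator`.

```python
def shingles(text: str, size: int = 3) -> frozenset[tuple[str, ...]]:
    """Return the set of overlapping token n-grams in the text."""
    tokens = text.casefold().split()
    if not tokens:
        return frozenset()
    if len(tokens) < size:
        # Short texts collapse to a single shingle of all their tokens.
        return frozenset({tuple(tokens)})
    return frozenset(
        tuple(tokens[i : i + size]) for i in range(len(tokens) - size + 1)
    )


def jaccard(a: frozenset, b: frozenset) -> float:
    """Intersection over union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(snippets: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first of each group of near-duplicate snippets."""
    kept: list[str] = []
    kept_shingles: list[frozenset] = []
    for snippet in snippets:
        current = shingles(snippet)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue
        kept.append(snippet)
        kept_shingles.append(current)
    return kept
```

The result depends only on the input order and the text, so replaying the same retrieved items yields the same survivors.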

## Persistence

A single `research_runs` table stores each run as a JSON blob with
denormalised `brief_id` / `project_id` / `status` / `created_at` columns for
filtering and ordering. `ResearchRunRepository` composes the
`IdKeyedRepository` and `FilteredQueryRepository` generics; SQLite and
Postgres implementations are conformance-tested in lockstep.

## Surfaces

- **Agent tool** `research` (`research/tool.py`): runs a brief and returns
the cited report. Built per task by `ResearchToolFactory`.
- **MCP** domain `research:run` / `research:get` / `research:list`
(`meta/mcp/domains/research.py`): handlers route through
`ResearchService` and return a 503 when the service is not wired.

A REST controller and dashboard surface for operator-driven research are a
follow-up; the agent tool, MCP surface, and eval lane cover the #1989
acceptance criterion.

## Security (SEC-1)

All retrieved external content is untrusted. Snippets are wrapped via
`wrap_untrusted(TAG_RESEARCH_SOURCE, ...)` only where they enter a prompt
(planning prompt for the brief, triage and synthesis prompts for sources),
never at storage. The synthesiser and triage system prompts carry the
untrusted-content directive. The research action is classified `research:run`
at the MEDIUM risk tier; the underlying egress is separately gated at HIGH
via `external_data:request`.

## Evaluation

A `kind="research"` eval brief carries a `ResearchBriefSpec` (question,
expected claims, credibility floor, judged rubric). `grade_research_run`
(`evals/scoring/research.py`) scores a run deterministically on claim
coverage, citation resolution (every claim citation resolves to a retrieved
source -- the acceptance criterion), and cited-source credibility. The lane
records a run, replays it, asserts the report is byte-identical, and grades
it.
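
The claim-coverage axis is a token-overlap match. The sketch below mirrors the tokeniser regex and the 0.5 overlap threshold that appear in `evals/scoring/research.py`, but works on plain strings; the `is_covered` helper and the sample claims are invented for illustration.

```python
import re

# Same token pattern and threshold as the scoring module.
_TOKEN_RE = re.compile(r"[^\W_]+", re.UNICODE)
COVERAGE_TOKEN_OVERLAP = 0.5


def tokens(text: str) -> frozenset[str]:
    """Case-folded alphanumeric tokens, underscores excluded."""
    return frozenset(_TOKEN_RE.findall(text.casefold()))


def is_covered(expected: str, report_claims: list[str]) -> bool:
    """True when some report claim contains enough of the expected tokens."""
    wanted = tokens(expected)
    if not wanted:
        return False
    return any(
        len(wanted & tokens(claim)) / len(wanted) >= COVERAGE_TOKEN_OVERLAP
        for claim in report_claims
    )


claims = ["Rust's ownership model prevents data races at compile time"]
print(is_covered("Rust prevents data races", claims))   # all 4 tokens match
print(is_covered("Go has garbage collection", claims))  # no tokens match
```

The overlap is measured against the expected claim's tokens only, so a long report claim is not penalised for containing extra words.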

## Acceptance

Given a research brief, the org produces a synthesised, citation-backed
report whose claims resolve to retrievable sources, and the run is
replayable. Validated by the research eval lane and the service-level replay
test.
75 changes: 49 additions & 26 deletions evals/models/brief.py
@@ -58,10 +58,11 @@


 class BriefKind(StrEnum):
-    """Discriminator between executable and judged briefs."""
+    """Discriminator between executable, judged, and research briefs."""

     EXECUTABLE = "executable"
     JUDGED = "judged"
+    RESEARCH = "research"


 class BriefPriority(StrEnum):
@@ -189,6 +190,22 @@ def _weights_sum_to_one(self) -> Self:
         return self


+class ResearchBriefSpec(BaseModel):
+    """Research-lane payload for a ``kind="research"`` brief.
+
+    Carries the research question, the claims a competent run is expected
+    to surface (graded for coverage), the source-credibility floor, and a
+    judged rubric (with its reference answer) used to score report quality.
+    """
+
+    model_config = ConfigDict(frozen=True, extra="forbid", allow_inf_nan=False)
+
+    question: NotBlankStr
+    expected_claims: tuple[NotBlankStr, ...] = Field(min_length=1)
+    min_credibility: float = Field(default=0.5, ge=0.0, le=1.0)
+    rubric: JudgedRubric
+
+
 class Brief(BaseModel):
     """One exam item.

@@ -216,36 +233,41 @@ class Brief(BaseModel):
     limits: LimitsSpec
     checks: ExecutableChecks | None = None
     rubric: JudgedRubric | None = None
+    research_spec: ResearchBriefSpec | None = None

+    def _require(self, *, present: object, name: str) -> None:
+        """Raise when a required per-kind payload block is missing."""
+        if present is None:
+            msg = (
+                f"Brief {self.brief_id!r}: kind={self.kind.value} "
+                f"requires a {name!r} block"
+            )
+            raise ValueError(msg)
+
+    def _forbid(self, *, present: object, name: str) -> None:
+        """Raise when a per-kind payload block is set but not allowed."""
+        if present is not None:
+            msg = (
+                f"Brief {self.brief_id!r}: kind={self.kind.value} "
+                f"must not carry a {name!r} block"
+            )
+            raise ValueError(msg)
+
     @model_validator(mode="after")
     def _kind_matches_payload(self) -> Self:
-        """Enforce kind / (checks XOR rubric) consistency."""
+        """Enforce that exactly the kind's payload block is populated."""
         if self.kind is BriefKind.EXECUTABLE:
-            if self.checks is None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "requires a 'checks' block"
-                )
-                raise ValueError(msg)
-            if self.rubric is not None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "must not carry a 'rubric' block"
-                )
-                raise ValueError(msg)
+            self._require(present=self.checks, name="checks")
+            self._forbid(present=self.rubric, name="rubric")
+            self._forbid(present=self.research_spec, name="research_spec")
+        elif self.kind is BriefKind.JUDGED:
+            self._require(present=self.rubric, name="rubric")
+            self._forbid(present=self.checks, name="checks")
+            self._forbid(present=self.research_spec, name="research_spec")
         else:
-            if self.rubric is None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "requires a 'rubric' block"
-                )
-                raise ValueError(msg)
-            if self.checks is not None:
-                msg = (
-                    f"Brief {self.brief_id!r}: kind={self.kind.value} "
-                    "must not carry a 'checks' block"
-                )
-                raise ValueError(msg)
+            self._require(present=self.research_spec, name="research_spec")
+            self._forbid(present=self.checks, name="checks")
+            self._forbid(present=self.rubric, name="rubric")
         return self


@@ -264,6 +286,7 @@ def _kind_matches_payload(self) -> Self:
     "HiddenCheckSpec",
     "JudgedRubric",
     "LimitsSpec",
+    "ResearchBriefSpec",
     "RubricDimension",
     "RubricGradeType",
 ]
110 changes: 110 additions & 0 deletions evals/scoring/research.py
@@ -0,0 +1,110 @@
"""Deterministic grading for ``kind=research`` briefs.

Scores a completed :class:`~synthorg.research.models.ResearchRun` against a
:class:`~evals.models.brief.ResearchBriefSpec` on three structural axes:

* **claim coverage** -- fraction of the brief's expected claims surfaced by
  the report (token-overlap match);
* **citation resolution** -- fraction of claim citations that resolve to a
  retrieved item in the run (the #1989 acceptance: every claim resolves to
  a retrievable source);
* **source credibility** -- fraction of cited sources whose triage score
  clears the brief's credibility floor.

These are deterministic (no LLM judge), so the lane is replay-stable: the
same recorded run grades identically every time.
"""

import re
from typing import TYPE_CHECKING, Final

from pydantic import BaseModel, ConfigDict, Field

if TYPE_CHECKING:
    from evals.models.brief import ResearchBriefSpec
    from synthorg.research.models import ResearchRun

_TOKEN_RE: Final[re.Pattern[str]] = re.compile(r"[^\W_]+", re.UNICODE)

COVERAGE_TOKEN_OVERLAP: Final[float] = 0.5
"""Fraction of an expected claim's tokens that must appear in a report claim
for the expected claim to count as covered."""


class ResearchScore(BaseModel):
    """Structural grade for one research run."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False, extra="forbid")

    claim_coverage: float = Field(ge=0.0, le=1.0)
    citation_resolution: float = Field(ge=0.0, le=1.0)
    source_credibility: float = Field(ge=0.0, le=1.0)
    overall: float = Field(ge=0.0, le=1.0)
    passed: bool


def _tokens(text: str) -> frozenset[str]:
    return frozenset(_TOKEN_RE.findall(text.casefold()))


def _coverage(run: ResearchRun, spec: ResearchBriefSpec) -> float:
    report = run.report
    if report is None:
        return 0.0
    claim_token_sets = [_tokens(claim.text) for claim in report.claims]
    covered = 0
    for expected in spec.expected_claims:
        wanted = _tokens(expected)
        if not wanted:
            continue
        if any(
            len(wanted & claim_tokens) / len(wanted) >= COVERAGE_TOKEN_OVERLAP
            for claim_tokens in claim_token_sets
        ):
            covered += 1
    return covered / len(spec.expected_claims)


def _citation_resolution(run: ResearchRun) -> float:
    report = run.report
    if report is None:
        return 0.0
    retrieved = {item.ref_id for item in run.retrieved_items}
    refs = [c.ref_id for claim in report.claims for c in claim.citations]
    if not refs:
        return 0.0
    return sum(1 for ref in refs if ref in retrieved) / len(refs)


def _source_credibility(run: ResearchRun, spec: ResearchBriefSpec) -> float:
    report = run.report
    if report is None:
        return 0.0
    cited = {c.ref_id for claim in report.claims for c in claim.citations}
    if not cited:
        return 0.0
    score_by_ref = {v.ref_id: v.score for v in run.credibility}
    clear = sum(
        1 for ref in cited if score_by_ref.get(ref, 0.0) >= spec.min_credibility
    )
    return clear / len(cited)


def grade_research_run(run: ResearchRun, spec: ResearchBriefSpec) -> ResearchScore:
    """Grade a completed research run against its brief spec.

    The run passes when every claim citation resolves to a retrieved source
    (the hard acceptance criterion); coverage and credibility are quality
    signals folded into the overall score.
    """
    coverage = _coverage(run, spec)
    resolution = _citation_resolution(run)
    credibility = _source_credibility(run, spec)
    overall = (coverage + resolution + credibility) / 3.0
    return ResearchScore(
        claim_coverage=coverage,
        citation_resolution=resolution,
        source_credibility=credibility,
        overall=overall,
        passed=resolution >= 1.0,
    )
12 changes: 12 additions & 0 deletions scripts/_ghost_wiring_manifest.txt
@@ -81,6 +81,18 @@ ENFORCED KnowledgeRetriever #1988 -- constructed inside knowledge/factory.py::bu
 ENFORCED build_knowledge_tool_factory #1988 -- called in api/app.py::_wire_knowledge_engine; per-task source of SearchKnowledgeTool + IngestKnowledgeTool
 ENFORCED SearchKnowledgeTool #1988 -- constructed by knowledge/tool_factory.py::KnowledgeToolFactory.build_tools per task with the agent's project binding
 ENFORCED IngestKnowledgeTool #1988 -- constructed by knowledge/tool_factory.py::KnowledgeToolFactory.build_tools per task with the agent's project binding
+ENFORCED build_research_service #1989 -- called in api/app.py::_wire_research_engine; assembles planner + sources + triage + dedup + synthesiser into the ResearchService
+ENFORCED ResearchService #1989 -- constructed by research/factory.py::build_research_service; attached to AppState by _wire_research_engine
+ENFORCED LlmQueryPlanner #1989 -- constructed inside research/factory.py::build_research_service; decomposes a brief into source-targeted sub-queries
+ENFORCED HybridCredibilityTriage #1989 -- constructed inside research/factory.py::_build_triage; heuristic prefilter then LLM triage on survivors
+ENFORCED LexicalDeduplicator #1989 -- constructed inside research/factory.py::_build_deduplicator; deterministic hash/url/shingle dedup
+ENFORCED LlmSynthesizer #1989 -- constructed inside research/factory.py::build_research_service; produces the cited report, the binder validates refs
+ENFORCED KnowledgeRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources; wraps KnowledgeService for the internal source
+ENFORCED WebRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources when a WebSearchProvider is injected
+ENFORCED AcademicRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources when an AcademicSearchProvider is injected
+ENFORCED CodeRetrievalSource #1989 -- constructed inside research/factory.py::_build_sources when a CodeSearchProvider is injected
+ENFORCED build_research_tool_factory #1989 -- called in api/app.py::_wire_research_engine; per-task source of the ResearchTool
+ENFORCED ResearchTool #1989 -- constructed by research/tool_factory.py::ResearchToolFactory.build_tools per task with the agent's project + identity binding
 ENFORCED build_toolsmith #1995 -- called by api/app.py::_build_toolsmith_runtime behind tool_creation_enabled + provider switch; wires the self-extending toolkit runtime
 ENFORCED ToolsmithService #1995 -- constructed by meta/toolsmith/factory.py::build_toolsmith; orchestrates gap detection -> author -> guard -> apply at the TOOL_CREATION altitude
 ENFORCED RingBufferCapabilityGapStore #1995 -- constructed by meta/toolsmith/factory.py::build_toolsmith; ring-buffered capability-gap sink + recurrence detector
5 changes: 5 additions & 0 deletions scripts/check_no_ghost_wiring.py
@@ -83,6 +83,11 @@
     # factory + tool-factory construction lets the manifest track the
     # knowledge substrate's wiring (#1988).
     "src/synthorg/knowledge/",
+    # research/ is reached at boot via api/app.py::_wire_research_engine
+    # (build_research_service / build_research_tool_factory); counting its
+    # factory + strategy + tool-factory construction lets the manifest track
+    # the research subsystem's wiring (#1989).
+    "src/synthorg/research/",
 )

