Merged
Commits
15 commits
- `3214437` feat: t1328 matterbridge agent + t1329 cross-review judge pipeline (marcusquinn, Feb 25, 2026)
- `02dcb75` fix: address CodeRabbit review — arg parsing, binary validation, arch… (marcusquinn, Feb 26, 2026)
- `0582db1` fix: address remaining review feedback — error visibility and curl er… (marcusquinn, Feb 26, 2026)
- `d239c55` fix: address CodeRabbit review — tail arg safety, config validation, … (marcusquinn, Feb 26, 2026)
- `a841d19` fix: bound judge payload size and replace sed with parameter expansio… (marcusquinn, Feb 26, 2026)
- `2760c2c` fix: redirect runner errors to log files, sanitize reasoning field (marcusquinn, Feb 26, 2026)
- `abfe08d` fix: escape all SQL string values in cmd_score and fix darwin-64bit a… (marcusquinn, Feb 26, 2026)
- `5c7d2b5` fix: correct macOS binary asset names in matterbridge docs (darwin-64… (marcusquinn, Feb 27, 2026)
- `03f3045` fix: improve error visibility — redirect stderr to log files instead … (marcusquinn, Feb 27, 2026)
- `26afa19` fix: resolve SC2119/SC2120 — pass explicit empty arg to get_all_tier_… (alex-solovyev, Feb 27, 2026)
- `a656c53` fix: address remaining review feedback — input validation, error prop… (alex-solovyev, Feb 27, 2026)
- `b757969` refactor: move _cross_review_judge_score before its caller for readab… (alex-solovyev, Feb 27, 2026)
- `d1e8a49` fix: address final review feedback — decimal score parsing, judge mod… (alex-solovyev, Feb 27, 2026)
- `465cee9` fix: sanitize judge_model, fix decimal score clamping, ensure DATA_DI… (marcusquinn, Feb 27, 2026)
- `d05c08f` fix: remove duplicate ensure_dirs, escape SQL inputs in cmd_results (alex-solovyev, Feb 27, 2026)
52 changes: 14 additions & 38 deletions .agents/scripts/commands/cross-review.md
@@ -1,30 +1,21 @@
---
- description: Dispatch a prompt to multiple AI models, diff results, and optionally score via a judge model
+ description: Dispatch the same prompt to multiple AI models, diff results, and optionally auto-score via a judge model
agent: Build+
mode: subagent
---

- Run a multi-model adversarial review: dispatch the same prompt to N models in parallel, collect outputs, diff results, and optionally score via a judge model (Ouroboros-style pipeline).
+ Dispatch a prompt to multiple AI models in parallel, collect and diff their responses, and optionally score them via a judge model.

Target: $ARGUMENTS

## Instructions

- Parse the arguments to extract:
- - `--prompt`: the review prompt (required)
- - `--models`: comma-separated model tiers (default: `sonnet,opus`)
- - `--score`: enable judge scoring pipeline (optional flag)
- - `--judge`: judge model tier (default: `opus`)
- - `--task-type`: scoring category — `code`, `review`, `analysis`, `text`, `general` (default: `general`)
- - `--timeout`: per-model timeout in seconds (default: 600)
- - `--output`: output directory (default: auto-generated tmp dir)
+ 1. Parse the user's arguments. Common forms:

```bash
/cross-review "review this PR diff" --models sonnet,opus
/cross-review "audit this code" --models sonnet,gemini-pro,gpt-4.1 --score
- /cross-review "design this API" --score --judge opus --task-type analysis
+ /cross-review "design this API" --score --judge opus
```

2. Run the cross-review:
@@ -41,11 +32,11 @@ Parse the arguments to extract:
--models "sonnet,gemini-pro,gpt-4.1" \
--score

- # With custom judge model and task type
+ # With custom judge model
~/.aidevops/agents/scripts/compare-models-helper.sh cross-review \
--prompt "your prompt here" \
--models "sonnet,opus" \
- --score --judge sonnet --task-type review
+ --score --judge sonnet
```

3. Present the results:
@@ -55,18 +46,16 @@ Parse the arguments to extract:
- Note any models that failed to respond

4. If `--score` was used, scores are automatically:
- - Recorded in the model-comparisons SQLite DB (`~/.aidevops/.agent-workspace/memory/model-comparisons.db`)
- - Fed into the pattern tracker for data-driven model routing (`/route`, `/patterns`)
+ - Recorded in the model-comparisons SQLite DB
+ - Fed into the pattern tracker for model routing (`/route`, `/patterns`)
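The recorded scores land in a local SQLite database (the path appears in the pre-change line above). A minimal sketch for inspecting them directly — the `comparisons` table and the `model`/`overall` column names are assumptions for illustration, not confirmed by this PR:

```shell
# Hypothetical query; table and column names are assumed, not confirmed.
DB="$HOME/.aidevops/.agent-workspace/memory/model-comparisons.db"
if [ -f "$DB" ]; then
  # Highest-scoring models first
  sqlite3 "$DB" "SELECT model, overall FROM comparisons ORDER BY overall DESC LIMIT 10;"
fi
```

If the helper uses a different schema, only the SELECT needs adjusting; `sqlite3 "$DB" .schema` will show the real tables.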

## Options

| Option | Default | Description |
|--------|---------|-------------|
| `--prompt` | (required) | The review prompt |
| `--models` | `sonnet,opus` | Comma-separated model tiers to compare |
| `--score` | off | Auto-score outputs via judge model |
| `--judge` | `opus` | Judge model tier (used with `--score`) |
- | `--task-type` | `general` | Scoring category: `code`, `review`, `analysis`, `text`, `general` |
| `--timeout` | `600` | Seconds per model |
| `--output` | auto | Directory for raw outputs |
| `--workdir` | `pwd` | Working directory for model context |
@@ -77,14 +66,13 @@ Parse the arguments to extract:

## Scoring Criteria (judge model, 1-10 scale)

- | Criterion | Scale | Description |
- |-----------|-------|-------------|
- | Correctness | 1-10 | Factual accuracy and technical correctness |
- | Completeness | 1-10 | Coverage of all requirements and edge cases |
- | Quality | 1-10 | Code quality / writing quality |
- | Clarity | 1-10 | Clear explanation, good formatting, readability |
- | Adherence | 1-10 | Following instructions precisely, staying on-task |
- | Overall | 1-10 | Judge's holistic assessment |
+ | Criterion | Description |
+ |-----------|-------------|
+ | correctness | Factual accuracy and technical correctness |
+ | completeness | Coverage of all requirements and edge cases |
+ | quality | Code quality, best practices, maintainability |
+ | clarity | Clear explanation, good formatting, readability |
+ | adherence | Following the original prompt instructions precisely |
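The criteria above imply a per-criterion verdict from the judge, but this diff does not show the judge's output schema. The JSON below is a hypothetical shape, used only to illustrate how such per-criterion scores could be aggregated:

```shell
# Hypothetical verdict shape — the real judge-scores schema is not shown here.
f=$(mktemp)
cat > "$f" <<'EOF'
{
  "scores": {
    "sonnet": {"correctness": 7, "completeness": 6, "quality": 7, "clarity": 8, "adherence": 7},
    "opus":   {"correctness": 9, "completeness": 8, "quality": 8, "clarity": 8, "adherence": 9}
  }
}
EOF

# Mean of the five criteria per model, one "<model> <mean>" line each
jq -r '.scores | to_entries[] | "\(.key) \(.value | add / length)"' "$f"
```

If the helper writes a different structure to its judge output, only the jq path expression needs adjusting.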

## Examples

@@ -96,25 +84,13 @@ Parse the arguments to extract:
/cross-review "Design a rate limiting strategy for a REST API" \
--models sonnet,opus,pro --score

- # Custom judge model and task type
- /cross-review "Audit this architecture" --models "sonnet,opus" --score --judge opus --task-type analysis

# Quick diff with custom timeout
/cross-review "Summarize the key changes in this diff" --models haiku,sonnet --timeout 120

# View scoring results after a cross-review
/score-responses --leaderboard
```

- ## Output
-
- - Per-model responses displayed inline
- - Diff summary (word counts, unified diff for 2-model comparisons)
- - Judge scores table (when `--score` is set)
- - Winner declaration with reasoning
- - Results saved to `~/.aidevops/.agent-workspace/tmp/cross-review-<timestamp>/`
- - Judge JSON saved to `<output_dir>/judge-scores.json`

## Related

- `/compare-models` — Compare model capabilities and pricing (no live dispatch)