t1012: MODELS.md — live model leaderboard with success rates#1305
marcusquinn wants to merge 3 commits into main
…1012) Queries three SQLite databases (model-registry, pattern-tracker, response-scoring) to produce a Markdown leaderboard showing all available models, success rates by tier and task type, quality scores, and head-to-head contest results.
Generated from live data: 17 models across 6 providers, 487 pattern data points, 18 scored responses. Shows success rates by model tier and task type.
Walkthrough

Introduces a new Bash script that auto-generates a MODELS.md leaderboard by aggregating model registry, pattern-tracker, and response-scoring database data. Integrates periodic regeneration into the supervisor pulse cycle with hourly throttling. Updates manifest to export the new script.
Estimated code review effort: 4 (Complex), ~45 minutes
Pre-merge checks: 2 passed, 2 failed (2 warnings)
🔍 Code Quality Report: Code Review Monitoring Report. Generated on: Thu Feb 12 21:54:58 UTC 2026. Generated by AI DevOps Framework Code Review Monitoring.
Hourly throttled pulse phase iterates over known repos and regenerates MODELS.md when pattern data changes. Registered in subagent-index.toon.
Actionable comments posted: 4
🧹 Nitpick comments (4)

.agents/scripts/generate-models-md.sh (4)

71-80: SQL injection risk in db_has_data: $table_name is interpolated directly into the query.

While currently called only with internal hardcoded table names, this is a fragile pattern. If this utility function is ever reused with user-supplied input, it becomes exploitable. Consider validating the table name against [a-zA-Z_] or using sqlite3's .tables to confirm existence first.

🛡️ Suggested hardening

     db_has_data() {
         local db_path="$1"
         local table_name="$2"
    +    # Validate table name contains only safe characters
    +    if [[ ! "$table_name" =~ ^[a-zA-Z_][a-zA-Z0-9_]*$ ]]; then
    +        return 1
    +    fi
         [[ -f "$db_path" ]] || return 1
         local count
         count=$(sqlite3 "$db_path" "SELECT COUNT(*) FROM $table_name;" 2>/dev/null) || return 1

211-245: Leaderboard queries fire N×3 separate sqlite3 invocations per tier. Consider consolidating.

Each tier iteration (line 213) spawns 3 separate sqlite3 processes (successes, failures, last_used). With 5 tiers that's 15 process spawns. A single query could return all tiers at once, improving both performance and readability.

♻️ Sketch: single-query approach

    -    local tiers="opus sonnet pro flash haiku"
    -    for tier in $tiers; do
    -        local successes failures last_used
    -        successes=$(sqlite3 "$MEMORY_DB" "..." 2>/dev/null) || successes=0
    -        failures=$(sqlite3 "$MEMORY_DB" "..." 2>/dev/null) || failures=0
    -        ...
    -    done
    +    sqlite3 -separator '|' "$MEMORY_DB" "
    +        SELECT
    +            CASE
    +                WHEN tags LIKE '%model:opus%' OR content LIKE '%[model:opus]%' THEN 'opus'
    +                WHEN tags LIKE '%model:sonnet%' OR content LIKE '%[model:sonnet]%' THEN 'sonnet'
    +                WHEN tags LIKE '%model:pro%' OR content LIKE '%[model:pro]%' THEN 'pro'
    +                WHEN tags LIKE '%model:flash%' OR content LIKE '%[model:flash]%' THEN 'flash'
    +                WHEN tags LIKE '%model:haiku%' OR content LIKE '%[model:haiku]%' THEN 'haiku'
    +            END AS tier,
    +            SUM(CASE WHEN type IN ($SUCCESS_TYPES) THEN 1 ELSE 0 END),
    +            SUM(CASE WHEN type IN ($FAILURE_TYPES) THEN 1 ELSE 0 END),
    +            SUBSTR(MAX(created_at), 1, 10)
    +        FROM learnings
    +        WHERE type IN ($PATTERN_TYPES)
    +        GROUP BY tier
    +        HAVING tier IS NOT NULL
    +        ORDER BY ...
    +    " 2>/dev/null | while IFS='|' read -r tier successes failures last_used; do
    +        local tasks_total=$((successes + failures))
    +        local rate=$(( (successes * 100) / tasks_total ))
    +        echo "| $tier | $tasks_total | $successes | $failures | ${rate}% | $last_used |"
    +    done

255-303: Same N×2 query pattern in task type breakdown: 13 task types × 2 queries = 26 sqlite3 spawns.

Same consolidation opportunity as the leaderboard. Lower priority since this runs hourly, but worth noting for consistency.

433-468: generate_models_md: clean assembly of sections via subshell redirect.

The grouped command block { ... } > "$output" is a good pattern. All sub-generators write to stdout, cleanly captured. Explicit return 0 present.

One minor note: if the output directory doesn't exist, the redirect will fail. Consider adding a mkdir -p for $(dirname "$output") before the write.

🛡️ Optional: ensure output directory exists

     generate_models_md() {
         local output="$1"
    +    mkdir -p "$(dirname "$output")" 2>/dev/null || true
         local timestamp

As per coding guidelines: "Error recovery mechanisms" for .agents/scripts/*.sh.
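One detail the single-query sketch in the nitpicks leaves open is what happens when a tier has pattern rows but none of them count as a success or a failure: tasks_total would be 0 and the integer division would fail. The following is a minimal sketch of a guard, assuming a hypothetical helper named success_rate that is not part of the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the PR): integer success rate with a zero guard.
success_rate() {
    local successes="$1" failures="$2"
    local total=$((successes + failures))
    if ((total == 0)); then
        # No success or failure rows for this tier: avoid dividing by zero.
        echo "n/a"
        return 0
    fi
    echo "$((successes * 100 / total))%"
}

success_rate 3 1 # prints 75%
success_rate 0 0 # prints n/a
```

Printing a placeholder rather than 0% keeps a tier with no recorded outcomes distinguishable from a tier that always fails; whether the leaderboard should show such rows at all is a separate choice.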
    sqlite3 -separator '|' "$SCORING_DB" "
        SELECT
            r.model_id,
            COUNT(DISTINCT r.response_id),
            printf('%.2f',
                AVG(CASE WHEN s.criterion='correctness' THEN s.score * 0.30
                    WHEN s.criterion='completeness' THEN s.score * 0.25
                    WHEN s.criterion='code_quality' THEN s.score * 0.25
                    WHEN s.criterion='clarity' THEN s.score * 0.20
                    ELSE 0 END) * (1.0 / 0.25)
            ),
            printf('%.1f', AVG(r.response_time))
        FROM responses r
        JOIN scores s ON r.response_id = s.response_id
        GROUP BY r.model_id
        ORDER BY AVG(CASE WHEN s.criterion='correctness' THEN s.score * 0.30
            WHEN s.criterion='completeness' THEN s.score * 0.25
            WHEN s.criterion='code_quality' THEN s.score * 0.25
            WHEN s.criterion='clarity' THEN s.score * 0.20
            ELSE 0 END) DESC;
    " 2>/dev/null | while IFS='|' read -r model responses avg_score avg_time; do
        echo "| $model | $responses | $avg_score/5.0 | $avg_time |"
    done

Weighted average calculation is fragile — assumes exactly 4 score criteria per response.
The math AVG(...weighted_component...) * (1.0 / 0.25) (i.e., * 4) only yields a correct weighted score if every response has exactly one row per criterion (correctness, completeness, code_quality, clarity). If any criterion is missing for a response, the divisor in AVG changes and the result inflates or deflates silently.
Consider using a SUM/COUNT(DISTINCT) approach or a subquery that computes the weighted sum per response first, then averages across responses:

    SELECT r.model_id,
           COUNT(DISTINCT r.response_id),
           printf('%.2f', AVG(ws.weighted_score)),
           printf('%.1f', AVG(r.response_time))
    FROM responses r
    JOIN (
        SELECT response_id,
               SUM(CASE criterion
                   WHEN 'correctness' THEN score * 0.30
                   WHEN 'completeness' THEN score * 0.25
                   WHEN 'code_quality' THEN score * 0.25
                   WHEN 'clarity' THEN score * 0.20
                   ELSE 0 END) AS weighted_score
        FROM scores
        GROUP BY response_id
    ) ws ON r.response_id = ws.response_id
    GROUP BY r.model_id
    ORDER BY AVG(ws.weighted_score) DESC;

This correctly handles missing criteria by summing only available weights per response.
🤖 Prompt for AI Agents
In @.agents/scripts/generate-models-md.sh around lines 336 - 358, The current
AVG(...) * (1.0/0.25) calculation in the sqlite query assumes each response has
all four criteria and miscomputes when a criterion is missing; change the query
to compute a per-response weighted sum in a subquery (e.g., alias it ws with
columns response_id and weighted_score computed by SUM(CASE WHEN ... THEN score
* weight END) GROUP BY response_id), JOIN that ws back to responses r on
r.response_id, then use AVG(ws.weighted_score) (formatted with printf) and ORDER
BY AVG(ws.weighted_score) DESC; update the variable reads (model, responses,
avg_score, avg_time) accordingly so avg_score comes from the averaged
ws.weighted_score.
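The arithmetic behind this comment can be checked on a single response. The sketch below is illustrative only: it assumes integer scores, expresses the 0.30/0.25/0.25/0.20 weights as percentages, and reports results in hundredths of a point so that plain Bash integer arithmetic suffices. The helper names weight_for and compare_scores are hypothetical and do not appear in the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Weights in percent, mirroring the 0.30/0.25/0.25/0.20 split in the query.
weight_for() {
    case "$1" in
        correctness) echo 30 ;;
        completeness) echo 25 ;;
        code_quality) echo 25 ;;
        clarity) echo 20 ;;
        *) echo 0 ;;
    esac
}

# Hypothetical helper. Arguments are criterion=score pairs for ONE response.
# Prints "<avg_times_4> <per_response_sum>", both in hundredths of a point.
compare_scores() {
    local sum=0 rows=0 pair criterion score
    for pair in "$@"; do
        criterion="${pair%%=*}"
        score="${pair#*=}"
        sum=$((sum + score * $(weight_for "$criterion")))
        rows=$((rows + 1))
    done
    # Left: AVG(component) * 4, as in the current query.
    # Right: SUM(component), as in the suggested subquery.
    echo "$((sum * 4 / rows)) $sum"
}

# All four criteria present: both formulas agree (4.00).
compare_scores correctness=4 completeness=4 code_quality=4 clarity=4 # prints 400 400
# clarity missing: AVG * 4 reports 4.26, the per-response sum reports 3.20.
compare_scores correctness=4 completeness=4 code_quality=4 # prints 426 320
```

Note that the per-response sum is not renormalized: a response that lacks a criterion scores lower than a complete one with the same marks. Dividing by the sum of the weights actually present would be one way to renormalize, but that goes beyond what the review proposes and is mentioned here only as a possibility.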
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --output)
                OUTPUT_PATH="$2"
                shift 2
                ;;

Missing guard on --output — $2 may be unbound if it's the last argument.
With set -u, passing --output without a value will cause an unbound variable error. Add a check before accessing $2.
🐛 Proposed fix

         --output)
    +        if [[ $# -lt 2 ]]; then
    +            log_error "--output requires a value"
    +            exit 1
    +        fi
             OUTPUT_PATH="$2"
             shift 2
             ;;

As per coding guidelines: "Reliability and robustness" and "Proper exit codes" for .agents/scripts/*.sh.
🤖 Prompt for AI Agents
In @.agents/scripts/generate-models-md.sh around lines 492 - 497, The --output
case in the argument-parsing loop uses "$2" without guarding that a value
exists, which causes an unbound variable error under set -u; update the case
branch that assigns OUTPUT_PATH to first verify that a next argument exists and
is not another option (e.g., check that $# -ge 2 and that "$2" does not start
with '-') and if the value is missing print a clear error to stderr and exit
with a non-zero code; adjust the case for "--output" (and any similar flags) to
perform this validation before assigning OUTPUT_PATH and shifting.
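A minimal, self-contained sketch of the guarded parsing loop follows. It combines the argument-count check from the proposed fix with the "does not start with '-'" check from the prompt. The parse_args wrapper is hypothetical and returns a non-zero status instead of exiting, purely so the behaviour can be exercised in isolation.

```shell
#!/usr/bin/env bash
set -u

OUTPUT_PATH=""

# Hypothetical wrapper around the parsing loop (not the PR's actual function).
parse_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --output)
                # Require a following argument that does not look like a flag.
                # The first test short-circuits, so "$2" is never expanded
                # when it is unset, which keeps set -u satisfied.
                if [[ $# -lt 2 ]] || [[ "$2" == -* ]]; then
                    echo "error: --output requires a value" >&2
                    return 2
                fi
                OUTPUT_PATH="$2"
                shift 2
                ;;
            *)
                # Other flags are ignored in this sketch.
                shift
                ;;
        esac
    done
    return 0
}

parse_args --output MODELS.md && echo "OUTPUT_PATH=$OUTPUT_PATH"
parse_args --output || echo "rejected: missing value"
```

One trade-off of the leading-dash check is that it also rejects legitimate paths that begin with '-'; a script that needs those would have to drop that half of the condition.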
    local models_md_interval=3600 # seconds (1 hour)
    local models_md_stamp="$SUPERVISOR_DIR/models-md-last-regen"
    local models_md_now
    models_md_now=$(date +%s)
    local models_md_last=0
    if [[ -f "$models_md_stamp" ]]; then
        models_md_last=$(cat "$models_md_stamp" 2>/dev/null || echo 0)
    fi
    local models_md_elapsed=$((models_md_now - models_md_last))
    if [[ "$models_md_elapsed" -ge "$models_md_interval" ]]; then
        local generate_script="${SCRIPT_DIR}/generate-models-md.sh"
        if [[ -x "$generate_script" ]]; then
            local models_repos
            models_repos=$(db "$SUPERVISOR_DB" "SELECT DISTINCT repo FROM tasks;" 2>/dev/null || true)
            if [[ -n "$models_repos" ]]; then
                while IFS= read -r models_repo_path; do
                    [[ -n "$models_repo_path" && -d "$models_repo_path" ]] || continue
                    local models_repo_root
                    models_repo_root=$(git -C "$models_repo_path" rev-parse --show-toplevel 2>/dev/null) || continue
                    log_verbose " Phase 12: Regenerating MODELS.md in $models_repo_root"
                    if "$generate_script" --output "${models_repo_root}/MODELS.md" --quiet 2>/dev/null; then
                        if git -C "$models_repo_root" diff --quiet -- MODELS.md 2>/dev/null; then
                            log_verbose " Phase 12: MODELS.md unchanged in $models_repo_root"
                        else
                            git -C "$models_repo_root" add MODELS.md 2>/dev/null \
                                && git -C "$models_repo_root" commit -m "chore: regenerate MODELS.md leaderboard (t1012)" --no-verify 2>/dev/null \
                                && git -C "$models_repo_root" push 2>/dev/null \
                                && log_info " Phase 12: MODELS.md updated and pushed ($models_repo_root)" \
                                || log_warn " Phase 12: MODELS.md regenerated but commit/push failed ($models_repo_root)"
                        fi
                    else
                        log_warn " Phase 12: MODELS.md generation failed for $models_repo_root"
                    fi
                done <<<"$models_repos"
            fi
        fi
        echo "$models_md_now" > "$models_md_stamp" 2>/dev/null || true
    else

Don’t advance the regen stamp on failure — preserve throttle state.
Right now the stamp is updated unconditionally, so a failed generation (or missing script) suppresses retries for an hour. The PR objective says failures should preserve throttling state. Gate the stamp update on success (or at least when at least one repo regenerated).
✅ Suggested reliability guard

     if [[ "$models_md_elapsed" -ge "$models_md_interval" ]]; then
         local generate_script="${SCRIPT_DIR}/generate-models-md.sh"
         if [[ -x "$generate_script" ]]; then
    +        local models_md_success=false
             local models_repos
             models_repos=$(db "$SUPERVISOR_DB" "SELECT DISTINCT repo FROM tasks;" 2>/dev/null || true)
             if [[ -n "$models_repos" ]]; then
                 while IFS= read -r models_repo_path; do
                     [[ -n "$models_repo_path" && -d "$models_repo_path" ]] || continue
                     local models_repo_root
                     models_repo_root=$(git -C "$models_repo_path" rev-parse --show-toplevel 2>/dev/null) || continue
                     log_verbose " Phase 12: Regenerating MODELS.md in $models_repo_root"
                     if "$generate_script" --output "${models_repo_root}/MODELS.md" --quiet 2>/dev/null; then
    +                    models_md_success=true
                         if git -C "$models_repo_root" diff --quiet -- MODELS.md 2>/dev/null; then
                             log_verbose " Phase 12: MODELS.md unchanged in $models_repo_root"
                         else
                             git -C "$models_repo_root" add MODELS.md 2>/dev/null \
                                 && git -C "$models_repo_root" commit -m "chore: regenerate MODELS.md leaderboard (t1012)" --no-verify 2>/dev/null \
                                 && git -C "$models_repo_root" push 2>/dev/null \
                                 && log_info " Phase 12: MODELS.md updated and pushed ($models_repo_root)" \
                                 || log_warn " Phase 12: MODELS.md regenerated but commit/push failed ($models_repo_root)"
                         fi
                     else
                         log_warn " Phase 12: MODELS.md generation failed for $models_repo_root"
                     fi
                 done <<<"$models_repos"
             fi
    -    fi
    -    echo "$models_md_now" > "$models_md_stamp" 2>/dev/null || true
    +        if [[ "$models_md_success" == "true" ]]; then
    +            echo "$models_md_now" >"$models_md_stamp" 2>/dev/null || true
    +        else
    +            log_warn " Phase 12: MODELS.md regen failed — keeping throttle stamp unchanged"
    +        fi
    +    fi
     else

🤖 Prompt for AI Agents
In @.agents/scripts/supervisor-helper.sh around lines 11008 - 11045, The stamp
update is currently unconditional; change this so the stamp (models_md_stamp) is
only written when at least one repository was successfully regenerated.
Introduce a local flag (e.g., models_md_succeeded=0) before iterating
models_repos and set it to 1 when a regeneration produces a change (inside the
branch where "$generate_script" succeeds and git -C "$models_repo_root" diff
shows MODELS.md was updated); after the loop, only echo "$models_md_now" >
"$models_md_stamp" if models_md_succeeded is 1. Keep all existing
logging/commit/push behavior but avoid touching the stamp on failures or when no
repos changed.
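The "stamp only on success" idea can be shown without the supervisor around it. The sketch below assumes a hypothetical helper, run_throttled, that takes a stamp file, an interval in seconds, and a command; it is not the PR's code, only the same throttle pattern reduced to its core.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run "$@" at most once per interval and advance the
# stamp only when the command succeeds, so a failure is retried next pulse.
run_throttled() {
    local stamp_file="$1" interval="$2"
    shift 2
    local now last=0
    now=$(date +%s)
    if [[ -f "$stamp_file" ]]; then
        last=$(cat "$stamp_file" 2>/dev/null || echo 0)
    fi
    # Treat an empty or corrupt stamp as "never ran".
    [[ "$last" =~ ^[0-9]+$ ]] || last=0
    if ((now - last < interval)); then
        return 0 # throttled: skip silently
    fi
    if "$@"; then
        echo "$now" >"$stamp_file"
        return 0
    fi
    return 1 # failed: stamp left untouched
}

stamp_dir=$(mktemp -d)
run_throttled "$stamp_dir/stamp" 3600 false || echo "failed, stamp not written"
run_throttled "$stamp_dir/stamp" 3600 true && echo "succeeded, stamp written"
rm -rf "$stamp_dir"
```

Whether "success" should mean "the generator exited 0" (as in the suggested diff) or "MODELS.md actually changed" (as in the prompt text) is a policy choice the review leaves open; this sketch follows the former.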
                else
                    log_warn " Phase 12: MODELS.md generation failed"
                fi
            fi
        fi
        record_throttle "$models_md_throttle_key" 2>/dev/null || true
    else
        log_verbose " Phase 12: MODELS.md regen skipped (throttled)"
    fi

Fix the stray Phase‑12 throttle block — it breaks bash parsing.
There’s a dangling else/fi block plus record_throttle calls after the Phase 12 logic, which makes the script invalid and matches the ShellCheck parse errors reported in CI (SC1073/SC1056/SC1072). Remove the orphaned block to restore a single coherent throttle path.
🧩 Proposed fix (remove orphaned block)

    -            else
    -                log_warn " Phase 12: MODELS.md generation failed"
    -            fi
    -        fi
    -    fi
    -    record_throttle "$models_md_throttle_key" 2>/dev/null || true
    -else
    -    log_verbose " Phase 12: MODELS.md regen skipped (throttled)"
    -fi

As per coding guidelines, “Run ShellCheck with zero violations on all scripts in .agents/scripts/”.
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor-helper.sh around lines 11049 - 11057, Remove the
orphaned else/fi block and the stray record_throttle call that follow the Phase
12 MODELS.md logic so the throttle path is a single coherent if/else branch;
specifically, edit the Phase 12 section around the log_warn/log_verbose outputs
and remove the dangling else/fi and the subsequent record_throttle
"$models_md_throttle_key" call so that only the intended throttle handling
remains and the script parses correctly (ensure functions/commands referenced
like record_throttle, log_warn, and log_verbose remain in their proper
conditional blocks).
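Parse errors of this kind can be caught before CI with bash's own no-execute mode. The sketch below is an illustration rather than part of the PR: check_syntax is a hypothetical helper, and the two temporary files stand in for a valid script and one with an orphaned else/fi.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only if bash can parse the file.
# bash -n reads the script and checks syntax without executing anything.
check_syntax() {
    bash -n "$1" 2>/dev/null
}

tmp_dir=$(mktemp -d)
printf 'if true; then\n    echo ok\nfi\n' >"$tmp_dir/good.sh"
# An else/fi pair with no opening if, similar in shape to the orphaned block.
printf 'if true; then\n    echo ok\nfi\nelse\n    echo orphan\nfi\n' >"$tmp_dir/bad.sh"

check_syntax "$tmp_dir/good.sh" && echo "good.sh parses"
check_syntax "$tmp_dir/bad.sh" || echo "bad.sh has a syntax error"
rm -rf "$tmp_dir"
```

This only checks that the file parses; it does not replace ShellCheck, which the coding guidelines quoted above require.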
Closing: merge conflicts and framework validation failure. Task t1012 will be re-dispatched fresh. Also blocked-by:t1011 which needs re-implementation.



Summary
- generate-models-md.sh: queries three SQLite databases (model-registry, pattern-tracker, response-scoring) to produce a live Markdown leaderboard
- MODELS.md in repo root showing: all 17 available models across 6 providers, success rates by model tier and task type, quality scores from response-scoring evaluations, and head-to-head contest results
- subagent-index.toon

Output Format
Ref #1302