feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options by ComBba · Pull Request #10 · Two-Weeks-Team/vibe-mod

ComBba · 2026-05-12T10:43:52Z

What

scripts/openai-smoketest.ts now records per-call latency and accepts OPENAI_MODELS=a,b,c to run the suite against several models and print a comparison table (pass · median/max latency · avg output tokens).

Measured (real key, free-daily-usage tier — cost not a factor here)

model	pass	median	max	avg out tok
gpt-5.4-nano	7/7	1387ms	2290ms	~110
gpt-5.4-mini	7/7	1213ms	1346ms	~100
gpt-5-mini	3/7	—	—	—
gpt-5-nano	n/a	—	—	—
gpt-4o-mini	6/7	1859ms	2999ms	~90
gpt-4.1-mini / gpt-4.1-nano	n/a	—	—	—

Conclusion: gpt-5.4-nano and gpt-5.4-mini are the right picks — fast (~1.2–1.4s), no reasoning overhead, 7/7 on the rule-compile suite. Reasoning models (gpt-5-mini, o-series) are both slower and need a much bigger token cap, so they're a bad fit for this structured NL→JSON task.

devvit.json change

Trimmed openaiModel options to just gpt-5.4-nano (default) and gpt-5.4-mini, with the measured numbers in the labels. Removed gpt-5-nano (unavailable) and gpt-5-mini (unreliable with our max_completion_tokens: 700 cap).

Verify

tsc --noEmit / eslint --max-warnings 0 / prettier --check clean · vitest 151 passed (1 skipped) · acceptance 4/4 · doctor 0 hard issues · CI green.
OPENAI_MODELS=gpt-5.4-nano,gpt-5.4-mini npm run openai:smoketest → both 7/7 with a comparison table.

🤖 Generated with Claude Code

Summary by CodeRabbit

변경사항

새로운 기능
- OpenAI 모델 옵션 업데이트: 새로운 gpt-5.4-nano 및 gpt-5.4-mini 모델 옵션 추가
제거됨
- 이전 모델 옵션 제거. 기본값은 gpt-5.4-nano로 유지
Chores
- 다중 모델 테스트 및 성능 비교 기능 개선

…im devvit.json model list scripts/openai-smoketest.ts now records per-call wall-clock latency and accepts OPENAI_MODELS=a,b,c to run the suite against several models and print a comparison table (pass rate · median/max latency · avg output tokens). Ran the comparison against a real key (free-daily-usage tier — cost not a factor): gpt-5.4-nano 7/7 median 1387ms max 2290ms ~110 out tok gpt-5.4-mini 7/7 median 1213ms max 1346ms ~100 out tok ← fastest + most consistent gpt-5-mini 3/7 — reasoning model; burns the 700-tok budget on hidden reasoning so the JSON gets truncated ("Unexpected end of JSON input") gpt-5-nano n/a — 403, not available to this project gpt-4o-mini 6/7 median 1859ms max 2999ms — slowest + one miss gpt-4.1-mini / gpt-4.1-nano n/a — 403 So `gpt-5.4-nano` and `gpt-5.4-mini` are the right picks (fast, no reasoning overhead, 7/7). Trimmed devvit.json's openaiModel options to just those two — removed `gpt-5-nano` (unavailable) and `gpt-5-mini` (unreliable with our token cap) — and updated the labels with the measured numbers. Default stays nano (cheapest, 7/7); switch to mini if cost is free for you. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector · 2026-05-12T10:43:57Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

gemini-code-assist

Code Review

This pull request enhances the OpenAI smoke test script by adding support for multi-model comparisons and latency tracking. Key changes include refactoring the execution logic into a reusable runModel function, updating the compile function to measure request duration, and adding a comparison table for performance metrics. A bug was identified in the latency reporting logic where the minimum latency calculation would always result in zero due to an incorrect use of Math.min.

gemini-code-assist · 2026-05-12T10:45:20Z

+  const med = s.ms.length ? [...s.ms].sort((a, b) => a - b)[Math.floor(s.ms.length / 2)] : 0;
  console.log(
-    `${pass}/${CASES.length} cases as expected.  tokens: ${totIn} in / ${totOut} out  ≈ $${cost.toFixed(5)} (at ${MODEL} list price)`,
+    `  → ${s.pass}/${s.total} as expected  ·  latency median ${Math.round(med)}ms (min ${Math.round(Math.min(...s.ms, 0))} / max ${Math.round(Math.max(...s.ms, 0))})  ·  tokens ${s.tokensIn} in / ${s.tokensOut} out  ·  ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,


The logic for calculating the minimum latency is incorrect. Math.min(...s.ms, 0) will always return 0 because all measured latencies are positive numbers. This results in the reported min value always being 0ms regardless of the actual performance. You should guard against empty arrays and remove the 0 from the Math.min arguments to capture the actual minimum measured latency.

Suggested change

` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${Math.round(Math.min(...s.ms, 0))} / max ${Math.round(Math.max(...s.ms, 0))}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,

` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${s.ms.length ? Math.round(Math.min(...s.ms)) : 0} / max ${s.ms.length ? Math.round(Math.max(...s.ms)) : 0}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,

coderabbitai · 2026-05-12T10:48:21Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6830ae2c-3ee5-41c1-94df-af470b5137a9

📥 Commits

Reviewing files that changed from the base of the PR and between 5588f8d and 808be5a.

📒 Files selected for processing (2)

devvit.json
scripts/openai-smoketest.ts

개요

devvit.json의 OpenAI 모델 옵션을 gpt-5.4-nano 및 gpt-5.4-mini로 업데이트하고, 스모크 테스트를 단일 모델 실행에서 여러 모델의 성능을 비교하는 다중 모델 러너로 변환합니다. 각 모델별 요청 지연 시간, 통과 횟수, 토큰 및 비용 통계를 추적합니다.

변경 사항

모델 구성 및 스모크 테스트 개선

Layer / File(s)	요약
모델 옵션 설정 업데이트 `devvit.json`	openaiModel 선택지를 새로운 gpt-5.4-nano 및 gpt-5.4-mini 옵션으로 업데이트하고 레이블 및 설명을 포함합니다.
스모크 테스트 다중 모델 인프라 `scripts/openai-smoketest.ts`	node:perf_hooks에서 performance를 임포트하고, OPENAI_MODELS 환경 변수를 파싱하여 여러 모델을 대상으로 실행하기 위한 모델 목록을 작성합니다.
요청 지연 시간 및 결과 추적 `scripts/openai-smoketest.ts`	ApiResult 타입을 확장하여 모든 결과에 지연 시간을 포함하고, compile 함수를 업데이트하여 모델 매개변수를 받으며, fetch 호출 주변에 타이밍을 추가합니다.
다중 모델 비교 실행기 및 통계 `scripts/openai-smoketest.ts`	ModelSummary를 도입하고 runModel() 함수를 구현하여 각 모델의 통과 횟수, 지연 시간 중앙값, 토큰 및 비용을 계산합니다. 주 실행기는 모든 구성된 모델을 반복하고 비교 테이블을 출력하며, 치명적 오류 시 0이 아닌 코드로 종료합니다.

예상 코드 리뷰 노력

🎯 3 (보통) | ⏱️ ~20분

축하 시

토끼의 마음에서 나온 축하의 말 🐰

새로운 모델들이 나타났네요, gpt-5.4가 왔어!
스모크 테스트는 이제 비교 테이블을 춤춘답니다,
지연 시간을 재고, 비용을 계산하며,
여러 모델이 경쟁하니 정말 멋져요! 🚀✨

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/model-comparison

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options

…ng (Phase 1.7b Tier 1+2+3) Best-practice UX rework for the 5 deferred audit findings (#2 #3 #5 #7 #10) plus 4 new items (#A empty states, #B delete confirm, #C onboarding, #D token cost). Design + tier rationale: claudedocs/2026-05-14-ux-best-practices-plan.md. ## Tier 1 — compose flow (audit #2 #5 #7) ### #2 Compile preview confirmation form (composeConfirmForm) Compile no longer persists straight to draft. After validation, the moderator sees a confirm form with the rule rendered in plain English (humanizeRule()), the original sentence, and the per-compile token cost (~$0.0006 on gpt-5.4-mini). Tick "Edit instead of save" to re-open compose with the original NL pre-filled. Save persists draft + schedules dry-run via the new persistRuleAndStartDryRun() helper. Defence-in-depth: serialized rule is re-validated before write. New endpoint: /internal/form/compose-confirm-submit New form: composeConfirmForm ### #5 Clarification turn limit (3 rounds) Server-side counter shipped as a disabled paragraph field. After round 3, an oscillating LLM gets refused with an actionable toast ("Try rephrasing more concretely"), instead of opening yet another modal. Each clarify modal description prefixed with "(Round X of 3)" so the moderator sees how many tries remain. ### #7 Editable original rule in clarify modal Removed `disabled: true` on the Original rule field. helpText now says "Re-compile uses this text plus your answer below." ## Tier 2 — Manage rules menu + dashboard hardening (audit #3 #10 #A #B #D) ### #3 + #10 Per-rule control surface (Manage rules) New "vibe-mod: Manage rules" menu (subreddit-level, mod-only). Renders one form-group per rule with a `select` of available actions: - Drafts: Keep / Activate (shadow 24h) / Activate immediately / Delete - Active shadow rules: Keep / Promote shadow → live / Pause / Delete - Active live rules: Keep / Pause / Delete Submit applies all non-destructive actions atomically (one redis.set per bundle) via the new applyManageActions() helper. Dashboard becomes read-only — its old "Activate N drafts" boolean removed. New endpoints: /internal/menu/manage-rules, /internal/form/manage-rules-submit, /internal/form/manage-delete-confirm New forms: manageRulesForm, manageDeleteConfirmForm ### #B Delete confirmation Any `delete` action in the Manage submit forwards to a confirm form listing the rules about to disappear, with an explicit "I understand this is permanent" boolean. The pending action map round-trips through a disabled paragraph carrier so the second-step submit knows what to apply. ### #A Empty states - Manage menu emits a guided toast when there are zero rules at all. - Dashboard description shows "No rules yet — open ⋯ → Compose" when both rules count and recent actions are zero. ### #D Token cost transparency - Dashboard description: "Tokens used (lifetime): N in / N out (~$X on gpt-5.4-mini)" via the new estimateTokenCost() helper. - Compose confirm form shows per-compile cost so the moderator sees the unit economics before saving. ## Tier 3 — Onboarding (audit #C) Dashboard shows a 3-step welcome card on first visit (per-sub Redis flag `${sub}:onboarding:dismissed`). Cancel-label switches to "Don't show intro again" and ticks dismissOnboarding to persist the flag. New key in src/shared/redis-keys.ts. ## Verification - `npm run check` 4/4 gates green - 209 unit + integration tests (1 skipped): - 33 compose tests (8 updated for new 2-step flow + 4 new for #2/#5/#7) - 13 dashboard tests (4 updated for read-only + 4 new for onboarding/empty) - 17 manage tests (NEW file routes-manage.test.ts: menu render, all 5 actions, delete confirm round-trip) - existing trigger / scheduler / settings / undo / evaluator / executor unchanged ## Why a single PR Tier 1+2+3 all touch overlapping files (index.ts, devvit.json, the dashboard form shape). Splitting would force back-and-forth conflict resolution. Logical commits inside this PR (split when applicable in follow-up). Refs: claudedocs/2026-05-14-ux-best-practices-plan.md (Phase 1.7a), claudedocs/2026-05-14-compose-flow-audit.md (audit baseline), docs/demo-scenario.md (downstream dependency)

ComBba merged commit 3f5fa4d into main May 12, 2026
1 check passed

ComBba deleted the feat/model-comparison branch May 12, 2026 10:44

gemini-code-assist Bot reviewed May 12, 2026

View reviewed changes

ComBba mentioned this pull request May 14, 2026

feat(ux): compose-confirm form + Manage rules menu + dashboard onboarding (Phase 1.7b) #44

Merged

6 tasks

ComBba added a commit that referenced this pull request May 15, 2026

Merge pull request #10 from Two-Weeks-Team/feat/model-comparison

560865f

feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options#10

feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options#10
ComBba merged 1 commit into
mainfrom
feat/model-comparison

ComBba commented May 12, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

chatgpt-codex-connector Bot commented May 12, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 12, 2026

Uh oh!

coderabbitai Bot commented May 12, 2026 •

edited

Loading

Review failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${Math.round(Math.min(...s.ms, 0))} / max ${Math.round(Math.max(...s.ms, 0))}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,
	` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${s.ms.length ? Math.round(Math.min(...s.ms)) : 0} / max ${s.ms.length ? Math.round(Math.max(...s.ms)) : 0}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,

Conversation

ComBba commented May 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Measured (real key, free-daily-usage tier — cost not a factor here)

devvit.json change

Verify

Summary by CodeRabbit

변경사항

Uh oh!

chatgpt-codex-connector Bot commented May 12, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

개요

변경 사항

예상 코드 리뷰 노력

축하 시

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ComBba commented May 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading