feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options#10
Conversation
…im devvit.json model list
scripts/openai-smoketest.ts now records per-call wall-clock latency and accepts
OPENAI_MODELS=a,b,c to run the suite against several models and print a
comparison table (pass rate · median/max latency · avg output tokens).
Ran the comparison against a real key (free-daily-usage tier — cost not a factor):
gpt-5.4-nano 7/7 median 1387ms max 2290ms ~110 out tok
gpt-5.4-mini 7/7 median 1213ms max 1346ms ~100 out tok ← fastest + most consistent
gpt-5-mini 3/7 — reasoning model; burns the 700-tok budget on hidden reasoning
so the JSON gets truncated ("Unexpected end of JSON input")
gpt-5-nano n/a — 403, not available to this project
gpt-4o-mini 6/7 median 1859ms max 2999ms — slowest + one miss
gpt-4.1-mini / gpt-4.1-nano n/a — 403
So `gpt-5.4-nano` and `gpt-5.4-mini` are the right picks (fast, no reasoning
overhead, 7/7). Trimmed devvit.json's openaiModel options to just those two —
removed `gpt-5-nano` (unavailable) and `gpt-5-mini` (unreliable with our token
cap) — and updated the labels with the measured numbers. Default stays nano
(cheapest, 7/7); switch to mini if cost is free for you.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Code Review
This pull request enhances the OpenAI smoke test script by adding support for multi-model comparisons and latency tracking. Key changes include refactoring the execution logic into a reusable runModel function, updating the compile function to measure request duration, and adding a comparison table for performance metrics. A bug was identified in the latency reporting logic where the minimum latency calculation would always result in zero due to an incorrect use of Math.min.
| const med = s.ms.length ? [...s.ms].sort((a, b) => a - b)[Math.floor(s.ms.length / 2)] : 0; | ||
| console.log( | ||
| `${pass}/${CASES.length} cases as expected. tokens: ${totIn} in / ${totOut} out ≈ $${cost.toFixed(5)} (at ${MODEL} list price)`, | ||
| ` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${Math.round(Math.min(...s.ms, 0))} / max ${Math.round(Math.max(...s.ms, 0))}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`, |
There was a problem hiding this comment.
The logic for calculating the minimum latency is incorrect. Math.min(...s.ms, 0) will always return 0 because all measured latencies are positive numbers. This results in the reported min value always being 0ms regardless of the actual performance. You should guard against empty arrays and remove the 0 from the Math.min arguments to capture the actual minimum measured latency.
| ` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${Math.round(Math.min(...s.ms, 0))} / max ${Math.round(Math.max(...s.ms, 0))}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`, | |
| ` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${s.ms.length ? Math.round(Math.min(...s.ms)) : 0} / max ${s.ms.length ? Math.round(Math.max(...s.ms)) : 0}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`, |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
개요devvit.json의 OpenAI 모델 옵션을 gpt-5.4-nano 및 gpt-5.4-mini로 업데이트하고, 스모크 테스트를 단일 모델 실행에서 여러 모델의 성능을 비교하는 다중 모델 러너로 변환합니다. 각 모델별 요청 지연 시간, 통과 횟수, 토큰 및 비용 통계를 추적합니다. 변경 사항모델 구성 및 스모크 테스트 개선
예상 코드 리뷰 노력🎯 3 (보통) | ⏱️ ~20분 축하 시
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options
…ng (Phase 1.7b Tier 1+2+3) Best-practice UX rework for the 5 deferred audit findings (#2 #3 #5 #7 #10) plus 4 new items (#A empty states, #B delete confirm, #C onboarding, #D token cost). Design + tier rationale: claudedocs/2026-05-14-ux-best-practices-plan.md. ## Tier 1 — compose flow (audit #2 #5 #7) ### #2 Compile preview confirmation form (composeConfirmForm) Compile no longer persists straight to draft. After validation, the moderator sees a confirm form with the rule rendered in plain English (humanizeRule()), the original sentence, and the per-compile token cost (~$0.0006 on gpt-5.4-mini). Tick "Edit instead of save" to re-open compose with the original NL pre-filled. Save persists draft + schedules dry-run via the new persistRuleAndStartDryRun() helper. Defence-in-depth: serialized rule is re-validated before write. New endpoint: /internal/form/compose-confirm-submit New form: composeConfirmForm ### #5 Clarification turn limit (3 rounds) Server-side counter shipped as a disabled paragraph field. After round 3, an oscillating LLM gets refused with an actionable toast ("Try rephrasing more concretely"), instead of opening yet another modal. Each clarify modal description prefixed with "(Round X of 3)" so the moderator sees how many tries remain. ### #7 Editable original rule in clarify modal Removed `disabled: true` on the Original rule field. helpText now says "Re-compile uses this text plus your answer below." ## Tier 2 — Manage rules menu + dashboard hardening (audit #3 #10 #A #B #D) ### #3 + #10 Per-rule control surface (Manage rules) New "vibe-mod: Manage rules" menu (subreddit-level, mod-only). Renders one form-group per rule with a `select` of available actions: - Drafts: Keep / Activate (shadow 24h) / Activate immediately / Delete - Active shadow rules: Keep / Promote shadow → live / Pause / Delete - Active live rules: Keep / Pause / Delete Submit applies all non-destructive actions atomically (one redis.set per bundle) via the new applyManageActions() helper. Dashboard becomes read-only — its old "Activate N drafts" boolean removed. New endpoints: /internal/menu/manage-rules, /internal/form/manage-rules-submit, /internal/form/manage-delete-confirm New forms: manageRulesForm, manageDeleteConfirmForm ### #B Delete confirmation Any `delete` action in the Manage submit forwards to a confirm form listing the rules about to disappear, with an explicit "I understand this is permanent" boolean. The pending action map round-trips through a disabled paragraph carrier so the second-step submit knows what to apply. ### #A Empty states - Manage menu emits a guided toast when there are zero rules at all. - Dashboard description shows "No rules yet — open ⋯ → Compose" when both rules count and recent actions are zero. ### #D Token cost transparency - Dashboard description: "Tokens used (lifetime): N in / N out (~$X on gpt-5.4-mini)" via the new estimateTokenCost() helper. - Compose confirm form shows per-compile cost so the moderator sees the unit economics before saving. ## Tier 3 — Onboarding (audit #C) Dashboard shows a 3-step welcome card on first visit (per-sub Redis flag `${sub}:onboarding:dismissed`). Cancel-label switches to "Don't show intro again" and ticks dismissOnboarding to persist the flag. New key in src/shared/redis-keys.ts. ## Verification - `npm run check` 4/4 gates green - 209 unit + integration tests (1 skipped): - 33 compose tests (8 updated for new 2-step flow + 4 new for #2/#5/#7) - 13 dashboard tests (4 updated for read-only + 4 new for onboarding/empty) - 17 manage tests (NEW file routes-manage.test.ts: menu render, all 5 actions, delete confirm round-trip) - existing trigger / scheduler / settings / undo / evaluator / executor unchanged ## Why a single PR Tier 1+2+3 all touch overlapping files (index.ts, devvit.json, the dashboard form shape). Splitting would force back-and-forth conflict resolution. Logical commits inside this PR (split when applicable in follow-up). Refs: claudedocs/2026-05-14-ux-best-practices-plan.md (Phase 1.7a), claudedocs/2026-05-14-compose-flow-audit.md (audit baseline), docs/demo-scenario.md (downstream dependency)
What
scripts/openai-smoketest.tsnow records per-call latency and acceptsOPENAI_MODELS=a,b,cto run the suite against several models and print a comparison table (pass · median/max latency · avg output tokens).Measured (real key, free-daily-usage tier — cost not a factor here)
Conclusion:
gpt-5.4-nanoandgpt-5.4-miniare the right picks — fast (~1.2–1.4s), no reasoning overhead, 7/7 on the rule-compile suite. Reasoning models (gpt-5-mini, o-series) are both slower and need a much bigger token cap, so they're a bad fit for this structured NL→JSON task.devvit.json change
Trimmed
openaiModeloptions to justgpt-5.4-nano(default) andgpt-5.4-mini, with the measured numbers in the labels. Removedgpt-5-nano(unavailable) andgpt-5-mini(unreliable with ourmax_completion_tokens: 700cap).Verify
tsc --noEmit/eslint --max-warnings 0/prettier --checkclean ·vitest151 passed (1 skipped) ·acceptance4/4 ·doctor0 hard issues · CI green.OPENAI_MODELS=gpt-5.4-nano,gpt-5.4-mini npm run openai:smoketest→ both 7/7 with a comparison table.🤖 Generated with Claude Code
Summary by CodeRabbit
변경사항
새로운 기능
gpt-5.4-nano및gpt-5.4-mini모델 옵션 추가제거됨
gpt-5.4-nano로 유지Chores