Skip to content

feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options#10

Merged
ComBba merged 1 commit into
mainfrom
feat/model-comparison
May 12, 2026
Merged

feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options#10
ComBba merged 1 commit into
mainfrom
feat/model-comparison

Conversation

@ComBba
Copy link
Copy Markdown
Contributor

@ComBba ComBba commented May 12, 2026

What

scripts/openai-smoketest.ts now records per-call latency and accepts OPENAI_MODELS=a,b,c to run the suite against several models and print a comparison table (pass · median/max latency · avg output tokens).

Measured (real key, free-daily-usage tier — cost not a factor here)

model pass median max avg out tok
gpt-5.4-nano 7/7 1387ms 2290ms ~110
gpt-5.4-mini 7/7 1213ms 1346ms ~100
gpt-5-mini 3/7
gpt-5-nano n/a
gpt-4o-mini 6/7 1859ms 2999ms ~90
gpt-4.1-mini / gpt-4.1-nano n/a

Conclusion: gpt-5.4-nano and gpt-5.4-mini are the right picks — fast (~1.2–1.4s), no reasoning overhead, 7/7 on the rule-compile suite. Reasoning models (gpt-5-mini, o-series) are both slower and need a much bigger token cap, so they're a bad fit for this structured NL→JSON task.

devvit.json change

Trimmed openaiModel options to just gpt-5.4-nano (default) and gpt-5.4-mini, with the measured numbers in the labels. Removed gpt-5-nano (unavailable) and gpt-5-mini (unreliable with our max_completion_tokens: 700 cap).

Verify

  • tsc --noEmit / eslint --max-warnings 0 / prettier --check clean · vitest 151 passed (1 skipped) · acceptance 4/4 · doctor 0 hard issues · CI green.
  • OPENAI_MODELS=gpt-5.4-nano,gpt-5.4-mini npm run openai:smoketest → both 7/7 with a comparison table.

🤖 Generated with Claude Code

Summary by CodeRabbit

변경사항

  • 새로운 기능

    • OpenAI 모델 옵션 업데이트: 새로운 gpt-5.4-nanogpt-5.4-mini 모델 옵션 추가
  • 제거됨

    • 이전 모델 옵션 제거. 기본값은 gpt-5.4-nano로 유지
  • Chores

    • 다중 모델 테스트 및 성능 비교 기능 개선

Review Change Stack

…im devvit.json model list

scripts/openai-smoketest.ts now records per-call wall-clock latency and accepts
OPENAI_MODELS=a,b,c to run the suite against several models and print a
comparison table (pass rate · median/max latency · avg output tokens).

Ran the comparison against a real key (free-daily-usage tier — cost not a factor):
  gpt-5.4-nano   7/7   median 1387ms  max 2290ms  ~110 out tok
  gpt-5.4-mini   7/7   median 1213ms  max 1346ms  ~100 out tok   ← fastest + most consistent
  gpt-5-mini     3/7   — reasoning model; burns the 700-tok budget on hidden reasoning
                       so the JSON gets truncated ("Unexpected end of JSON input")
  gpt-5-nano     n/a   — 403, not available to this project
  gpt-4o-mini    6/7   median 1859ms  max 2999ms  — slowest + one miss
  gpt-4.1-mini / gpt-4.1-nano  n/a — 403

So `gpt-5.4-nano` and `gpt-5.4-mini` are the right picks (fast, no reasoning
overhead, 7/7). Trimmed devvit.json's openaiModel options to just those two —
removed `gpt-5-nano` (unavailable) and `gpt-5-mini` (unreliable with our token
cap) — and updated the labels with the measured numbers. Default stays nano
(cheapest, 7/7); switch to mini if cost is free for you.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@ComBba ComBba merged commit 3f5fa4d into main May 12, 2026
1 check passed
@ComBba ComBba deleted the feat/model-comparison branch May 12, 2026 10:44
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the OpenAI smoke test script by adding support for multi-model comparisons and latency tracking. Key changes include refactoring the execution logic into a reusable runModel function, updating the compile function to measure request duration, and adding a comparison table for performance metrics. A bug was identified in the latency reporting logic where the minimum latency calculation would always result in zero due to an incorrect use of Math.min.

const med = s.ms.length ? [...s.ms].sort((a, b) => a - b)[Math.floor(s.ms.length / 2)] : 0;
console.log(
`${pass}/${CASES.length} cases as expected. tokens: ${totIn} in / ${totOut} out ≈ $${cost.toFixed(5)} (at ${MODEL} list price)`,
` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${Math.round(Math.min(...s.ms, 0))} / max ${Math.round(Math.max(...s.ms, 0))}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for calculating the minimum latency is incorrect. Math.min(...s.ms, 0) will always return 0 because all measured latencies are positive numbers. This results in the reported min value always being 0ms regardless of the actual performance. You should guard against empty arrays and remove the 0 from the Math.min arguments to capture the actual minimum measured latency.

Suggested change
` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${Math.round(Math.min(...s.ms, 0))} / max ${Math.round(Math.max(...s.ms, 0))}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,
` → ${s.pass}/${s.total} as expected · latency median ${Math.round(med)}ms (min ${s.ms.length ? Math.round(Math.min(...s.ms)) : 0} / max ${s.ms.length ? Math.round(Math.max(...s.ms)) : 0}) · tokens ${s.tokensIn} in / ${s.tokensOut} out · ≈ $${cost.toFixed(5)} (at gpt-5.4-nano list price)\n`,

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6830ae2c-3ee5-41c1-94df-af470b5137a9

📥 Commits

Reviewing files that changed from the base of the PR and between 5588f8d and 808be5a.

📒 Files selected for processing (2)
  • devvit.json
  • scripts/openai-smoketest.ts

개요

devvit.json의 OpenAI 모델 옵션을 gpt-5.4-nano 및 gpt-5.4-mini로 업데이트하고, 스모크 테스트를 단일 모델 실행에서 여러 모델의 성능을 비교하는 다중 모델 러너로 변환합니다. 각 모델별 요청 지연 시간, 통과 횟수, 토큰 및 비용 통계를 추적합니다.

변경 사항

모델 구성 및 스모크 테스트 개선

Layer / File(s) 요약
모델 옵션 설정 업데이트
devvit.json
openaiModel 선택지를 새로운 gpt-5.4-nano 및 gpt-5.4-mini 옵션으로 업데이트하고 레이블 및 설명을 포함합니다.
스모크 테스트 다중 모델 인프라
scripts/openai-smoketest.ts
node:perf_hooks에서 performance를 임포트하고, OPENAI_MODELS 환경 변수를 파싱하여 여러 모델을 대상으로 실행하기 위한 모델 목록을 작성합니다.
요청 지연 시간 및 결과 추적
scripts/openai-smoketest.ts
ApiResult 타입을 확장하여 모든 결과에 지연 시간을 포함하고, compile 함수를 업데이트하여 모델 매개변수를 받으며, fetch 호출 주변에 타이밍을 추가합니다.
다중 모델 비교 실행기 및 통계
scripts/openai-smoketest.ts
ModelSummary를 도입하고 runModel() 함수를 구현하여 각 모델의 통과 횟수, 지연 시간 중앙값, 토큰 및 비용을 계산합니다. 주 실행기는 모든 구성된 모델을 반복하고 비교 테이블을 출력하며, 치명적 오류 시 0이 아닌 코드로 종료합니다.

예상 코드 리뷰 노력

🎯 3 (보통) | ⏱️ ~20분

축하 시

토끼의 마음에서 나온 축하의 말 🐰

새로운 모델들이 나타났네요, gpt-5.4가 왔어!
스모크 테스트는 이제 비교 테이블을 춤춘답니다,
지연 시간을 재고, 비용을 계산하며,
여러 모델이 경쟁하니 정말 멋져요! 🚀✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/model-comparison

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

ComBba added a commit that referenced this pull request May 15, 2026
feat(openai-smoketest): latency + multi-model comparison; trim devvit.json model options
ComBba pushed a commit that referenced this pull request May 15, 2026
…ng (Phase 1.7b Tier 1+2+3)

Best-practice UX rework for the 5 deferred audit findings (#2 #3 #5 #7 #10) plus
4 new items (#A empty states, #B delete confirm, #C onboarding, #D token cost).
Design + tier rationale: claudedocs/2026-05-14-ux-best-practices-plan.md.

## Tier 1 — compose flow (audit #2 #5 #7)

### #2 Compile preview confirmation form (composeConfirmForm)
Compile no longer persists straight to draft. After validation, the moderator
sees a confirm form with the rule rendered in plain English (humanizeRule()),
the original sentence, and the per-compile token cost (~$0.0006 on
gpt-5.4-mini). Tick "Edit instead of save" to re-open compose with the
original NL pre-filled. Save persists draft + schedules dry-run via the new
persistRuleAndStartDryRun() helper. Defence-in-depth: serialized rule is
re-validated before write.

New endpoint: /internal/form/compose-confirm-submit
New form: composeConfirmForm

### #5 Clarification turn limit (3 rounds)
Server-side counter shipped as a disabled paragraph field. After round 3, an
oscillating LLM gets refused with an actionable toast ("Try rephrasing more
concretely"), instead of opening yet another modal. Each clarify modal
description prefixed with "(Round X of 3)" so the moderator sees how many
tries remain.

### #7 Editable original rule in clarify modal
Removed `disabled: true` on the Original rule field. helpText now says
"Re-compile uses this text plus your answer below."

## Tier 2 — Manage rules menu + dashboard hardening (audit #3 #10 #A #B #D)

### #3 + #10 Per-rule control surface (Manage rules)
New "vibe-mod: Manage rules" menu (subreddit-level, mod-only). Renders one
form-group per rule with a `select` of available actions:
- Drafts: Keep / Activate (shadow 24h) / Activate immediately / Delete
- Active shadow rules: Keep / Promote shadow → live / Pause / Delete
- Active live rules: Keep / Pause / Delete
Submit applies all non-destructive actions atomically (one redis.set per
bundle) via the new applyManageActions() helper. Dashboard becomes
read-only — its old "Activate N drafts" boolean removed.

New endpoints: /internal/menu/manage-rules,
               /internal/form/manage-rules-submit,
               /internal/form/manage-delete-confirm
New forms: manageRulesForm, manageDeleteConfirmForm

### #B Delete confirmation
Any `delete` action in the Manage submit forwards to a confirm form listing
the rules about to disappear, with an explicit "I understand this is
permanent" boolean. The pending action map round-trips through a disabled
paragraph carrier so the second-step submit knows what to apply.

### #A Empty states
- Manage menu emits a guided toast when there are zero rules at all.
- Dashboard description shows "No rules yet — open ⋯ → Compose" when both
  rules count and recent actions are zero.

### #D Token cost transparency
- Dashboard description: "Tokens used (lifetime): N in / N out (~$X on
  gpt-5.4-mini)" via the new estimateTokenCost() helper.
- Compose confirm form shows per-compile cost so the moderator sees the
  unit economics before saving.

## Tier 3 — Onboarding (audit #C)

Dashboard shows a 3-step welcome card on first visit (per-sub Redis flag
`${sub}:onboarding:dismissed`). Cancel-label switches to "Don't show intro
again" and ticks dismissOnboarding to persist the flag. New key in
src/shared/redis-keys.ts.

## Verification
- `npm run check` 4/4 gates green
- 209 unit + integration tests (1 skipped):
  - 33 compose tests (8 updated for new 2-step flow + 4 new for #2/#5/#7)
  - 13 dashboard tests (4 updated for read-only + 4 new for onboarding/empty)
  - 17 manage tests (NEW file routes-manage.test.ts: menu render, all 5 actions, delete confirm round-trip)
  - existing trigger / scheduler / settings / undo / evaluator / executor unchanged

## Why a single PR
Tier 1+2+3 all touch overlapping files (index.ts, devvit.json, the dashboard form
shape). Splitting would force back-and-forth conflict resolution. Logical
commits inside this PR (split when applicable in follow-up).

Refs: claudedocs/2026-05-14-ux-best-practices-plan.md (Phase 1.7a),
      claudedocs/2026-05-14-compose-flow-audit.md (audit baseline),
      docs/demo-scenario.md (downstream dependency)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant