refactor: curate to single vLLM qwen36 profile (Qwen3.6-35B-A3B-FP8)#40
refactor: curate to single vLLM qwen36 profile (Qwen3.6-35B-A3B-FP8)#40toku345 wants to merge 7 commits into
Conversation
Qwen3.6-35B-A3B-FP8 を vLLM で常用する 1 モデル / 1 プロファイル構成への精選方針を設計書化。 backends/trtllm と backends/nim は全削除、vLLM の既存 profile も qwen36 のみ残して整理する。 4 Phase 構成 (準備 / 追加 / Smoke / 削除+Doc) と Codex レビュー指摘 (image tag 明示 pin, tool calling 事前検証, 削除は Smoke pass 後) を反映。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.6-35B-A3B-FP8 を vLLM で常用するための qwen36 profile を追加する。既存の qwen/qwen35/nemotron/nemotron-vl/multi profile は後続の Phase 3 で smoke pass 後に削除する計画。 - image: vllm/vllm-openai:v0.19.0-aarch64-cu130-ubuntu2404 (ARM64, CUDA 13.0, Ubuntu 24.04 を明示 pin。latest や v0.19+ 曖昧指定を避ける) - --max-model-len 131072 (128K 開始、262K は TP=8 前提) - --gpu-memory-utilization 0.8 (重み ~36 GiB + KV cache 分の余裕) - --reasoning-parser qwen3 + --tool-call-parser qwen3_coder (Qwen 公式 vLLM Recipe 準拠) あわせて .gitignore に .codex/ を追記(Codex 実行メタデータの混入防止)。 Design: docs/plans/2026-04-24-model-curation-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
必須 7 項目のうち 6 項目が完全 PASS、項目 6 (スループット) のみ 目標 60 tok/s 未達 (実測 52 tok/s)。Qwen3.5-35B-A3B-FP8 同等速度 で品質は大幅向上 (SWE-bench 65.8% → 73.4%)。Codex 指摘の 連続 tool call 破綻リスクは本環境では再現せず。 速度最適化候補 (本 PR スコープ外): - --attention-backend flashinfer - --enable-prefix-caching Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
精選後の qwen36 profile (Qwen3.6-35B-A3B-FP8) smoke test 完了を受けて、 TRT-LLM / NIM バックエンドと vLLM の既存 profile 群を全削除し、 ドキュメントを精選後構成に合わせて更新する。 削除: - backends/trtllm/ (qwen / nemotron / multi profiles) - backends/nim/ (qwen3-32b-dgx-spark) - backends/vllm/compose.yml から 7 サービス (qwen / qwen35 / nemotron / multi 3 つ / nemotron-vl) - backends/vllm/nginx.conf (multi 用) - backends/vllm/scripts/vllm-up.sh (旧 Qwen3-Coder 用) 更新: - README.md: バックエンド表を vLLM のみに、クイックスタートを qwen36 のみに - CLAUDE.md / AGENTS.md (symlink): ディレクトリ構造 / コマンド例 / バックエンド固有注意を qwen36 構成で再記述 - backends/vllm/README.md: profile 表・設定・API テスト例を刷新 - docs/thinking-mode.md: 対応バックエンドを vLLM 単体に、モデル名例を更新 - docs/tool-calling.md: モデル名・image・メッセージ例を qwen36 に更新 - scripts/benchmark.py: docstring のモデル例示を更新 検証: - 削除後 docker compose --profile qwen36 config valid - Phase 3 編集後の /health 200 OK、chat レスポンス正常 - 残存チェック (trtllm/nemotron/qwen35/qwen3-coder/NIM) 全 0 件 - diff CLAUDE.md AGENTS.md 空 (symlink) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthrough本PRは、TensorRT‑LLM・NIMを含む複数バックエンドを削除して単一の vLLM(Qwen/Qwen3.6-35B-A3B-FP8)構成へ統合し、関連するCompose設定/スクリプト/READMEを置換・削除、モデル重み取得やvLLM起動フローをドキュメント化する変更群です。 Changes
Sequence Diagram(s)sequenceDiagram
participant Dev as Developer/CI
participant HF as HuggingFace
participant FS as Host filesystem
participant Compose as Docker Compose
participant vLLM as vLLM (Qwen3.6)
participant Client as API Client
Dev->>HF: 一度だけモデル重みをダウンロード (hf_hub_download)
HF-->>FS: 配置されたモデル重み
Dev->>FS: モデルを ~/model_weights/Qwen/... に配置
Dev->>Compose: docker compose --profile qwen36 up
Compose->>vLLM: 起動(ピン留めイメージ、--reasoning-parser 等の引数)
vLLM->>FS: /app/model からモデルをロード
vLLM-->>Compose: /health == 200
Client->>vLLM: POST /v1/chat/completions (tool_calls / reasoning_content)
vLLM-->>Client: 応答(content, reasoning_content, tool_calls)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/thinking-mode.md (1)
65-88:⚠️ Potential issue | 🟠 Majorバックエンド削除方針と矛盾する注記が残っています
Line 87 の Nemotron 記載と Line 65/88 の「
<think>タグをクライアントで除去必須」は、同ファイル上部(Line 8-9)の--reasoning-parser qwen3前提と整合していません。現行の単一 qwen36 運用に合わせて、条件付き説明(例: parser 無効時のみ除去)へ修正するか、不要なら削除してください。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/thinking-mode.md` around lines 65 - 88, The docs conflict: the top assumes `--reasoning-parser qwen3` (single qwen36 usage) but the note still mandates client-side removal of `<think>...</think>` and mentions Nemotron (`--reasoning_parser deepseek-r1`) and `reasoning_content`; update the section to be consistent by either (A) changing the advice to a conditional note that explains `<think>` tag removal is only required when a reasoning parser is not used (e.g., when not using `--reasoning-parser qwen3` or when using parsers that do not separate `reasoning_content`), or (B) remove the Nemotron/conditional line entirely if not relevant to the current qwen36-only deployment; reference the `<think>` tag, `--reasoning-parser qwen3`, `--reasoning_parser deepseek-r1`, and `reasoning_content` when making the clarifying edit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backends/vllm/README.md`:
- Line 60: Replace the fragile, project-name-dependent docker logs invocation
(the line containing "docker logs vllm-vllm-qwen36-1 -f") with a docker
compose–based logs instruction that follows the Compose service name (e.g., use
docker compose logs -f for the vllm service) so the README uses the service name
from the compose file and works across environments; update the corresponding
line in backends/vllm/README.md to reference the compose logs command and the
vllm service instead of the fixed container name.
In `@docs/plans/2026-04-24-model-curation-design.md`:
- Around line 38-80: The fenced ASCII-art blocks (the large diagrams showing
"Before" and "After" with backends/vllm, backends/trtllm, etc.) are missing a
language identifier which triggers markdownlint MD040; update both fenced code
blocks (the one containing the "Before(現状)" diagram around the qwen.yaml/qwen36
text and the second diagram later in the file) to use a language specifier such
as ```text instead of ``` so the blocks become ```text ... ```; ensure you
change both occurrences referenced in the diff (the diagram beginning near the
"Before(現状)" header and the "After(精選後)" diagram) to silence MD040.
---
Outside diff comments:
In `@docs/thinking-mode.md`:
- Around line 65-88: The docs conflict: the top assumes `--reasoning-parser
qwen3` (single qwen36 usage) but the note still mandates client-side removal of
`<think>...</think>` and mentions Nemotron (`--reasoning_parser deepseek-r1`)
and `reasoning_content`; update the section to be consistent by either (A)
changing the advice to a conditional note that explains `<think>` tag removal is
only required when a reasoning parser is not used (e.g., when not using
`--reasoning-parser qwen3` or when using parsers that do not separate
`reasoning_content`), or (B) remove the Nemotron/conditional line entirely if
not relevant to the current qwen36-only deployment; reference the `<think>` tag,
`--reasoning-parser qwen3`, `--reasoning_parser deepseek-r1`, and
`reasoning_content` when making the clarifying edit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f45cfb9c-1251-4968-9d1a-9c4f48d7d271
📒 Files selected for processing (20)
.gitignoreCLAUDE.mdREADME.mdbackends/nim/README.mdbackends/nim/compose.ymlbackends/trtllm/README.mdbackends/trtllm/compose.ymlbackends/trtllm/nano_v3.yamlbackends/trtllm/nginx.confbackends/trtllm/qwen.yamlbackends/trtllm/qwen_multi.yamlbackends/vllm/README.mdbackends/vllm/compose.ymlbackends/vllm/nginx.confbackends/vllm/scripts/vllm-up.shdocs/plans/2026-04-24-model-curation-design.mddocs/plans/2026-04-24-qwen36-smoke-test.mddocs/thinking-mode.mddocs/tool-calling.mdscripts/benchmark.py
💤 Files with no reviewable changes (10)
- backends/trtllm/qwen_multi.yaml
- backends/trtllm/nano_v3.yaml
- backends/trtllm/qwen.yaml
- backends/nim/compose.yml
- backends/vllm/nginx.conf
- backends/nim/README.md
- backends/vllm/scripts/vllm-up.sh
- backends/trtllm/README.md
- backends/trtllm/nginx.conf
- backends/trtllm/compose.yml
- docs/thinking-mode.md: 単一 qwen36 / --reasoning-parser qwen3 運用前提に合わせ、 「クライアント側での処理」を書き換え。手動タグ除去は parser 未使用時のみ必要と 明示し、Nemotron 固有の注記は削除。 - backends/vllm/README.md: トラブルシューティングのログ確認コマンドを docker logs (プロジェクト名依存で壊れやすい) から docker compose --profile qwen36 logs -f vllm-qwen36 に変更。 - docs/plans/2026-04-24-model-curation-design.md: ASCII-art フェンスコードブロックに text 言語指定を追加 (markdownlint MD040)。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop dependabot entries for /backends/trtllm and /backends/nim (removed in 90c4eab), and remove the legacy NGC image from compose x-common so any future service must explicitly pin its own image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
unless-stopped silently retries forever, hiding engine failures (e.g. SM120 illegal-memory-access) during 300s healthcheck start_period. on-failure:3 lets compose ps surface the failure after 3 retries, matching the project's fail-loud principle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
qwen36) backed byQwen/Qwen3.6-35B-A3B-FP8backends/trtllm/andbackends/nim/entirely, plus legacy vLLM profiles (qwen/qwen35/nemotron/nemotron-vl/multi)The motivation is speed complaint on the existing stack and the desire for a single well-tuned default. Qwen3.6-35B-A3B-FP8 (Apache 2.0, 2026-04-16) covers both the AI coding agent (kakko-de) and Web analysis (briefer) use cases with significantly better quality than the current Qwen3.5 profile (SWE-bench Verified 65.8% → 73.4%) at comparable throughput on DGX Spark.
Test plan
Phase 2 smoke test (results recorded in docs/plans/2026-04-24-qwen36-smoke-test.md)
/health200 OKreasoning_contentfield availabletool_callsoutputdocker compose --profile qwen36 configvalidPhase 3 post-deletion verification
docker compose --profile qwen36 upstill starts,/health200 OK, chat respondsrg -w "trtllm|nemotron|nemotron-vl|qwen35|qwen3-coder" -g '!docs/plans/'→ 0 matchesrg "nvcr\.io/nim|\.cache/nim|backends/nim" -g '!docs/plans/'→ 0 matchesdiff CLAUDE.md AGENTS.md→ empty (AGENTS.md is a symlink to CLAUDE.md)Notes on throughput
The 52 tok/s measurement is below the externally reported 77.74 tok/s. Likely reasons:
These optimizations are out of scope for this refactor and will be addressed in a follow-up PR.
🤖 Generated with Claude Code