Skip to content

refactor: curate to single vLLM qwen36 profile (Qwen3.6-35B-A3B-FP8)#40

Open
toku345 wants to merge 7 commits into
mainfrom
feature/model-curation
Open

refactor: curate to single vLLM qwen36 profile (Qwen3.6-35B-A3B-FP8)#40
toku345 wants to merge 7 commits into
mainfrom
feature/model-curation

Conversation

@toku345
Copy link
Copy Markdown
Owner

@toku345 toku345 commented Apr 24, 2026

Summary

  • Curate to a single vLLM profile (qwen36) backed by Qwen/Qwen3.6-35B-A3B-FP8
  • Remove backends/trtllm/ and backends/nim/ entirely, plus legacy vLLM profiles (qwen / qwen35 / nemotron / nemotron-vl / multi)
  • Full design: docs/plans/2026-04-24-model-curation-design.md

The motivation is speed complaint on the existing stack and the desire for a single well-tuned default. Qwen3.6-35B-A3B-FP8 (Apache 2.0, 2026-04-16) covers both the AI coding agent (kakko-de) and Web analysis (briefer) use cases with significantly better quality than the current Qwen3.5 profile (SWE-bench Verified 65.8% → 73.4%) at comparable throughput on DGX Spark.

Test plan

Phase 2 smoke test (results recorded in docs/plans/2026-04-24-qwen36-smoke-test.md)

  • /health 200 OK
  • Short chat with reasoning_content field available
  • 50K token input processed (measured: 45,020 prompt tokens in 14s)
  • Single tool call → structured tool_calls output
  • Consecutive tool calls (Codex review concern vLLM #39056 — no reproduction here)
  • Throughput decode: 52 tok/s cold / 52 tok/s warm (below 60 tok/s target but on par with current Qwen3.5)
  • docker compose --profile qwen36 config valid

Phase 3 post-deletion verification

  • docker compose --profile qwen36 up still starts, /health 200 OK, chat responds
  • rg -w "trtllm|nemotron|nemotron-vl|qwen35|qwen3-coder" -g '!docs/plans/' → 0 matches
  • rg "nvcr\.io/nim|\.cache/nim|backends/nim" -g '!docs/plans/' → 0 matches
  • diff CLAUDE.md AGENTS.md → empty (AGENTS.md is a symlink to CLAUDE.md)

Notes on throughput

The 52 tok/s measurement is below the externally reported 77.74 tok/s. Likely reasons:

  • SM 12.1 exceeds PyTorch official 12.0 support (fallback kernel path)
  • FlashInfer attention backend not enabled
  • Prefix caching not enabled
  • vLLM v0.19.0 upstream image may lack DGX Spark-specific tuning

These optimizations are out of scope for this refactor and will be addressed in a follow-up PR.

🤖 Generated with Claude Code

toku345 and others added 4 commits April 24, 2026 14:18
Qwen3.6-35B-A3B-FP8 を vLLM で常用する 1 モデル / 1 プロファイル構成への精選方針を設計書化。
backends/trtllm と backends/nim は全削除、vLLM の既存 profile も qwen36 のみ残して整理する。

4 Phase 構成 (準備 / 追加 / Smoke / 削除+Doc) と Codex レビュー指摘
(image tag 明示 pin, tool calling 事前検証, 削除は Smoke pass 後) を反映。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.6-35B-A3B-FP8 を vLLM で常用するための qwen36 profile を追加する。既存の
qwen/qwen35/nemotron/nemotron-vl/multi profile は後続の Phase 3 で smoke pass
後に削除する計画。

- image: vllm/vllm-openai:v0.19.0-aarch64-cu130-ubuntu2404 (ARM64, CUDA 13.0,
  Ubuntu 24.04 を明示 pin。latest や v0.19+ 曖昧指定を避ける)
- --max-model-len 131072 (128K 開始、262K は TP=8 前提)
- --gpu-memory-utilization 0.8 (重み ~36 GiB + KV cache 分の余裕)
- --reasoning-parser qwen3 + --tool-call-parser qwen3_coder (Qwen 公式 vLLM
  Recipe 準拠)

あわせて .gitignore に .codex/ を追記(Codex 実行メタデータの混入防止)。

Design: docs/plans/2026-04-24-model-curation-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
必須 7 項目のうち 6 項目が完全 PASS、項目 6 (スループット) のみ
目標 60 tok/s 未達 (実測 52 tok/s)。Qwen3.5-35B-A3B-FP8 同等速度
で品質は大幅向上 (SWE-bench 65.8% → 73.4%)。Codex 指摘の
連続 tool call 破綻リスクは本環境では再現せず。

速度最適化候補 (本 PR スコープ外):
- --attention-backend flashinfer
- --enable-prefix-caching

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
精選後の qwen36 profile (Qwen3.6-35B-A3B-FP8) smoke test 完了を受けて、
TRT-LLM / NIM バックエンドと vLLM の既存 profile 群を全削除し、
ドキュメントを精選後構成に合わせて更新する。

削除:
- backends/trtllm/ (qwen / nemotron / multi profiles)
- backends/nim/ (qwen3-32b-dgx-spark)
- backends/vllm/compose.yml から 7 サービス
  (qwen / qwen35 / nemotron / multi 3 つ / nemotron-vl)
- backends/vllm/nginx.conf (multi 用)
- backends/vllm/scripts/vllm-up.sh (旧 Qwen3-Coder 用)

更新:
- README.md: バックエンド表を vLLM のみに、クイックスタートを qwen36 のみに
- CLAUDE.md / AGENTS.md (symlink): ディレクトリ構造 / コマンド例 /
  バックエンド固有注意を qwen36 構成で再記述
- backends/vllm/README.md: profile 表・設定・API テスト例を刷新
- docs/thinking-mode.md: 対応バックエンドを vLLM 単体に、モデル名例を更新
- docs/tool-calling.md: モデル名・image・メッセージ例を qwen36 に更新
- scripts/benchmark.py: docstring のモデル例示を更新

検証:
- 削除後 docker compose --profile qwen36 config valid
- Phase 3 編集後の /health 200 OK、chat レスポンス正常
- 残存チェック (trtllm/nemotron/qwen35/qwen3-coder/NIM) 全 0 件
- diff CLAUDE.md AGENTS.md 空 (symlink)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cedd8b5c-116a-47a1-bdee-115bcd409be5

📥 Commits

Reviewing files that changed from the base of the PR and between 8139b86 and b1583e1.

📒 Files selected for processing (2)
  • .github/dependabot.yml
  • backends/vllm/compose.yml
💤 Files with no reviewable changes (1)
  • .github/dependabot.yml

📝 Walkthrough

Walkthrough

本PRは、TensorRT‑LLM・NIMを含む複数バックエンドを削除して単一の vLLM(Qwen/Qwen3.6-35B-A3B-FP8)構成へ統合し、関連するCompose設定/スクリプト/READMEを置換・削除、モデル重み取得やvLLM起動フローをドキュメント化する変更群です。

Changes

Cohort / File(s) Summary
Version Control設定
\.gitignore
.codex/ を無視リストに追加。
NIMバックエンド削除
backends/nim/README.md, backends/nim/compose.yml
NIM向けREADMEとComposeサービス定義を削除(NGC/キャッシュ/サービス設定を除去)。
TensorRT‑LLMバックエンド削除
backends/trtllm/...
backends/trtllm/README.md, backends/trtllm/compose.yml, backends/trtllm/nginx.conf, backends/trtllm/*.yaml
TRT‑LLM関連のドキュメント・Compose・Nginx・個別YAMLを一括削除(マルチモデルプロキシ等を撤去)。
vLLMバックエンド統合
backends/vllm/README.md, backends/vllm/compose.yml, backends/vllm/nginx.conf, backends/vllm/scripts/vllm-up.sh
複数プロファイルを廃止して qwen36 プロファイルへ統合。イメージ固定化、モデル指定(Qwen3.6-35B-A3B-FP8)と起動フラグ(--reasoning-parser 等)を更新、旧プロキシ/追加サービスを削除。
ドキュメント更新
CLAUDE.md, README.md, docs/thinking-mode.md, docs/tool-calling.md, scripts/benchmark.py
バックエンド参照を vLLM 単一化、モデル名を Qwen/Qwen3.6-35B-A3B-FP8 に統一、起動/API/ベンチ例を置換・簡素化。
CI/Dependabot調整
.github/dependabot.yml
trtllm と nim 用の Dependabot 更新エントリを削除(vllm エントリは残存)。
計画・テスト記録追加
docs/plans/2026-04-24-model-curation-design.md, docs/plans/2026-04-24-qwen36-smoke-test.md
モデル統一の設計ドキュメントと実施済みスモークテスト結果を追加(手順・チェックリスト・測定結果)。
思考モード・ツール呼び出し文書化
docs/thinking-mode.md, docs/tool-calling.md
reasoning_content/qwen3パーサーとqwen3_coderツールコール振る舞いをQwen3.6向けに更新。

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer/CI
    participant HF as HuggingFace
    participant FS as Host filesystem
    participant Compose as Docker Compose
    participant vLLM as vLLM (Qwen3.6)
    participant Client as API Client

    Dev->>HF: 一度だけモデル重みをダウンロード (hf_hub_download)
    HF-->>FS: 配置されたモデル重み
    Dev->>FS: モデルを ~/model_weights/Qwen/... に配置
    Dev->>Compose: docker compose --profile qwen36 up
    Compose->>vLLM: 起動(ピン留めイメージ、--reasoning-parser 等の引数)
    vLLM->>FS: /app/model からモデルをロード
    vLLM-->>Compose: /health == 200
    Client->>vLLM: POST /v1/chat/completions (tool_calls / reasoning_content)
    vLLM-->>Client: 応答(content, reasoning_content, tool_calls)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: consolidating to a single vLLM profile (qwen36) with the specific model Qwen3.6-35B-A3B-FP8, which is directly reflected in the changeset.
Description check ✅ Passed The description is directly related to the changeset, providing clear context about the consolidation to a single vLLM profile, removal of legacy backends, test results, and motivation.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/model-curation

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/thinking-mode.md (1)

65-88: ⚠️ Potential issue | 🟠 Major

バックエンド削除方針と矛盾する注記が残っています

Line 87 の Nemotron 記載と Line 65/88 の「<think> タグをクライアントで除去必須」は、同ファイル上部(Line 8-9)の --reasoning-parser qwen3 前提と整合していません。現行の単一 qwen36 運用に合わせて、条件付き説明(例: parser 無効時のみ除去)へ修正するか、不要なら削除してください。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/thinking-mode.md` around lines 65 - 88, The docs conflict: the top
assumes `--reasoning-parser qwen3` (single qwen36 usage) but the note still
mandates client-side removal of `<think>...</think>` and mentions Nemotron
(`--reasoning_parser deepseek-r1`) and `reasoning_content`; update the section
to be consistent by either (A) changing the advice to a conditional note that
explains `<think>` tag removal is only required when a reasoning parser is not
used (e.g., when not using `--reasoning-parser qwen3` or when using parsers that
do not separate `reasoning_content`), or (B) remove the Nemotron/conditional
line entirely if not relevant to the current qwen36-only deployment; reference
the `<think>` tag, `--reasoning-parser qwen3`, `--reasoning_parser deepseek-r1`,
and `reasoning_content` when making the clarifying edit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backends/vllm/README.md`:
- Line 60: Replace the fragile, project-name-dependent docker logs invocation
(the line containing "docker logs vllm-vllm-qwen36-1 -f") with a docker
compose–based logs instruction that follows the Compose service name (e.g., use
docker compose logs -f for the vllm service) so the README uses the service name
from the compose file and works across environments; update the corresponding
line in backends/vllm/README.md to reference the compose logs command and the
vllm service instead of the fixed container name.

In `@docs/plans/2026-04-24-model-curation-design.md`:
- Around line 38-80: The fenced ASCII-art blocks (the large diagrams showing
"Before" and "After" with backends/vllm, backends/trtllm, etc.) are missing a
language identifier which triggers markdownlint MD040; update both fenced code
blocks (the one containing the "Before(現状)" diagram around the qwen.yaml/qwen36
text and the second diagram later in the file) to use a language specifier such
as ```text instead of ``` so the blocks become ```text ... ```; ensure you
change both occurrences referenced in the diff (the diagram beginning near the
"Before(現状)" header and the "After(精選後)" diagram) to silence MD040.

---

Outside diff comments:
In `@docs/thinking-mode.md`:
- Around line 65-88: The docs conflict: the top assumes `--reasoning-parser
qwen3` (single qwen36 usage) but the note still mandates client-side removal of
`<think>...</think>` and mentions Nemotron (`--reasoning_parser deepseek-r1`)
and `reasoning_content`; update the section to be consistent by either (A)
changing the advice to a conditional note that explains `<think>` tag removal is
only required when a reasoning parser is not used (e.g., when not using
`--reasoning-parser qwen3` or when using parsers that do not separate
`reasoning_content`), or (B) remove the Nemotron/conditional line entirely if
not relevant to the current qwen36-only deployment; reference the `<think>` tag,
`--reasoning-parser qwen3`, `--reasoning_parser deepseek-r1`, and
`reasoning_content` when making the clarifying edit.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f45cfb9c-1251-4968-9d1a-9c4f48d7d271

📥 Commits

Reviewing files that changed from the base of the PR and between 50bcbb6 and 90c4eab.

📒 Files selected for processing (20)
  • .gitignore
  • CLAUDE.md
  • README.md
  • backends/nim/README.md
  • backends/nim/compose.yml
  • backends/trtllm/README.md
  • backends/trtllm/compose.yml
  • backends/trtllm/nano_v3.yaml
  • backends/trtllm/nginx.conf
  • backends/trtllm/qwen.yaml
  • backends/trtllm/qwen_multi.yaml
  • backends/vllm/README.md
  • backends/vllm/compose.yml
  • backends/vllm/nginx.conf
  • backends/vllm/scripts/vllm-up.sh
  • docs/plans/2026-04-24-model-curation-design.md
  • docs/plans/2026-04-24-qwen36-smoke-test.md
  • docs/thinking-mode.md
  • docs/tool-calling.md
  • scripts/benchmark.py
💤 Files with no reviewable changes (10)
  • backends/trtllm/qwen_multi.yaml
  • backends/trtllm/nano_v3.yaml
  • backends/trtllm/qwen.yaml
  • backends/nim/compose.yml
  • backends/vllm/nginx.conf
  • backends/nim/README.md
  • backends/vllm/scripts/vllm-up.sh
  • backends/trtllm/README.md
  • backends/trtllm/nginx.conf
  • backends/trtllm/compose.yml

Comment thread backends/vllm/README.md Outdated
Comment thread docs/plans/2026-04-24-model-curation-design.md Outdated
@toku345 toku345 self-assigned this Apr 24, 2026
toku345 and others added 3 commits April 24, 2026 17:06
- docs/thinking-mode.md: 単一 qwen36 / --reasoning-parser qwen3 運用前提に合わせ、
  「クライアント側での処理」を書き換え。手動タグ除去は parser 未使用時のみ必要と
  明示し、Nemotron 固有の注記は削除。
- backends/vllm/README.md: トラブルシューティングのログ確認コマンドを
  docker logs (プロジェクト名依存で壊れやすい) から
  docker compose --profile qwen36 logs -f vllm-qwen36 に変更。
- docs/plans/2026-04-24-model-curation-design.md: ASCII-art フェンスコードブロックに
  text 言語指定を追加 (markdownlint MD040)。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop dependabot entries for /backends/trtllm and /backends/nim
(removed in 90c4eab), and remove the legacy NGC image from compose
x-common so any future service must explicitly pin its own image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
unless-stopped silently retries forever, hiding engine failures
(e.g. SM120 illegal-memory-access) during 300s healthcheck start_period.
on-failure:3 lets compose ps surface the failure after 3 retries,
matching the project's fail-loud principle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant