refactor: curate to single vLLM qwen36 profile (Qwen3.6-35B-A3B-FP8) by toku345 · Pull Request #40 · toku345/dgx-llm-serve

toku345 · 2026-04-24T07:40:17Z

Summary

Curate to a single vLLM profile (qwen36) backed by Qwen/Qwen3.6-35B-A3B-FP8
Remove backends/trtllm/ and backends/nim/ entirely, plus legacy vLLM profiles (qwen / qwen35 / nemotron / nemotron-vl / multi)
Full design: docs/plans/2026-04-24-model-curation-design.md

The motivation is speed complaint on the existing stack and the desire for a single well-tuned default. Qwen3.6-35B-A3B-FP8 (Apache 2.0, 2026-04-16) covers both the AI coding agent (kakko-de) and Web analysis (briefer) use cases with significantly better quality than the current Qwen3.5 profile (SWE-bench Verified 65.8% → 73.4%) at comparable throughput on DGX Spark.

Test plan

Phase 2 smoke test (results recorded in docs/plans/2026-04-24-qwen36-smoke-test.md)

/health 200 OK
Short chat with reasoning_content field available
50K token input processed (measured: 45,020 prompt tokens in 14s)
Single tool call → structured tool_calls output
Consecutive tool calls (Codex review concern vLLM #39056 — no reproduction here)
Throughput decode: 52 tok/s cold / 52 tok/s warm (below 60 tok/s target but on par with current Qwen3.5)
docker compose --profile qwen36 config valid

Phase 3 post-deletion verification

docker compose --profile qwen36 up still starts, /health 200 OK, chat responds
rg -w "trtllm|nemotron|nemotron-vl|qwen35|qwen3-coder" -g '!docs/plans/' → 0 matches
rg "nvcr\.io/nim|\.cache/nim|backends/nim" -g '!docs/plans/' → 0 matches
diff CLAUDE.md AGENTS.md → empty (AGENTS.md is a symlink to CLAUDE.md)

Notes on throughput

The 52 tok/s measurement is below the externally reported 77.74 tok/s. Likely reasons:

SM 12.1 exceeds PyTorch official 12.0 support (fallback kernel path)
FlashInfer attention backend not enabled
Prefix caching not enabled
vLLM v0.19.0 upstream image may lack DGX Spark-specific tuning

These optimizations are out of scope for this refactor and will be addressed in a follow-up PR.

🤖 Generated with Claude Code

Qwen3.6-35B-A3B-FP8 を vLLM で常用する 1 モデル / 1 プロファイル構成への精選方針を設計書化。 backends/trtllm と backends/nim は全削除、vLLM の既存 profile も qwen36 のみ残して整理する。 4 Phase 構成 (準備 / 追加 / Smoke / 削除+Doc) と Codex レビュー指摘 (image tag 明示 pin, tool calling 事前検証, 削除は Smoke pass 後) を反映。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Qwen3.6-35B-A3B-FP8 を vLLM で常用するための qwen36 profile を追加する。既存の qwen/qwen35/nemotron/nemotron-vl/multi profile は後続の Phase 3 で smoke pass 後に削除する計画。 - image: vllm/vllm-openai:v0.19.0-aarch64-cu130-ubuntu2404 (ARM64, CUDA 13.0, Ubuntu 24.04 を明示 pin。latest や v0.19+ 曖昧指定を避ける) - --max-model-len 131072 (128K 開始、262K は TP=8 前提) - --gpu-memory-utilization 0.8 (重み ~36 GiB + KV cache 分の余裕) - --reasoning-parser qwen3 + --tool-call-parser qwen3_coder (Qwen 公式 vLLM Recipe 準拠) あわせて .gitignore に .codex/ を追記（Codex 実行メタデータの混入防止）。 Design: docs/plans/2026-04-24-model-curation-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

必須 7 項目のうち 6 項目が完全 PASS、項目 6 (スループット) のみ目標 60 tok/s 未達 (実測 52 tok/s)。Qwen3.5-35B-A3B-FP8 同等速度で品質は大幅向上 (SWE-bench 65.8% → 73.4%)。Codex 指摘の連続 tool call 破綻リスクは本環境では再現せず。速度最適化候補 (本 PR スコープ外): - --attention-backend flashinfer - --enable-prefix-caching Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

精選後の qwen36 profile (Qwen3.6-35B-A3B-FP8) smoke test 完了を受けて、 TRT-LLM / NIM バックエンドと vLLM の既存 profile 群を全削除し、ドキュメントを精選後構成に合わせて更新する。削除: - backends/trtllm/ (qwen / nemotron / multi profiles) - backends/nim/ (qwen3-32b-dgx-spark) - backends/vllm/compose.yml から 7 サービス (qwen / qwen35 / nemotron / multi 3 つ / nemotron-vl) - backends/vllm/nginx.conf (multi 用) - backends/vllm/scripts/vllm-up.sh (旧 Qwen3-Coder 用) 更新: - README.md: バックエンド表を vLLM のみに、クイックスタートを qwen36 のみに - CLAUDE.md / AGENTS.md (symlink): ディレクトリ構造 / コマンド例 / バックエンド固有注意を qwen36 構成で再記述 - backends/vllm/README.md: profile 表・設定・API テスト例を刷新 - docs/thinking-mode.md: 対応バックエンドを vLLM 単体に、モデル名例を更新 - docs/tool-calling.md: モデル名・image・メッセージ例を qwen36 に更新 - scripts/benchmark.py: docstring のモデル例示を更新検証: - 削除後 docker compose --profile qwen36 config valid - Phase 3 編集後の /health 200 OK、chat レスポンス正常 - 残存チェック (trtllm/nemotron/qwen35/qwen3-coder/NIM) 全 0 件 - diff CLAUDE.md AGENTS.md 空 (symlink) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-24T07:40:41Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cedd8b5c-116a-47a1-bdee-115bcd409be5

📥 Commits

Reviewing files that changed from the base of the PR and between 8139b86 and b1583e1.

📒 Files selected for processing (2)

.github/dependabot.yml
backends/vllm/compose.yml

💤 Files with no reviewable changes (1)

.github/dependabot.yml

📝 Walkthrough

Walkthrough

本PRは、TensorRT‑LLM・NIMを含む複数バックエンドを削除して単一の vLLM（Qwen/Qwen3.6-35B-A3B-FP8）構成へ統合し、関連するCompose設定／スクリプト／READMEを置換・削除、モデル重み取得やvLLM起動フローをドキュメント化する変更群です。

Changes

Cohort / File(s)	Summary
Version Control設定 `\.gitignore`	`.codex/` を無視リストに追加。
NIMバックエンド削除 `backends/nim/README.md`, `backends/nim/compose.yml`	NIM向けREADMEとComposeサービス定義を削除（NGC/キャッシュ/サービス設定を除去）。
TensorRT‑LLMバックエンド削除 `backends/trtllm/...` `backends/trtllm/README.md`, `backends/trtllm/compose.yml`, `backends/trtllm/nginx.conf`, `backends/trtllm/*.yaml`	TRT‑LLM関連のドキュメント・Compose・Nginx・個別YAMLを一括削除（マルチモデルプロキシ等を撤去）。
vLLMバックエンド統合 `backends/vllm/README.md`, `backends/vllm/compose.yml`, `backends/vllm/nginx.conf`, `backends/vllm/scripts/vllm-up.sh`	複数プロファイルを廃止して `qwen36` プロファイルへ統合。イメージ固定化、モデル指定（Qwen3.6-35B-A3B-FP8）と起動フラグ（--reasoning-parser 等）を更新、旧プロキシ/追加サービスを削除。
ドキュメント更新 `CLAUDE.md`, `README.md`, `docs/thinking-mode.md`, `docs/tool-calling.md`, `scripts/benchmark.py`	バックエンド参照を vLLM 単一化、モデル名を `Qwen/Qwen3.6-35B-A3B-FP8` に統一、起動／API／ベンチ例を置換・簡素化。
CI/Dependabot調整 `.github/dependabot.yml`	trtllm と nim 用の Dependabot 更新エントリを削除（vllm エントリは残存）。
計画・テスト記録追加 `docs/plans/2026-04-24-model-curation-design.md`, `docs/plans/2026-04-24-qwen36-smoke-test.md`	モデル統一の設計ドキュメントと実施済みスモークテスト結果を追加（手順・チェックリスト・測定結果）。
思考モード・ツール呼び出し文書化 `docs/thinking-mode.md`, `docs/tool-calling.md`	`reasoning_content`/`qwen3`パーサーと`qwen3_coder`ツールコール振る舞いをQwen3.6向けに更新。

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer/CI
    participant HF as HuggingFace
    participant FS as Host filesystem
    participant Compose as Docker Compose
    participant vLLM as vLLM (Qwen3.6)
    participant Client as API Client

    Dev->>HF: 一度だけモデル重みをダウンロード (hf_hub_download)
    HF-->>FS: 配置されたモデル重み
    Dev->>FS: モデルを ~/model_weights/Qwen/... に配置
    Dev->>Compose: docker compose --profile qwen36 up
    Compose->>vLLM: 起動（ピン留めイメージ、--reasoning-parser 等の引数）
    vLLM->>FS: /app/model からモデルをロード
    vLLM-->>Compose: /health == 200
    Client->>vLLM: POST /v1/chat/completions (tool_calls / reasoning_content)
    vLLM-->>Client: 応答（content, reasoning_content, tool_calls）

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Add Qwen3.5-35B-A3B-FP8 (qwen35) profile to vLLM backend #8 — vLLM プロファイル移行（qwen35→qwen36）と vLLM 設定変更が重複／競合する可能性あり。
Update TRT-LLM image from 1.3.0rc2 to 1.3.0rc3 #5 — TRT‑LLM 関連ファイル（compose/README/YAML）を変更しており、本PRで削除された箇所と直接衝突する可能性あり。
Fix vllm-qwen35: switch to cu130-nightly for SM 12.1 compat #10 — vLLM バックエンド（compose.yml, CLAUDE.md 等）のイメージ／設定変更を行っており、本PRの統合作業と密接に関連。

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: consolidating to a single vLLM profile (qwen36) with the specific model Qwen3.6-35B-A3B-FP8, which is directly reflected in the changeset.
Description check	✅ Passed	The description is directly related to the changeset, providing clear context about the consolidation to a single vLLM profile, removal of legacy backends, test results, and motivation.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/model-curation

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/thinking-mode.md (1)
65-88: ⚠️ Potential issue | 🟠 Major

バックエンド削除方針と矛盾する注記が残っています

Line 87 の Nemotron 記載と Line 65/88 の「<think> タグをクライアントで除去必須」は、同ファイル上部（Line 8-9）の --reasoning-parser qwen3 前提と整合していません。現行の単一 qwen36 運用に合わせて、条件付き説明（例: parser 無効時のみ除去）へ修正するか、不要なら削除してください。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/thinking-mode.md` around lines 65 - 88, The docs conflict: the top
assumes `--reasoning-parser qwen3` (single qwen36 usage) but the note still
mandates client-side removal of `<think>...</think>` and mentions Nemotron
(`--reasoning_parser deepseek-r1`) and `reasoning_content`; update the section
to be consistent by either (A) changing the advice to a conditional note that
explains `<think>` tag removal is only required when a reasoning parser is not
used (e.g., when not using `--reasoning-parser qwen3` or when using parsers that
do not separate `reasoning_content`), or (B) remove the Nemotron/conditional
line entirely if not relevant to the current qwen36-only deployment; reference
the `<think>` tag, `--reasoning-parser qwen3`, `--reasoning_parser deepseek-r1`,
and `reasoning_content` when making the clarifying edit.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backends/vllm/README.md`:
- Line 60: Replace the fragile, project-name-dependent docker logs invocation
(the line containing "docker logs vllm-vllm-qwen36-1 -f") with a docker
compose–based logs instruction that follows the Compose service name (e.g., use
docker compose logs -f for the vllm service) so the README uses the service name
from the compose file and works across environments; update the corresponding
line in backends/vllm/README.md to reference the compose logs command and the
vllm service instead of the fixed container name.

In `@docs/plans/2026-04-24-model-curation-design.md`:
- Around line 38-80: The fenced ASCII-art blocks (the large diagrams showing
"Before" and "After" with backends/vllm, backends/trtllm, etc.) are missing a
language identifier which triggers markdownlint MD040; update both fenced code
blocks (the one containing the "Before（現状）" diagram around the qwen.yaml/qwen36
text and the second diagram later in the file) to use a language specifier such
as ```text instead of ``` so the blocks become ```text ... ```; ensure you
change both occurrences referenced in the diff (the diagram beginning near the
"Before（現状）" header and the "After（精選後）" diagram) to silence MD040.

---

Outside diff comments:
In `@docs/thinking-mode.md`:
- Around line 65-88: The docs conflict: the top assumes `--reasoning-parser
qwen3` (single qwen36 usage) but the note still mandates client-side removal of
`<think>...</think>` and mentions Nemotron (`--reasoning_parser deepseek-r1`)
and `reasoning_content`; update the section to be consistent by either (A)
changing the advice to a conditional note that explains `<think>` tag removal is
only required when a reasoning parser is not used (e.g., when not using
`--reasoning-parser qwen3` or when using parsers that do not separate
`reasoning_content`), or (B) remove the Nemotron/conditional line entirely if
not relevant to the current qwen36-only deployment; reference the `<think>` tag,
`--reasoning-parser qwen3`, `--reasoning_parser deepseek-r1`, and
`reasoning_content` when making the clarifying edit.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f45cfb9c-1251-4968-9d1a-9c4f48d7d271

📥 Commits

Reviewing files that changed from the base of the PR and between 50bcbb6 and 90c4eab.

📒 Files selected for processing (20)

.gitignore
CLAUDE.md
README.md
backends/nim/README.md
backends/nim/compose.yml
backends/trtllm/README.md
backends/trtllm/compose.yml
backends/trtllm/nano_v3.yaml
backends/trtllm/nginx.conf
backends/trtllm/qwen.yaml
backends/trtllm/qwen_multi.yaml
backends/vllm/README.md
backends/vllm/compose.yml
backends/vllm/nginx.conf
backends/vllm/scripts/vllm-up.sh
docs/plans/2026-04-24-model-curation-design.md
docs/plans/2026-04-24-qwen36-smoke-test.md
docs/thinking-mode.md
docs/tool-calling.md
scripts/benchmark.py

💤 Files with no reviewable changes (10)

backends/trtllm/qwen_multi.yaml
backends/trtllm/nano_v3.yaml
backends/trtllm/qwen.yaml
backends/nim/compose.yml
backends/vllm/nginx.conf
backends/nim/README.md
backends/vllm/scripts/vllm-up.sh
backends/trtllm/README.md
backends/trtllm/nginx.conf
backends/trtllm/compose.yml

- docs/thinking-mode.md: 単一 qwen36 / --reasoning-parser qwen3 運用前提に合わせ、「クライアント側での処理」を書き換え。手動タグ除去は parser 未使用時のみ必要と明示し、Nemotron 固有の注記は削除。 - backends/vllm/README.md: トラブルシューティングのログ確認コマンドを docker logs (プロジェクト名依存で壊れやすい) から docker compose --profile qwen36 logs -f vllm-qwen36 に変更。 - docs/plans/2026-04-24-model-curation-design.md: ASCII-art フェンスコードブロックに text 言語指定を追加 (markdownlint MD040)。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop dependabot entries for /backends/trtllm and /backends/nim (removed in 90c4eab), and remove the legacy NGC image from compose x-common so any future service must explicitly pin its own image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

unless-stopped silently retries forever, hiding engine failures (e.g. SM120 illegal-memory-access) during 300s healthcheck start_period. on-failure:3 lets compose ps surface the failure after 3 retries, matching the project's fail-loud principle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

toku345 and others added 4 commits April 24, 2026 14:18

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread backends/vllm/README.md Outdated

Comment thread docs/plans/2026-04-24-model-curation-design.md Outdated

toku345 self-assigned this Apr 24, 2026

toku345 and others added 3 commits April 24, 2026 17:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: curate to single vLLM qwen36 profile (Qwen3.6-35B-A3B-FP8)#40

refactor: curate to single vLLM qwen36 profile (Qwen3.6-35B-A3B-FP8)#40
toku345 wants to merge 7 commits into
mainfrom
feature/model-curation

toku345 commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

toku345 commented Apr 24, 2026

Summary

Test plan

Phase 2 smoke test (results recorded in docs/plans/2026-04-24-qwen36-smoke-test.md)

Phase 3 post-deletion verification

Notes on throughput

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading