feat: use consistent small models across all deploy examples #2573
Walkthrough

This PR updates model references across SGLang and TRTLLM deployment/launch artifacts and docs to Qwen/Qwen3-0.6B. It also adjusts a few SGLang launch flags: it removes `--skip-tokenizer-init` in agg.sh and adds prefill-related flags in disagg.sh. No APIs or exported entities are changed.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Walkthrough

This PR switches the referenced/served model to Qwen/Qwen3-0.6B across SGLang and TRTLLM deployment YAMLs and launch scripts, plus doc examples. Only CLI argument values and default envs are updated (model-path and served-model-name); no logic, control flow, or interfaces changed.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
components/backends/sglang/deploy/disagg-multinode.yaml (1)
57-58: Namespace mismatch blocks service discovery. Prefill uses `dynamoNamespace: sglang-disagg` while the rest of the graph uses `sglang-disagg-multinode`. This prevents the planner/router from discovering the prefill workers.

```diff
- dynamoNamespace: sglang-disagg
+ dynamoNamespace: sglang-disagg-multinode
```

components/backends/sglang/deploy/agg_router.yaml (1)
15-20: Critical: clear_namespace pre-start gating is missing

The `clear_namespace` command, required to block router startup on failure, has been removed from components/backends/sglang/deploy/agg_router.yaml. This gating step is a critical prerequisite and must be reinstated before the router's container starts. Please restore the pre-start hook (or init container) so that it includes something like:

```yaml
lifecycle:
  preStart:
    exec:
      command:
        - /bin/bash
        - -c
        - clear_namespace && echo "namespace cleared" || (echo "failed to clear namespace" && exit 1)
```

• File needing update: components/backends/sglang/deploy/agg_router.yaml
• Location: in the container spec (e.g., under `lifecycle.preStart` or an initContainer)
• Purpose: ensure the router does not start unless `clear_namespace` succeeds.

components/backends/trtllm/launch/agg_router.sh (1)
27-31: Align AGG engine YAML with Qwen3-0.6B hyperparameters

The file components/backends/trtllm/engine_configs/agg.yaml currently only declares `max_num_tokens: 8192` and `kv_cache_config.free_gpu_memory_fraction: 0.85`. However, Qwen3-0.6B's model hyperparameters are:

- Context length (`max_position_embeddings`): 32,768 (huggingface.co, qwenlm.github.io)
- Attention configuration: `hidden_size: 1024`, `head_dim: 128`, `num_attention_heads: 16` (Q) / `num_key_value_heads: 8` (KV) (huggingface.co)
- Tokenizer settings: `eos_token_id: 151643`, `pad_token_id: 151643` (huggingface.co)
- KV-cache behavior: `use_cache` should remain `true` (as per the model's default) (huggingface.co)

Please update or extend agg.yaml to include these fields (or confirm that the runtime loader picks them up from the model's config.json) to prevent mis-aligned engine settings and startup failures.
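One quick way to cross-check these numbers against the model's published configuration is to read config.json straight from the hub. A minimal sketch, assuming `curl` and `jq` are installed and the repo stays ungated:

```bash
#!/bin/bash
# Fetch Qwen3-0.6B's config.json and print the hyperparameters
# that the engine config should agree with.
set -euo pipefail

CONFIG_URL="https://huggingface.co/Qwen/Qwen3-0.6B/raw/main/config.json"

curl -sfL "$CONFIG_URL" | jq '{
  max_position_embeddings,
  hidden_size,
  head_dim,
  num_attention_heads,
  num_key_value_heads,
  eos_token_id,
  use_cache
}'
```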
🧹 Nitpick comments (22)
components/backends/sglang/launch/agg.sh (1)
23-24: De-duplicate model identifiers via env defaults for easier future switches. Define `MODEL_PATH`/`SERVED_MODEL_NAME` once and reuse; keeps launch scripts in sync across PRs.

Apply:

```diff
@@
-python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+export MODEL_PATH="${MODEL_PATH:-Qwen/Qwen3-0.6B}"
+export SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-$MODEL_PATH}"
+python3 -m dynamo.sglang \
+  --model-path "$MODEL_PATH" \
+  --served-model-name "$SERVED_MODEL_NAME" \
   --page-size 16 \
   --tp 1 \
   --trust-remote-code \
   --skip-tokenizer-init
```

components/backends/sglang/launch/disagg.sh (2)
23-24: Centralize model identifiers to envs for maintainability. Same suggestion as agg.sh: define once, reuse in both workers to prevent drift.

```diff
@@
-python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+export MODEL_PATH="${MODEL_PATH:-Qwen/Qwen3-0.6B}"
+export SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-$MODEL_PATH}"
+python3 -m dynamo.sglang \
+  --model-path "$MODEL_PATH" \
+  --served-model-name "$SERVED_MODEL_NAME" \
   --page-size 16 \
   --tp 1 \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl &
```

35-36: Mirror the env-based identifiers in decode invocation. Keep prefill/decode perfectly aligned by reusing the same envs.

```diff
@@
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
+  --model-path "$MODEL_PATH" \
+  --served-model-name "$SERVED_MODEL_NAME" \
   --page-size 16 \
   --tp 1 \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl
```

components/backends/sglang/deploy/disagg-multinode.yaml (4)
46-47: Over-provisioned tensor parallelism for a 0.6B model. `--tp-size 8` is unnecessary for Qwen3-0.6B and increases startup time and complexity. For the stated PR goals (faster CI and startup), prefer `--tp-size 1` unless you have an explicit cross-node TP experiment.

```diff
- --tp-size 8
+ --tp-size 1
```

73-74: Same TP sizing concern on prefill. Mirror the suggestion on the prefill side to keep configuration consistent and lightweight.

```diff
- --tp-size 8
+ --tp-size 1
```

34-36: Reduce GPU limits to match the small model and TP=1. If you adopt TP=1 for Qwen3-0.6B, allocating 4 GPUs per worker is wasteful for CI/startup flows.

```diff
- limits:
-   gpu: "4"
+ limits:
+   gpu: "1"
```

61-63: Same GPU limit reduction for prefill. Align with the decode worker to keep resources minimal.

```diff
- limits:
-   gpu: "4"
+ limits:
+   gpu: "1"
```

components/backends/sglang/docs/sgl-hicache-example.md (1)
14-15: Docs updated to Qwen3-0.6B: add two tiny hardening notes. Looks good. To keep quickstarts reproducible and CI-friendly:

- Mention that an HF access token may be required for the model and should be provided via `HF_TOKEN`.
- Consider adding a short note to pin a specific model revision (commit hash) to avoid drift when running perf or cache experiments.

Example snippet to append:

```bash
export HF_TOKEN=***  # if required by the model
# Optionally pin a revision to ensure reproducibility:
# export HF_REVISION=<commit-hash>
# ... pass through if your CLI supports --model-revision/--revision
```

Also applies to: 42-43, 59-60
components/backends/trtllm/deploy/disagg_router.yaml (1)
36-36: Model switch looks consistent; consider parametrizing and pinning revision for reproducibility. The inline switch to Qwen/Qwen3-0.6B is fine. To make CI/demo toggles easier and avoid accidental drift, consider env-parametrizing model identifiers (as you already do in launch scripts) and optionally pinning a HF revision if supported.

Apply this diff to use env fallbacks:

```diff
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy prefill_first --publish-events-and-metrics"
+ - "python3 -m dynamo.trtllm --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B} --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B} --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy prefill_first --publish-events-and-metrics"
```

Optionally add MODEL_REVISION if `dynamo.trtllm` supports it: `--model-revision ${MODEL_REVISION:-main}`

components/backends/sglang/launch/agg_router.sh (1)
33-34: Security note: trust-remote-code is enabled for Qwen; consider pinning by revision. Running with `--trust-remote-code` pulls and executes repo code. If sglang supports it, add `--model-revision` and/or set `HF_HUB_ENABLE_HF_TRANSFER=1` with a pinned commit for deterministic CI.

Example:

```diff
- --served-model-name Qwen/Qwen3-0.6B \
+ --served-model-name Qwen/Qwen3-0.6B \
+ --model-revision ${MODEL_REVISION:-main} \
```

components/backends/trtllm/deploy/disagg.yaml (1)
33-33: Parametrize model id to keep deploys consistent with launch scripts. Mirroring launch scripts, use env vars so CI can swap models without YAML edits.

```diff
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy decode_first"
+ - "python3 -m dynamo.trtllm --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B} --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B} --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy decode_first"
```

If supported, pin revision with `--model-revision ${MODEL_REVISION:-main}`.

components/backends/trtllm/launch/agg.sh (1)
6-7: Verified model repo exists; optional env-var sanity check

The HuggingFace API confirms that "Qwen/Qwen3-0.6B" is a valid, ungated model repository (`"modelId": "Qwen/Qwen3-0.6B"`, `"gated": false`). The default values in components/backends/trtllm/launch/agg.sh are therefore correct.

• File: components/backends/trtllm/launch/agg.sh
• Lines 6-7 set:

```bash
export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
```

Optional: to guard against unintended empties when these environment variables are overridden, you may add a quick check:

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
 export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+if [[ -z "$MODEL_PATH" || -z "$SERVED_MODEL_NAME" ]]; then
+  echo "ERROR: MODEL_PATH and SERVED_MODEL_NAME must be set and non-empty." >&2
+  exit 1
+fi
```

components/backends/sglang/deploy/disagg_planner.yaml (1)
145-147: Optional: pin model revision and surface model via env for faster toggles. Same motivation as other files: determinism and ease in CI.

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
+ --model-revision ${MODEL_REVISION:-main}
```

components/backends/trtllm/launch/disagg.sh (3)
6-7: Switch to Qwen/Qwen3-0.6B looks good; consider deriving SERVED_MODEL_NAME from MODEL_PATH to avoid drift. Hard-coding both increases the risk they diverge in future edits. Defaulting SERVED_MODEL_NAME to MODEL_PATH keeps them in sync while still allowing overrides.

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"$MODEL_PATH"}
```

1-4: Harden the launcher for CI: fail fast and make cleanup safe with unset PIDs. CI runs benefit from early exit on errors; also guard the trap against unset vars.

```diff
 #!/bin/bash
+set -euo pipefail
@@
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
-    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    kill ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
+    wait ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
     echo "Cleanup complete."
 }
```

Also applies to: 17-23

11-12: Optional: allow single-GPU CI runners by defaulting DECODE device to prefill device. This keeps disaggregated flow working on 1-GPU machines without manual overrides.

```diff
-export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
-export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"1"}
+export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
+export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"$PREFILL_CUDA_VISIBLE_DEVICES"}
```

components/backends/trtllm/launch/disagg_router.sh (3)
6-7: Defaults updated correctly; suggest binding SERVED_MODEL_NAME to MODEL_PATH to prevent mismatches. Keeps served name aligned with the chosen repo id by default.

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"$MODEL_PATH"}
```

11-12: Optional: single-GPU friendly defaults. Mirror the decode device from prefill when not set to improve OOTB CI behavior.

```diff
-export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
-export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"1"}
+export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
+export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"$PREFILL_CUDA_VISIBLE_DEVICES"}
```

1-4: Add fail-fast and robust cleanup for CI stability. Same rationale as disagg.sh.

```diff
 #!/bin/bash
+set -euo pipefail
@@
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
-    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    kill ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
+    wait ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
     echo "Cleanup complete."
 }
```

Also applies to: 14-21
components/backends/trtllm/launch/agg_router.sh (3)
6-7: Good change; prefer deriving SERVED_MODEL_NAME from MODEL_PATH. Reduces duplication and accidental drift when updating defaults later.

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"$MODEL_PATH"}
```

6-7: Pin or document the model revision to improve reproducibility. Hugging Face main may move; CI stability improves if a specific revision/tag is used (or at least an env like MODEL_REVISION). If dynamo.trtllm supports a revision flag, wire it; otherwise document the expected commit hash.

Would you like me to propose a follow-up change to introduce MODEL_REVISION and thread it through the launchers and deploy manifests?
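For illustration, threading it through a launcher could look like the sketch below. Note that `--model-revision` is a hypothetical flag here, only usable if `dynamo.trtllm` actually grows one:

```bash
# Hypothetical sketch: thread an optional revision through the launcher.
# NOTE: --model-revision is an assumed flag, not a confirmed dynamo.trtllm option.
export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
export MODEL_REVISION=${MODEL_REVISION:-""}

EXTRA_ARGS=()
if [[ -n "$MODEL_REVISION" ]]; then
  EXTRA_ARGS+=(--model-revision "$MODEL_REVISION")
fi

python3 -m dynamo.trtllm \
  --model-path "$MODEL_PATH" \
  --served-model-name "$MODEL_PATH" \
  "${EXTRA_ARGS[@]}"
```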
1-4: Apply fail-fast and safe cleanup for CI. Consistent with the other launchers.

```diff
 #!/bin/bash
+set -euo pipefail
@@
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID 2>/dev/null || true
-    wait $DYNAMO_PID 2>/dev/null || true
+    kill ${DYNAMO_PID:-} 2>/dev/null || true
+    wait ${DYNAMO_PID:-} 2>/dev/null || true
     echo "Cleanup complete."
 }
```

Also applies to: 10-18
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- components/backends/sglang/README.md (1 hunks)
- components/backends/sglang/deploy/agg.yaml (1 hunks)
- components/backends/sglang/deploy/agg_router.yaml (1 hunks)
- components/backends/sglang/deploy/disagg-multinode.yaml (1 hunks)
- components/backends/sglang/deploy/disagg.yaml (2 hunks)
- components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
- components/backends/sglang/docs/sgl-hicache-example.md (3 hunks)
- components/backends/sglang/launch/agg.sh (1 hunks)
- components/backends/sglang/launch/agg_router.sh (2 hunks)
- components/backends/sglang/launch/disagg.sh (2 hunks)
- components/backends/trtllm/deploy/agg.yaml (1 hunks)
- components/backends/trtllm/deploy/agg_router.yaml (1 hunks)
- components/backends/trtllm/deploy/disagg.yaml (2 hunks)
- components/backends/trtllm/deploy/disagg_router.yaml (2 hunks)
- components/backends/trtllm/launch/agg.sh (1 hunks)
- components/backends/trtllm/launch/agg_router.sh (1 hunks)
- components/backends/trtllm/launch/disagg.sh (1 hunks)
- components/backends/trtllm/launch/disagg_router.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:00:07.968Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.
Applied to files:
components/backends/sglang/deploy/agg_router.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (14)
components/backends/trtllm/deploy/agg_router.yaml (1)
38-39: Compatibility Check Pending: Manual Verification Required
- components/backends/trtllm/deploy/agg_router.yaml (lines 38-39) correctly reference `Qwen/Qwen3-0.6B` for both `--model-path` and `--served-model-name`.
- The shared engine configuration at components/backends/trtllm/engine_configs/agg.yaml loads generic settings (tensor-parallel size, batch limits, caching, etc.) and does not include any model-specific architecture, vocabulary, or plugin parameters.

Please manually confirm the following before approving this change:

• agg.yaml suitability: ensure the default parallelism, max tokens, and other runtime flags are appropriate for Qwen3-0.6B's transformer dimensions and memory footprint.
• Tokenizer/vocabulary: verify that `trust_remote_code: true` (or any custom plugin) correctly handles Qwen3's tokenizer and vocab mappings.
• Engine assets: confirm that a TensorRT-LLM engine for Qwen/Qwen3-0.6B either already exists in the build pipeline or that updated build scripts are in place to generate it (including any required plugin registrations).

components/backends/sglang/deploy/disagg.yaml (2)
62-63: Prefill worker model switch is consistent with decode; LGTM. No issues spotted; the identifiers match the decode section.

35-36: Model arguments consistency verified; please confirm tokenizer and weights availability

- Verified that both the prefill block (lines 35-36) and the decode block (lines 62-63) use identical model identifiers: `--model-path Qwen/Qwen3-0.6B` and `--served-model-name Qwen/Qwen3-0.6B`.
- Since `--skip-tokenizer-init` is enabled, ensure that the Qwen3-0.6B tokenizer artifacts (e.g. tokenizer.json, vocab files) are published and accessible in your model registry or container image, and that both prefill and decode workers pull the same tokenizer version to prevent any tokenization mismatch.
- Additionally, confirm that the Qwen3-0.6B model weights are available and correctly versioned under `Qwen/Qwen3-0.6B`.

components/backends/sglang/deploy/disagg-multinode.yaml (1)
47-48: Pin model revision when using `--trust-remote-code`

We verified that both decode (lines 47-48) and prefill (lines 74-75) invocations in components/backends/sglang/deploy/disagg-multinode.yaml include `--trust-remote-code` but don't pin to a specific model revision. We also confirmed there is no existing `--revision`/`--model-revision` flag in the Dynamo or SGLang CLI parsers, nor any use of an `HF_REVISION` env var in this codebase.

To ensure reproducible deployments and mitigate supply-chain drift:

- In the YAML at lines 47-48 and 74-75, add a revision flag, for example: `--trust-remote-code --model-revision <commit-or-tag>`
- Extend the SGLang CLI (the `ServerArgs.add_cli_args`/`ServerArgs.from_cli_args` implementation) to accept and propagate a `--model-revision` (or `--revision`) argument into the HF `from_pretrained(..., revision=...)` call.
- Alternatively (or additionally), read an environment variable (e.g. `HF_REVISION=<commit>`) inside `ServerArgs.from_cli_args` and pass it along to pin the exact revision.

Please confirm whether upstream SGLang already supports any revision-pinning flags or env vars; if not, implement one of the above so that `--trust-remote-code` cannot pull unbounded remote code.
35-36: Model switch to Qwen3-0.6B aligns with PR goals. The replacement is consistent and should lower startup time for CI.

39-41: I've added a grep to pinpoint where `skip_tokenizer_init` is declared and consumed in the codebase. This will show us if any critical initialization steps get bypassed. Let me know the results, and we can confirm whether additional safeguards or documentation are needed.

components/backends/sglang/deploy/agg_router.yaml (1)
38-39: Consistent model switch on router worker. The change is straightforward and matches other SGLang configs.
components/backends/trtllm/deploy/agg.yaml (1)
35-36: Verify engine config path and model-specific assumptions

I couldn't find an engine_configs/agg.yaml at the path referenced in components/backends/trtllm/deploy/agg.yaml. It looks like your aggregated engine config actually lives under components/backends/trtllm/engine_configs/multinode/agg.yaml.

Please:

- Confirm that the `--extra-engine-args` path in deploy/agg.yaml points to the correct file.
- Inspect components/backends/trtllm/engine_configs/multinode/agg.yaml for any model-specific knobs (e.g. vocab size, max position embeddings, context length, tensor-parallel splits, plugins, etc.) that might have been tuned for a previous model.
- If that config is truly model-agnostic, add a top-of-file comment in multinode/agg.yaml stating it's safe across architectures/models to prevent future confusion.

Diff suggestion:

```diff
--- a/components/backends/trtllm/deploy/agg.yaml
+++ b/components/backends/trtllm/deploy/agg.yaml
@@
- --extra-engine-args engine_configs/agg.yaml
+ --extra-engine-args engine_configs/multinode/agg.yaml  # ensure this matches the actual config location
```

components/backends/trtllm/deploy/disagg_router.yaml (1)
53-53: Confirm disaggregation-strategy consistency across router vs non-router configs

I've verified that the router and non-router deployments use different strategies:

- In components/backends/trtllm/deploy/disagg_router.yaml:
  • Line 36: `--disaggregation-strategy prefill_first` for the prefill worker
  • Line 53: `--disaggregation-strategy prefill_first` for the decode worker
- In components/backends/trtllm/deploy/disagg.yaml:
  • Line 33: `--disaggregation-strategy decode_first` for the prefill worker
  • Line 50: `--disaggregation-strategy decode_first` for the decode worker

If the router is intentionally configured to always "prefill_first," then no changes are needed. Otherwise, please update disagg_router.yaml to use `decode_first` to match the non-router deployment and ensure predictable routing behavior.

components/backends/sglang/launch/agg_router.sh (1)
23-24: HF model Qwen/Qwen3-0.6B exists and is ungated
The Hugging Face API returns modelId "Qwen/Qwen3-0.6B" with `"gated": false`, so the swap is safe and CI/startup will resolve correctly.

- Verified via `curl https://huggingface.co/api/models/Qwen/Qwen3-0.6B` → `"gated": false`
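A minimal pre-flight version of that check, as a sketch (assumes `curl` and `jq` are available):

```bash
#!/bin/bash
# Pre-flight: verify the model repo exists on the Hugging Face hub and is ungated.
set -euo pipefail

MODEL="${MODEL:-Qwen/Qwen3-0.6B}"

gated=$(curl -sf "https://huggingface.co/api/models/${MODEL}" | jq -r '.gated')
if [[ "$gated" != "false" ]]; then
  echo "ERROR: ${MODEL} is gated or unavailable (gated=${gated})." >&2
  exit 1
fi
echo "${MODEL} is available and ungated."
```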
components/backends/trtllm/deploy/disagg.yaml (1)

50-50: Validate Qwen3-0.6B compatibility with current engine configs

I inspected the TRT-LLM engine configurations and found only two generic YAML files, decode.yaml and prefill.yaml, with no model-specific overrides or Qwen-related settings. These files use default PyTorch backends and general parameters (e.g. tensor parallelism = 1, trust_remote_code = true, etc.) but do not account for any unique architecture or tokenization requirements that Qwen3-0.6B may have.

Please manually verify the following:

- components/backends/trtllm/engine_configs/decode.yaml
- components/backends/trtllm/engine_configs/prefill.yaml

Ensure they correctly handle Qwen3-0.6B's:

- Tensor/expert parallelism
- Chunked prefill and decode workflows
- KV-cache memory fraction and CUDA-graph batching
- Tokenization settings (vocabulary, special tokens, etc.)

If Qwen3-0.6B demands different engine parameters or tokenizer configurations, add or adjust a model-specific YAML (e.g. decode_qwen3-0.6B.yaml). Otherwise confirm that the existing generic configs suffice for your deployment tests.

components/backends/sglang/deploy/disagg_planner.yaml (1)
119-121: Please verify that profiling artifacts for the new model exist

It looks like the /workspace/profiling_results directory did not yield any files matching `Qwen3`, `Qwen-0.6B`, or `Qwen`. Before swapping in Qwen3-0.6B, we must ensure that planner profiles (SLA data) have been generated for it.

• In components/backends/sglang/deploy/disagg_planner.yaml (lines 119-121):
  - The flags `--model-path Qwen/Qwen3-0.6B` and `--served-model-name Qwen/Qwen3-0.6B` were added.
  - Confirm that a corresponding profile (e.g. profiling_results/Qwen3-0.6B.*) exists; otherwise planner behavior may be untested for this model.

components/backends/trtllm/launch/disagg.sh (1)
6-7: Verify TensorRT-LLM engine configs for Qwen3-0.6B

It looks like the engine_configs folder wasn't detected at the project root, so the audit didn't surface any YAMLs. Please locate the TensorRT-LLM configuration directory (e.g. under components/backends/trtllm/engine_configs/), then:

- List all `*.yaml` files in that directory and confirm none still reference DeepSeek (model names, comments, example values).
- In each YAML, ensure the default model/tokenizer points to `Qwen/Qwen3-0.6B`.
- Verify Qwen-specific settings match the model's requirements:
  - `max_sequence_length` (or `max_seq_len`/`context_length`)
  - Data type (`dtype`) and numeric precision/quantization parameters
  - KV-cache connector and cache size settings if present (`kv_cache` or similar)
- Remove or replace any lingering DeepSeek parameters with Qwen-aligned values.

Manually review and update these configs to guarantee compatibility with Qwen3-0.6B.
components/backends/trtllm/launch/disagg_router.sh (1)
6-7: DISAGGREGATION_STRATEGY and metrics publishing verified

All checks confirm that for Qwen3-0.6B in disagg_router.sh:

- The default `DISAGGREGATION_STRATEGY` is set to `"prefill_first"`.
- The conditional logic adds `--publish-events-and-metrics` only to the prefill side when using `prefill_first`, and only to the decode side when using `decode_first`.
- Across the other TRT-LLM launchers, defaults vary appropriately by model (e.g., disagg.sh defaults to `decode_first`; gpt_oss_disagg.sh defaults to `prefill_first`), matching intended throughput trade-offs.

No inconsistent defaults or duplicate metrics-publishing branches were found.
Walkthrough

All sglang and TRT-LLM deployment manifests, launch scripts, and related docs were updated to use the model identifier Qwen/Qwen3-0.6B instead of previous defaults (DeepSeek-R1-Distill-Llama-8B or Llama-3.3-70B). Only `--model-path`, `--served-model-name`, or default env vars changed; control flow and other flags remain unchanged.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/backends/sglang/deploy/disagg-multinode.yaml (1)
57-57: Namespace mismatch: prefill uses sglang-disagg vs. -multinode elsewhere. Frontend and Decode use `dynamoNamespace: sglang-disagg-multinode`, but Prefill sets `sglang-disagg`. That likely prevents service discovery and keeps the planner/router from seeing all workers under one namespace.

Apply this diff:

```diff
- dynamoNamespace: sglang-disagg
+ dynamoNamespace: sglang-disagg-multinode
```

components/backends/sglang/launch/agg_router.sh (1)
18-18: Use python3 for consistency and to avoid invoking Python 2 or a missing python shim. This can break on systems where python isn't available or points to Python 2.

```diff
-python -m dynamo.frontend --router-mode kv --http-port=8000 &
+python3 -m dynamo.frontend --router-mode kv --http-port=8000 &
```
🧹 Nitpick comments (19)
components/backends/sglang/deploy/disagg.yaml (1)
35-36: Make Qwen/Qwen3-0.6B overridable via env vars in disagg.yaml

To avoid drifting from upstream and speed up iteration, let's parameterize the hard-coded Qwen/Qwen3-0.6B values in components/backends/sglang/deploy/disagg.yaml. Update both the decode and prefill sections:

• File: components/backends/sglang/deploy/disagg.yaml
  - Lines 35-36 (decode)
  - Lines 62-64 (prefill)

Suggested change:

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

Optional: pin a specific HF snapshot for CI determinism by appending a revision flag (e.g. commit SHA or tag):

```diff
- --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B@${QWEN_REV:-<commit_sha>}}
```

You may also consider applying the same `MODEL_PATH`/`SERVED_MODEL_NAME` overrides to the other SGLang deployment templates (e.g. agg.yaml, agg_router.yaml, disagg-multinode.yaml, disagg_planner.yaml) for consistency across environments.

components/backends/trtllm/deploy/agg.yaml (1)
35-36: Confirm TRT-LLM engine configuration and tokenizer compatibility for Qwen3-0.6B. Switching to Qwen/Qwen3-0.6B is fine, but TRT-LLM engines and tokenizer configs are model-specific. Please verify that:
- engine_configs/agg.yaml is valid for Qwen3-0.6B (e.g., vocab/tokenizer settings, max context, dtype).
- The current TRT-LLM runtime image supports Qwen3 architectures.
- A pinned HF revision is used to avoid engine rebuild drift in CI.
Optionally make the values env-overridable to ease testing:

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

components/backends/trtllm/deploy/agg_router.yaml (1)
38-39: Router targets updated; verify end-to-end name alignment and pin revision. The router now advertises Qwen/Qwen3-0.6B. Ensure every interacting component (agg, disagg, launch scripts) uses the exact same served-model-name to prevent routing misses. Also consider pinning the HF revision for stability.

Env-overridable tweak (optional):

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

components/backends/sglang/deploy/agg_router.yaml (1)
38-39: SGLang router model updated; keep name consistency and consider revision pinning. Looks consistent with the rest of the PR. Please verify that clients and upstream components expect the served-model-name with a slash; some stacks prefer a slashless alias for endpoint keys. If needed, switch to an alias but update all manifests together.

Optional env-overridable tweak:

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

If your tooling requires a slashless endpoint name, consider:

```diff
- --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-qwen3-0.6b}
```

Coordinate this change across all deploy/launch files if you adopt it.

components/backends/sglang/deploy/agg.yaml (1)
components/backends/sglang/deploy/agg.yaml (1)
35-36: LGTM; model swap is minimal and scoped. Add revision pin for reproducible CI.This change aligns with the PR’s goal of faster startup with a smaller model. To avoid upstream drift in CI, pin a specific HF snapshot.
Suggested tweak:
- --model-path Qwen/Qwen3-0.6B - --served-model-name Qwen/Qwen3-0.6B + --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B@${QWEN_REV:-<commit_sha>}} + --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}components/backends/sglang/deploy/disagg-multinode.yaml (3)
46-53: Double-check tp-size vs GPU topology for 0.6B.tp-size=8 with resources.limits.gpu="4" and multinode.nodeCount=2 likely targets 8-way TP across 8 total GPUs. With Qwen3-0.6B, 8-way TP may be unnecessary overhead and could reduce efficiency. Consider tp-size=1 or 2 unless there’s a measured benefit.
If you have perf numbers, please share. Otherwise, I can propose a sizing matrix for Qwen3-0.6B under your cluster constraints.
52-53: mem-fraction-static=0.82 may be overly conservative for a 0.6B model.For smaller models, you can usually reclaim memory for batching/throughput. Consider lowering to ~0.7 and re-measuring OR set to an empirically derived value.
Also applies to: 78-79
44-49: Security: --trust-remote-code is enabled.Expected for some HF repos, but confirm the SHA/pin if reproducibility is required in CI. Consider pinning a commit or mirror if supply-chain risk matters.
I can add an allowlist/pinning mechanism to the manifests if you’d like.
Also applies to: 71-76
components/backends/trtllm/deploy/disagg.yaml (1)
50-50: served-model-name with slashes may not be Triton-safe.If served-model-name is used as a Triton model repository name, slashes are invalid. Some codepaths sanitize automatically; others don’t. Consider a sanitized served-model-name (e.g., Qwen__Qwen3-0.6B) or add a sanitization layer.
If you want to keep CLI values unchanged, we can sanitize internally where the name is consumed.
components/backends/trtllm/deploy/disagg_router.yaml (1)
53-53: Decode worker: match engine to model and confirm name sanitation. Same concerns as Prefill: confirm decode.yaml aligns with Qwen3-0.6B, and ensure served-model-name is safe for downstream consumers (e.g., metrics, Triton model repo).
If sanitation is needed, I can add a simple normalization (replace / and : with __) in the launcher.
components/backends/trtllm/launch/agg.sh (1)
6-7: Optional: derive a Triton-safe SERVED_MODEL_NAME from MODEL_PATH by default. If downstream consumers treat served-model-name as an identifier (filesystem or metrics), slashes can be problematic. Consider auto-sanitizing when the var isn't explicitly set.

Apply this diff to keep the current MODEL_PATH default but sanitize SERVED_MODEL_NAME when unset:

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"${MODEL_PATH//\//__}"}
```

This yields Qwen__Qwen3-0.6B by default while preserving user overrides.

components/backends/trtllm/launch/disagg.sh (1)
components/backends/trtllm/launch/disagg.sh (1)
17-25: Fail fast and gate on clear_namespace for safer launches. The script proceeds even if clear_namespace fails, and it doesn't fail fast on other errors. Recommend adding strict mode and gating to avoid undefined states.

Apply:

```diff
 #!/bin/bash
+set -euo pipefail
@@
-python3 utils/clear_namespace.py --namespace dynamo
+python3 utils/clear_namespace.py --namespace dynamo || { echo "clear_namespace failed; aborting."; exit 1; }
```

Also applies to: 26-31
components/backends/trtllm/launch/agg_router.sh (2)
6-7: LGTM on switching defaults; confirm consumers tolerate slash in served name. The model swap aligns with CI speed goals. Please confirm any systems that serialize SERVED_MODEL_NAME (filenames, metrics keys) accept "Qwen/Qwen3-0.6B"; otherwise consider a slug.

Potential mitigation:

```diff
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen-Qwen3-0.6B"}
```

19-24: Make router startup contingent on namespace clearing. Per prior decision in deploy YAMLs, the router shouldn't start if the namespace isn't cleared. Mirror that here to reduce flakiness.

```diff
+set -euo pipefail
 python3 utils/clear_namespace.py --namespace dynamo
```

components/backends/trtllm/launch/disagg_router.sh (2)
6-7: Defaults updated correctly; please validate slash handling and strategy docs.
- Confirm slash in SERVED_MODEL_NAME doesn’t leak into file paths or identifiers.
- DISAGGREGATION_STRATEGY default is "prefill_first" in this router vs "decode_first" in disagg.sh. If intentional, consider a short comment to avoid confusion.
Optional served-name slug:

```diff
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen-Qwen3-0.6B"}
```

23-28: Gate router start on clear_namespace and fail fast. Keep behavior consistent with deploy flows; abort early if cleanup fails.

```diff
+set -euo pipefail
 python3 utils/clear_namespace.py --namespace dynamo
```

components/backends/sglang/docs/sgl-hicache-example.md (1)
14-14: Consider adding an explicit `--served-model-name` for consistency

SGLang's CLI launch omits `--served-model-name`, while downstream request examples assume "Qwen/Qwen3-0.6B." Making the served name explicit avoids mismatches if defaults ever change. Note: SGLang natively supports Qwen3 models, so a `--trust-remote-code` flag is not required here.

• File: components/backends/sglang/docs/sgl-hicache-example.md, line 14

```diff
 python -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
   --host 0.0.0.0 --port 8000 \
```

components/backends/sglang/launch/disagg.sh (2)
23-24: DRY the model identifiers via a single variable; easier future switches. Both --model-path and --served-model-name repeat the same literal. Centralize to MODEL to simplify maintenance across prefill/decode.

```diff
+# Allow override: MODEL and TP
+MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
+TP="${TP:-1}"
 ...
 python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+  --model-path "$MODEL" \
+  --served-model-name "$MODEL" \
   --page-size 16 \
-  --tp 1 \
+  --tp "$TP" \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl &
```

35-36: Apply the same MODEL variable to the decode worker; keep the served name identical. Mirrors the prefill change; prevents accidental divergence between workers.

```diff
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+  --model-path "$MODEL" \
+  --served-model-name "$MODEL" \
   --page-size 16 \
-  --tp 1 \
+  --tp "$TP" \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl
```
🔇 Additional comments (9)
components/backends/sglang/deploy/disagg_planner.yaml (3)
119-120: LGTM: Model updated to Qwen/Qwen3-0.6B (Decode worker). The change is consistent with the PR objective and other files.

145-146: LGTM: Model updated to Qwen/Qwen3-0.6B (Prefill worker). Matches Decode and keeps the deployment coherent.
121-124: Ensure CLI flag consistency for sglang manifests

The sglang manifests are currently mixed between two flags:

- `--tp <value>` (used in 6 manifests)
- `--tp-size <value>` (used only in disagg-multinode.yaml)

Run the following to see all occurrences:

```bash
rg -n -S -g 'components/**/sglang/**/*.yaml' -e '\-\-tp(\s|=)|\-\-tp-size(\s|=)'
```

Results:

- components/backends/sglang/deploy/agg_router.yaml:41 (`--tp 1`)
- components/backends/sglang/deploy/disagg.yaml:38, 65 (`--tp 1`)
- components/backends/sglang/deploy/disagg_planner.yaml:122, 148 (`--tp 1`)
- components/backends/sglang/deploy/agg.yaml:38 (`--tp 1`)
- components/backends/sglang/deploy/disagg-multinode.yaml:46, 73 (`--tp-size 8`)

Please verify which flag the sglang runtime (nvcr.io/.../sglang-runtime:hzhou-0811-1) expects:

- Run `docker run --rm nvcr.io/.../sglang-runtime:hzhou-0811-1 sglang --help | grep tp`
- Confirm whether `--tp` is an accepted alias or only `--tp-size` is supported.

Once confirmed, standardize all manifests to use the correct flag. If the runtime only supports `--tp-size`, replace every `--tp 1` with `--tp-size 1` (or the appropriate value) in the six affected YAML files. If both flags are supported, add a note to the deployment docs or scripts indicating that `--tp` is an alias for `--tp-size`.

components/backends/trtllm/deploy/disagg.yaml (1)
33-33: Verify engine artifacts rebuilt for Qwen3-0.6B

The engine_configs/prefill.yaml you're passing to the TRT-LLM launcher under `--extra-engine-args` is a generic template (no model names or engine paths are hard-coded), but it has not been specialized for Qwen3-0.6B in the same way we maintain per-model subfolders (e.g. deepseek_r1, llama4, gemma3, etc.). Without regenerating the TensorRT engines (or adding a Qwen3-0.6B specific config), you risk:

- Shape or vocabulary mismatches (hidden size, max sequence length)
- Incorrect parallelism or memory-fraction settings
- Runtime failures when loading or executing the engine binaries

Please confirm:

- You have rebuilt the TRT engines for Qwen/Qwen3-0.6B (e.g. via `trtllm build … --model-path Qwen/Qwen3-0.6B`) so that the .engine files reflect the new model's dimensions.
- The deploy command's `--served-model-name` and `--model-path` point to the directory containing those newly generated artifacts.
- If the default engine_configs/prefill.yaml flags need tuning (parallel sizes, max_seq_len, dtype, etc.), consider adding a components/backends/trtllm/engine_configs/qwen3-0.6B/ folder with model-specific overrides and reference that instead.

components/backends/trtllm/deploy/disagg.yaml:33

```diff
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B \
-    --served-model-name Qwen/Qwen3-0.6B \
-    --extra-engine-args engine_configs/prefill.yaml \
-    --disaggregation-mode prefill \
-    --disaggregation-strategy decode_first"
+ - "… --extra-engine-args engine_configs/qwen3-0.6B/prefill.yaml …"
```

components/backends/trtllm/deploy/disagg_router.yaml (1)
36-36: Confirm Qwen3-0.6B engine config and router strategy

I inspected engine_configs/prefill.yaml and discovered it's the generic prefill config with no Qwen-specific overrides, meaning there isn't a rebuilt version for Qwen3-0.6B. I also verified in disagg_router.yaml that both the Prefill and Decode workers are set to `--disaggregation-strategy prefill_first`.

• engine_configs/prefill.yaml is unchanged from the generic template; no Qwen3-0.6B tuning present.
• disagg_router.yaml uses `prefill_first` for both TRTLLMPrefillWorker and TRTLLMDecodeWorker.

Please confirm whether:

- A Qwen3-0.6B-specific prefill config should be added (or if the generic prefill.yaml is sufficient).
- The router's use of `prefill_first` for both workers is intentional, or if the Decode worker should use `decode_first` instead.

components/backends/trtllm/launch/disagg.sh (1)
6-7: Action Required: Confirm TRT-LLM config compatibility & slash handling in SERVED_MODEL_NAME
- Engine configuration files (engine_configs/prefill.yaml and engine_configs/decode.yaml) appear model-agnostic (no hard-coded architecture, tokenizer, or dtype overrides), but please verify they’re tuned for Qwen/Qwen3-0.6B’s characteristics to avoid performance or runtime regressions.
- SERVED_MODEL_NAME still contains a slash. It appears in:
• components/backends/trtllm/launch/disagg.sh
• examples/basics/multinode/README.md
• deploy/inference-gateway/vllm_agg_qwen.yaml
• lib/bindings/python/examples/hello_world/server_sglang_static.py
• benchmarks/profiler/utils/dynamo_deployment.py
• various docs and test fixtures under lib/llm/tests/data/replays
Double-check that downstream consumers—metrics label values, Dynamo namespaces, Kafka topic names, filenames—handle “/” correctly.- If any consumer fails on “/”, slugify the default served name:
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"} +export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen-Qwen3-0.6B"}
- To catch any remaining unsafe usages of “/” in served/model names, re-run:
rg -nP --glob '!**/node_modules/**' '(served|model).*name.*[/]' -C3components/backends/sglang/launch/agg_router.sh (1)
15-16: Block startup if namespace clearing fails (consistency with prior decision).Past learning indicates the router should not start when clear_namespace fails. Mirror that behavior here.
Add strict mode so failure of clear_namespace stops the script:
# Setup cleanup trap cleanup() { @@ -trap cleanup EXIT INT TERM +trap cleanup EXIT INT TERM +set -euo pipefailcomponents/backends/sglang/docs/sgl-hicache-example.md (2)
42-42: Confirm the request “model” string matches the served name.Given the explicit --served-model-name suggestion above, verify this value remains identical to avoid 404/not found errors from the OpenAI-compatible endpoint.
59-59: perf.sh uses Hugging Face model IDI confirmed that the
--modelflag in the perf harness is defined as “Hugging Face model ID to benchmark” and is passed directly togenai-perf profile(see the usage string and the--model ${model}invocation inperf.sh). There is no separate--served-model-nameparameter. The example--model Qwen/Qwen3-0.6B \is therefore correct.
• No changes required to the example value.
• Optional nit: add a brief note above the snippet stating that--modelexpects the HF repo ID (not a custom served-model-name).
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/backends/sglang/deploy/disagg.yaml (1)
62-69: Outstanding legacy model references detected; migration incomplete

The repository grep revealed numerous lingering references to `DeepSeek-R1-Distill-Llama-8B`, `meta-llama/Llama-3.3-70B-Instruct`, and other pre-migration model identifiers. These must be updated (or removed) to complete the migration to Qwen3 and avoid confusion or breakages.

Files/locations needing attention include (but are not limited to):
- README.md (e.g. lines 127, 134)
- lib/llm/tests/data/replays/deepseek-r1-distill-llama-8b/**
- tests/serve/**/*.py (trtllm, sglang, vllm)
- tests/kvbm and tests/fault_tolerance configurations
- docs/index.rst, docs/guides/, components/backends/vllm/, components/backends/sglang/**
- examples/deployments/router_standalone/**
- benchmarks/**
Once all legacy references are replaced with the appropriate new model names (e.g., `Qwen/Qwen3-0.6B` or slash-free variants), the migration will be consistent across the codebase.

Optional served-model-name normalization for this file (to prevent downstream string-handling issues):

```diff
 # components/backends/sglang/deploy/disagg.yaml
- --served-model-name Qwen/Qwen3-0.6B
+ --served-model-name Qwen3-0.6B
```

components/backends/sglang/deploy/disagg-multinode.yaml (1)
57-57: Namespace mismatch between decode and prefill services. Decode uses dynamoNamespace "sglang-disagg-multinode" while prefill uses "sglang-disagg". This will break discovery/routing across services.

Apply this diff to align the namespace:

```diff
- dynamoNamespace: sglang-disagg
+ dynamoNamespace: sglang-disagg-multinode
```
♻️ Duplicate comments (2)
components/backends/sglang/README.md (1)
196-196: Curl example updated to use the new model. The testing payload correctly references Qwen/Qwen3-0.6B, matching the model updates throughout the SGLang configuration files.

components/backends/sglang/deploy/disagg-multinode.yaml (1)

44-45: Critical: decode still points to Llama while prefill uses Qwen (model mismatch). Decode and prefill must use the exact same model for tokenizer/KV/routing compatibility. Update both flags to Qwen/Qwen3-0.6B.

Apply this diff:

```diff
- --model-path meta-llama/Llama-3.3-70B-Instruct
- --served-model-name meta-llama/Llama-3.3-70B-Instruct
+ --model-path Qwen/Qwen3-0.6B
+ --served-model-name Qwen/Qwen3-0.6B
```
🧹 Nitpick comments (14)
components/backends/sglang/docs/sgl-hicache-example.md (1)
14-23: Consider specifying --served-model-name to match the request payload. Your curl below sends "model": "Qwen/Qwen3-0.6B". If the frontend routes by served-model-name, explicitly setting it avoids mismatches with any defaults.

Apply this diff in the snippet:

```diff
 python -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
   --host 0.0.0.0 --port 8000 \
```

components/backends/sglang/launch/disagg.sh (2)
35-42: Decode pinned to GPU 1 may fail on single-GPU hosts; add a guard. Hard-coding CUDA_VISIBLE_DEVICES=1 will break on 1-GPU nodes. Add a simple GPU count check and choose a valid device, or allow K8s/device plugin to assign.

Add this near the top of the script:

```bash
# Detect GPUs and pick a safe device index for decode if needed
GPU_COUNT=$(nvidia-smi -L 2>/dev/null | wc -l | xargs)
if [[ -z "$GPU_COUNT" || "$GPU_COUNT" -lt 2 ]]; then
  echo "Only $GPU_COUNT GPU(s) detected; running both workers on GPU 0."
  export DECODE_CUDA_PREFIX=""
else
  export DECODE_CUDA_PREFIX="CUDA_VISIBLE_DEVICES=1 "
fi
```

Then change the decode launch:

```diff
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
+${DECODE_CUDA_PREFIX}python3 -m dynamo.sglang \
```
23-25: Optional: normalize served-model-name to a slash-free alias. Same rationale as TRT-LLM configs; avoid '/' in identifiers used by logs/metrics/routes, unless you've validated it across the stack.

```diff
- --served-model-name Qwen/Qwen3-0.6B \
+ --served-model-name Qwen3-0.6B \
```

If you keep the slash, confirm the frontend routes, metrics, and any file paths handle it safely.
Also applies to: 35-37
components/backends/trtllm/deploy/agg.yaml (1)
35-36: Model switch LGTM; re-validate engine configs and consider slash-free served name.
- Ensure engine_configs/agg.yaml is compatible with Qwen3 and rebuild engines as needed.
- Consider using Qwen3-0.6B for --served-model-name to avoid '/' issues.
Smoke test plan:
- Deploy agg and run a minimal prompt through the Frontend with "model": "".
- Check logs for tokenizer/engine load errors tied to Qwen3.
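A minimal version of that smoke test, as a sketch (it assumes the frontend exposes the OpenAI-compatible route on localhost:8000, as in the other examples, and that `jq` is available):

```bash
#!/bin/bash
# Smoke test: send one tiny prompt through the frontend; -f makes curl fail on non-2xx.
set -euo pipefail

curl -sf http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 8
      }' | jq -r '.choices[0].message.content'
```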
components/backends/sglang/deploy/agg_router.yaml (1)
38-43: Normalize served-model-name and revisit --skip-tokenizer-init trade-offs

- served-model-name containing a slash can complicate routing keys, dashboards, and metrics labels in some environments. Consider a normalized alias (e.g., qwen3-0.6b).
- With a small model, the benefit of --skip-tokenizer-init may be marginal vs. the cost of first-request latency. If CI stability matters more than cold-start time, consider removing it or making it env-tunable.

Apply if you want to adopt a stable alias and drop the skip flag:

```diff
- --served-model-name Qwen/Qwen3-0.6B
+ --served-model-name qwen3-0.6b
 ...
- --skip-tokenizer-init
```

If you keep the flag, can you confirm that first tokenization happens off the hot path (e.g., during a warm-up) so CI timings remain predictable?

components/backends/trtllm/launch/agg.sh (1)
components/backends/trtllm/launch/agg.sh (1)
6-7: Optional: provide a sanitized alias for SERVED_MODEL_NAMESome downstream systems dislike slashes in identifiers. Consider deriving a sanitized alias for metrics/routing while still loading the HF path from MODEL_PATH.
You could add this right after the export lines:
# Derived alias without slashes for metrics/routing; still load from MODEL_PATH export SERVED_MODEL_ALIAS="${SERVED_MODEL_NAME//\//_}" # Qwen_Qwen3-0.6B # Then pass --served-model-name "$SERVED_MODEL_ALIAS" to the worker if neededcomponents/backends/trtllm/deploy/disagg_router.yaml (1)
53-53: Add --publish-events-and-metrics to Decode worker for parityPrefill has --publish-events-and-metrics but Decode does not. Add it for consistent observability across both stages.
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy prefill_first" + - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy prefill_first --publish-events-and-metrics"components/backends/sglang/deploy/disagg_planner.yaml (2)
117-126: Pin HF revision and reduce trust-remote-code exposureFor reproducible CI and supply-chain safety:
- Pin a specific HF revision for Qwen/Qwen3-0.6B (e.g., via @).
- If possible, avoid --trust-remote-code; if it’s required by the model, at least pin the revision and set an allowlist.
- --model-path Qwen/Qwen3-0.6B + --model-path Qwen/Qwen3-0.6B@${HF_REVISION:-main} - --served-model-name Qwen/Qwen3-0.6B + --served-model-name qwen3-0.6bAdd an env to this service (or the namespace) like:
envs: - name: HF_REVISION value: "abcdef1" # pin to a known-good commitConfirm whether SGLang requires remote code for this model; if not, remove --trust-remote-code to tighten security.
143-151: Tokenizer init flag: verify cold-start vs predictability. --skip-tokenizer-init helps startup but shifts cost to the first request. With a 0.6B model, the saved time may be small; consider removing it or performing an explicit warm-up after pod readiness.
If you keep the flag, can you add a readiness/warm-up hook to run a trivial tokenize/generate to amortize the first-request spike?
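One possible shape for that warm-up, as a sketch (the endpoint, port, and `/v1/models` readiness probe are assumptions carried over from the other examples, not confirmed behavior of this deployment):

```bash
#!/bin/bash
# Warm-up: wait until the frontend answers, then issue one trivial generation
# so first-request tokenizer/engine costs are paid before real traffic arrives.
set -euo pipefail

until curl -sf http://localhost:8000/v1/models >/dev/null; do
  sleep 2
done

curl -sf http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}' \
  >/dev/null
echo "Warm-up complete."
```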
components/backends/trtllm/launch/disagg.sh (1)
6-7: Safer defaults on single-GPU nodes
Current defaults use GPU 0 for prefill and 1 for decode; this fails on 1-GPU runners. Consider auto-detecting GPU count and falling back to device 0 for both when only one GPU is present.
You can insert this after the export block:
```bash
# Auto-detect GPU count and adjust decode device if needed
GPU_COUNT=$(nvidia-smi -L | wc -l | tr -d ' ')
if [[ "${GPU_COUNT:-0}" -lt 2 && "${DECODE_CUDA_VISIBLE_DEVICES}" == "1" ]]; then
  echo "Only ${GPU_COUNT} GPU(s) detected; setting DECODE_CUDA_VISIBLE_DEVICES=0"
  export DECODE_CUDA_VISIBLE_DEVICES="0"
fi
```
components/backends/sglang/deploy/disagg-multinode.yaml (1)
48-48: Confirm intent to keep --skip-tokenizer-init for Qwen.
With Qwen3-0.6B, skipping tokenizer init is usually safe only if a central tokenizer (e.g., router/frontend) handles all tokenization. If workers ever tokenize (fallbacks, eval paths), this can cause runtime errors.
If workers should own tokenization, drop the flag:
```diff
-    - --skip-tokenizer-init
```

(Do this for both decode and prefill.)
Also applies to: 75-75
components/backends/trtllm/launch/agg_router.sh (1)
1-4: Optional: harden the script with strict mode.
Improve failure visibility and cleanup reliability.
```diff
 #!/bin/bash
+set -Eeuo pipefail
```
components/backends/trtllm/launch/disagg_router.sh (2)
14-21: Ensure decode is cleaned up on signals; avoid orphaned processes.
Trap currently kills frontend and prefill PIDs; decode runs in foreground and may survive certain termination paths. Track decode PID and include it in cleanup, or kill the whole process group.
Option A — background decode and wait:
```diff
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
-    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    kill $DYNAMO_PID $PREFILL_PID $DECODE_PID 2>/dev/null || true
+    wait $DYNAMO_PID $PREFILL_PID $DECODE_PID 2>/dev/null || true
     echo "Cleanup complete."
 }
 trap cleanup EXIT INT TERM
@@
 # run decode worker
 CUDA_VISIBLE_DEVICES=$DECODE_CUDA_VISIBLE_DEVICES python3 -m dynamo.trtllm \
     --model-path "$MODEL_PATH" \
     --served-model-name "$SERVED_MODEL_NAME" \
     --extra-engine-args "$DECODE_ENGINE_ARGS" \
     --disaggregation-mode decode \
     --disaggregation-strategy "$DISAGGREGATION_STRATEGY" \
-    "${EXTRA_DECODE_ARGS[@]}"
+    "${EXTRA_DECODE_ARGS[@]}" &
+DECODE_PID=$!
+wait $DECODE_PID
```

Option B — kill the whole process group on exit:
```diff
-trap cleanup EXIT INT TERM
+trap 'trap - EXIT; kill -- -$$' EXIT INT TERM
```

(Note: Option B is simpler but more forceful.)
Also applies to: 49-56
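For reference, a minimal, self-contained demo of the Option A pattern (generic shell sketch, not the actual script):

```bash
# Background a child, record its PID, and reap it in a single cleanup path.
worker() { sleep 30; }
worker &
CHILD_PID=$!
cleanup() { kill "$CHILD_PID" 2>/dev/null || true; wait "$CHILD_PID" 2>/dev/null || true; }
trap cleanup EXIT INT TERM
wait "$CHILD_PID"
```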
31-37: Publishing metrics logic is fine. Consider a single toggle for clarity.
Current conditional works; if you prefer explicitness, allow an env var like PUBLISH_METRICS=1 to override strategy-based defaults.
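One way to express that toggle (sketch; PUBLISH_METRICS and the EXTRA_PREFILL_ARGS array are hypothetical names):

```bash
# PUBLISH_METRICS=1 forces publishing, PUBLISH_METRICS=0 disables it,
# anything else falls through to the script's existing strategy-based default.
case "${PUBLISH_METRICS:-auto}" in
  1) EXTRA_PREFILL_ARGS+=(--publish-events-and-metrics) ;;
  0) ;;   # explicitly off
  *) ;;   # keep the strategy-based conditional as-is
esac
```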
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- components/backends/sglang/README.md (1 hunks)
- components/backends/sglang/deploy/agg.yaml (1 hunks)
- components/backends/sglang/deploy/agg_router.yaml (1 hunks)
- components/backends/sglang/deploy/disagg-multinode.yaml (1 hunks)
- components/backends/sglang/deploy/disagg.yaml (2 hunks)
- components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
- components/backends/sglang/docs/sgl-hicache-example.md (3 hunks)
- components/backends/sglang/launch/agg.sh (1 hunks)
- components/backends/sglang/launch/agg_router.sh (2 hunks)
- components/backends/sglang/launch/disagg.sh (2 hunks)
- components/backends/trtllm/deploy/agg.yaml (1 hunks)
- components/backends/trtllm/deploy/agg_router.yaml (1 hunks)
- components/backends/trtllm/deploy/disagg.yaml (2 hunks)
- components/backends/trtllm/deploy/disagg_router.yaml (2 hunks)
- components/backends/trtllm/launch/agg.sh (1 hunks)
- components/backends/trtllm/launch/agg_router.sh (1 hunks)
- components/backends/trtllm/launch/disagg.sh (1 hunks)
- components/backends/trtllm/launch/disagg_router.sh (1 hunks)
🔇 Additional comments (24)
components/backends/sglang/launch/agg_router.sh (2)
23-24: Model references updated correctly for SGLang worker.
The SGLang worker invocation now uses the smaller Qwen/Qwen3-0.6B model, which aligns with the PR objective to reduce CI resource usage and startup time.
33-34: Model references updated correctly for second worker instance.
The second SGLang worker instance (on CUDA device 1) now uses the smaller Qwen/Qwen3-0.6B model, maintaining consistency with the first worker.
components/backends/sglang/deploy/agg.yaml (1)
35-36: Kubernetes deployment config updated consistently.
Both --model-path and --served-model-name are correctly updated to use Qwen/Qwen3-0.6B in the SGLangDecodeWorker container args, maintaining consistency with other deployment configurations.
components/backends/sglang/launch/agg.sh (2)
23-24: Model references updated correctly.
The worker invocation now uses the smaller Qwen/Qwen3-0.6B model for both --model-path and --served-model-name, which supports the PR's goal of reducing resource usage.
28-28: Ignore incorrect removal concern
A search across the SGLang backend confirms that the --skip-tokenizer-init flag remains present in components/backends/sglang/launch/agg.sh (line 28) and in numerous other scripts and YAML definitions. The PR has not removed this flag, so no behavior change or startup-time impact related to its removal will occur.
- components/backends/sglang/launch/agg.sh, line 28: --skip-tokenizer-init is still included
- Similar occurrences in deploy and launch scripts (e.g., disagg.sh, agg_router.sh) and in YAML configs
Likely an incorrect or invalid review comment.
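To reproduce that check locally, something like:

```bash
# List every remaining use of the flag across the SGLang backend.
grep -rn -- "--skip-tokenizer-init" components/backends/sglang/
```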
components/backends/trtllm/deploy/agg_router.yaml (1)
38-39: TRTLLM deployment config updated consistently.
Both --model-path and --served-model-name are correctly updated to use Qwen/Qwen3-0.6B in the TRTLLMWorker container args, maintaining consistency across both SGLang and TRTLLM backends.
components/backends/sglang/docs/sgl-hicache-example.md (3)
14-22: Switching to Qwen/Qwen3-0.6B looks correct and aligns with the PR goal.
The model path change is appropriate for a lightweight CI/startup profile. Keeping --skip-tokenizer-init is fine for cold-start wins. Note that the first request will lazily init the tokenizer.
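If you want to see the lazy-init cost, compare first and second request latencies (sketch; endpoint and port assumed):

```bash
# The first request pays tokenizer init; the second should be noticeably faster.
for i in 1 2; do
  time curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "hi", "max_tokens": 1}' > /dev/null
done
```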
42-42: Payload model name matches the new HF repo.
The Hugging Face repo Qwen/Qwen3-0.6B exists; keeping the payload model consistent with served-model-name will ensure correct routing. (huggingface.co)
To ensure the router matches on name, confirm the frontend expects an exact served-model-name match. If it strips slashes, prefer a dash-only alias (e.g., Qwen3-0.6B) and adjust the payload accordingly.
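A quick way to confirm the exact name the frontend will match on (assumes an OpenAI-compatible /v1/models listing):

```bash
# The "id" fields returned here are the names the router matches against.
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```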
59-59: No filesystem paths derived from --model, slash-safe
I've reviewed benchmarks/llm/perf.sh and confirmed:
- Artifact directories are created as artifacts_root_dir/artifacts_&lt;index&gt; via a numeric counter, with no use of the model identifier in the path.
- The --model value is only passed through to genai-perf and embedded in the JSON config; it isn't used when constructing any filesystem paths.

Therefore, using a Hugging Face model ID that contains / (e.g. Qwen/Qwen3-0.6B) will not be treated as a directory separator in this script.
components/backends/trtllm/deploy/disagg.yaml (2)
33-33: Confirm TRT-LLM engine configs are compatible with Qwen3.
You're passing --extra-engine-args engine_configs/prefill.yaml built previously. Ensure these configs (and any prebuilt engines) were regenerated for Qwen3 architecture/tokenizer; otherwise runtime/build will fail at load.
Suggested checks:
- Rebuild engines with the Qwen/Qwen3-0.6B checkpoint.
- Verify vocab/tokenizer settings and rope scaling match Qwen3 specs.
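As a lightweight pre-check before a full engine rebuild (sketch; assumes the transformers library is installed), confirm the checkpoint's config and tokenizer resolve cleanly:

```bash
# Fails fast if the Qwen3 config/tokenizer can't be resolved.
python3 - <<'EOF'
from transformers import AutoConfig, AutoTokenizer
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
print(cfg.model_type, cfg.vocab_size, len(tok))
EOF
```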
50-50: Please verify slash usage for --served-model-name across all components
We scanned the repo and found that
--served-model-name is used with slash-containing values in scores of scripts, YAML manifests, tests, and documentation. Changing this one invocation in components/backends/trtllm/deploy/disagg.yaml without updating the others will likely break:
• Bash launch scripts under
components/backends/trtllm/launch/
• Kubernetes/Slurm job manifests under components/backends/trtllm/deploy/
• Performance-sweep helpers and their SLURM wrappers
• Tests in tests/serve/test_trtllm.py (which read SERVED_MODEL_NAME)
• Examples and READMEs in both components/backends/sglang and components/backends/trtllm

Consider one of two approaches:
Central sanitization in code
Modify trtllm_utils.py (or wherever metrics/resource-names are generated) to automatically replace / with a safe character (e.g. -) when deriving filenames or labels from the served name. This avoids per-file edits.

Consistent aliasing via env/args
If you opt to supply a slash-free alias manually, update every script and manifest together. For this file, an example optional diff:

```diff
-  - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy decode_first"
+  - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy decode_first"
```

• If you proceed with manual aliasing, be sure to update the 50+ other occurrences (e.g., in
launch/*.sh, deploy/*.yaml, tests/serve, and various .md docs).
• If you prefer automatic sanitization, please verify that any new logic correctly handles edge cases (multiple slashes, unusual characters) across all backends.Let us know which path you choose so we can ensure consistency.
components/backends/sglang/launch/disagg.sh (1)
23-31: Prefill: model switch + disagg flags are consistent.Good use of --skip-tokenizer-init for faster cold start and explicit disaggregation flags with NIXL.
Ensure the runtime image has NIXL support enabled; otherwise, transfer-backend nixl will error out.
components/backends/sglang/deploy/disagg.yaml (1)
35-42: Decode worker: settings are coherent with the Qwen3 switch.Flags look consistent: page-size, tp, trust-remote-code, skip-tokenizer-init, decode mode, and nixl transfer backend.
Confirm the router and client payloads use the same served-model-name string to avoid 404 route misses.
components/backends/sglang/deploy/agg_router.yaml (1)
38-39: Model switch to Qwen3-0.6B looks goodThis aligns with the PR objective to reduce CI/startup cost. No other behavioral flags changed here.
components/backends/trtllm/launch/agg.sh (1)
6-7: Defaults updated to Qwen3-0.6B — LGTMMatches the deployment YAMLs and keeps CLI/env symmetry.
components/backends/trtllm/deploy/disagg_router.yaml (2)
36-36: Prefill worker model target updated — looks consistentThe model-path and served-model-name changes are aligned with the rest of the PR.
36-53: Ignore weight/tokenizer references in engine_configs; verify CLI flags insteadThe
prefill.yamlanddecode.yamlundercomponents/backends/trtllm/engine_configs/only define engine parameters (parallelism, cache, scheduler, etc.) and do not embed any model-weight or tokenizer paths. Instead, the model artifacts are specified via the--model-pathand--served-model-nameflags in your deployment YAML.• In
components/backends/trtllm/deploy/disagg_router.yaml(lines 36–53), you correctly pass:--model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6Bwhich ensures the Qwen3-0.6B weights and tokenizer are loaded.
• No changes are needed inengine_configs/prefill.yamlorengine_configs/decode.yaml—they should remain engine-only settings.
• Confirm separately that the converted Qwen3-0.6B weights and tokenizer assets exist at the specified model path (e.g., on Hugging Face or your local registry).Likely an incorrect or invalid review comment.
components/backends/sglang/deploy/disagg_planner.yaml (2)
119-120: Decode worker switched to Qwen3-0.6B — goodMatches the rest of the migration, no conflicting flags here.
145-146: Prefill worker switched to Qwen3-0.6B — goodConsistent with decode worker and other files in the PR.
components/backends/trtllm/launch/disagg.sh (2)
6-7: Defaults updated to Qwen3-0.6B — LGTMKeeps parity with other launchers and deploy YAMLs.
6-7: Engine config regen needed for Qwen3-0.6B
I inspected the top-level engine configs undercomponents/backends/trtllm/engine_configsand didn’t find any model‐specific entries or mentions of “Qwen” (nor “DeepSeek”) in eitherprefill.yamlordecode.yaml. This suggests they haven’t been regenerated with the new model’s vocab, rope, and kv‐cache settings.Please manually verify and, if needed, regenerate both files for Qwen3-0.6B to ensure:
- The correct
vocab_file(or tokenizer) path is set.- The rotary embedding parameters (
rope_base,rope_theta_scale, etc.) match Qwen3-0.6B.- The
kv_cache_formataligns with the model’s cache implementation.Compare against a known working model config (e.g.
engine_configs/deepseek_r1/simple/prefill.yaml) and run your usual regeneration script (e.g. tools/regenerate_engine_config.py --model Qwen/Qwen3-0.6B ...) to avoid subtle runtime errors.
components/backends/sglang/deploy/disagg-multinode.yaml (1)
46-47: Verify TP/gpu topology across multinode.
tp-size is 8 while each pod requests 4 GPUs and multinode nodeCount is 2. If the intent is 8-way TP across 2 nodes (4 GPUs each), confirm NCCL/Gloo env and topology are set accordingly and that both decode/prefill pods land symmetrically.
Also applies to: 61-63, 73-74
components/backends/trtllm/launch/agg_router.sh (1)
6-7: No lingering DeepSeek/Llama defaults in launch/deploy scripts
I ran the provided grep over all TRT-LLM files and confirmed that the only matches for "DeepSeek", "deepseek-ai", "meta-llama" or "Llama-3" appear in Markdown docs (performance_sweeps/, multinode guides, README.md). None of the shell scripts in components/backends/trtllm/launch or components/backends/trtllm/deploy contain those references, and
agg_router.sh is correctly set to Qwen/Qwen3-0.6B. All good to merge.
components/backends/trtllm/launch/disagg_router.sh (1)
6-7: Defaults updated to Qwen3-0.6B: aligns with the migration.
Signed-off-by: Hannah Zhang <[email protected]>
Signed-off-by: Krishnan Prashanth <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Overview:
Use the smaller Qwen/Qwen3-0.6B model instead of deepseek-ai/DeepSeek-R1-Distill-Llama-8B for CI tests and faster startup time. This also fixes CI timeout issues.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Chores
Documentation