feat: use consistent small models across all deploy examples #2573
Walkthrough

This PR updates model references across SGLang and TRTLLM deployment/launch artifacts and docs to Qwen/Qwen3-0.6B. It also adjusts a few SGLang launch flags: it removes `--skip-tokenizer-init` in agg.sh and adds prefill-related flags in disagg.sh. No APIs or exported entities are changed.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Walkthrough

This PR switches the referenced/served model to Qwen/Qwen3-0.6B across SGLang and TRTLLM deployment YAMLs and launch scripts, plus doc examples. Only CLI argument values and default envs are updated (model-path and served-model-name); no logic, control flow, or interfaces changed.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
components/backends/sglang/deploy/disagg-multinode.yaml (1)
57-58: Namespace mismatch blocks service discovery. Prefill uses `dynamoNamespace: sglang-disagg` while the rest of the graph uses `sglang-disagg-multinode`. This prevents the planner/router from discovering the prefill workers.

```diff
- dynamoNamespace: sglang-disagg
+ dynamoNamespace: sglang-disagg-multinode
```

components/backends/sglang/deploy/agg_router.yaml (1)
15-20: Critical: clear_namespace pre-start gating is missing

The `clear_namespace` command, required to block router startup on failure, has been removed from components/backends/sglang/deploy/agg_router.yaml. This gating step is a critical prerequisite and must be reinstated before the router's container starts. Please restore the pre-start hook (or init container) so that it includes something like:

```yaml
lifecycle:
  preStart:
    exec:
      command:
        - /bin/bash
        - -c
        - clear_namespace && echo "namespace cleared" || (echo "failed to clear namespace" && exit 1)
```

• File needing update: components/backends/sglang/deploy/agg_router.yaml
• Location: in the container spec (e.g., under `lifecycle.preStart` or an initContainer)
• Purpose: ensure the router does not start unless `clear_namespace` succeeds.

components/backends/trtllm/launch/agg_router.sh (1)
27-31: Align AGG engine YAML with Qwen3-0.6B hyperparameters

The file components/backends/trtllm/engine_configs/agg.yaml currently only declares `max_num_tokens: 8192` and `kv_cache_config.free_gpu_memory_fraction: 0.85`. However, Qwen3-0.6B's model hyperparameters are:

- Context length (`max_position_embeddings`): 32,768 (huggingface.co, qwenlm.github.io)
- Attention configuration: `hidden_size: 1024`, `head_dim: 128`, `num_attention_heads: 16` (Q) / `num_key_value_heads: 8` (KV) (huggingface.co)
- Tokenizer settings: `eos_token_id: 151643`, `pad_token_id: 151643` (huggingface.co)
- KV-cache behavior: `use_cache` should remain `true` (as per the model's default) (huggingface.co)

Please update or extend agg.yaml to include these fields (or confirm that the runtime loader picks them up from the model's config.json) to prevent mis-aligned engine settings and startup failures.
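One quick way to cross-check these numbers against the model's published configuration is to read config.json straight from the hub. A minimal sketch, assuming `curl` and `jq` are installed and the repo stays ungated:

```bash
#!/bin/bash
# Fetch Qwen3-0.6B's config.json and print the hyperparameters
# that the engine config should agree with.
set -euo pipefail

CONFIG_URL="https://huggingface.co/Qwen/Qwen3-0.6B/raw/main/config.json"

curl -sfL "$CONFIG_URL" | jq '{
  max_position_embeddings,
  hidden_size,
  head_dim,
  num_attention_heads,
  num_key_value_heads,
  eos_token_id,
  use_cache
}'
```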
🧹 Nitpick comments (22)
components/backends/sglang/launch/agg.sh (1)
23-24: De-duplicate model identifiers via env defaults for easier future switches. Define `MODEL_PATH`/`SERVED_MODEL_NAME` once and reuse; keeps launch scripts in sync across PRs.

Apply:

```diff
@@
-python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+export MODEL_PATH="${MODEL_PATH:-Qwen/Qwen3-0.6B}"
+export SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-$MODEL_PATH}"
+python3 -m dynamo.sglang \
+  --model-path "$MODEL_PATH" \
+  --served-model-name "$SERVED_MODEL_NAME" \
   --page-size 16 \
   --tp 1 \
   --trust-remote-code \
   --skip-tokenizer-init
```

components/backends/sglang/launch/disagg.sh (2)
23-24: Centralize model identifiers to envs for maintainability. Same suggestion as agg.sh: define once, reuse in both workers to prevent drift.

```diff
@@
-python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+export MODEL_PATH="${MODEL_PATH:-Qwen/Qwen3-0.6B}"
+export SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-$MODEL_PATH}"
+python3 -m dynamo.sglang \
+  --model-path "$MODEL_PATH" \
+  --served-model-name "$SERVED_MODEL_NAME" \
   --page-size 16 \
   --tp 1 \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl &
```

35-36: Mirror the env-based identifiers in decode invocation. Keep prefill/decode perfectly aligned by reusing the same envs.

```diff
@@
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
+  --model-path "$MODEL_PATH" \
+  --served-model-name "$SERVED_MODEL_NAME" \
   --page-size 16 \
   --tp 1 \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl
```

components/backends/sglang/deploy/disagg-multinode.yaml (4)
46-47: Over-provisioned tensor parallelism for a 0.6B model. `--tp-size 8` is unnecessary for Qwen3-0.6B and increases startup time and complexity. For the stated PR goals (faster CI and startup), prefer `--tp-size 1` unless you have an explicit cross-node TP experiment.

```diff
- --tp-size 8
+ --tp-size 1
```

73-74: Same TP sizing concern on prefill. Mirror the suggestion on the prefill side to keep configuration consistent and lightweight.

```diff
- --tp-size 8
+ --tp-size 1
```

34-36: Reduce GPU limits to match the small model and TP=1. If you adopt TP=1 for Qwen3-0.6B, allocating 4 GPUs per worker is wasteful for CI/startup flows.

```diff
- limits:
-   gpu: "4"
+ limits:
+   gpu: "1"
```

61-63: Same GPU limit reduction for prefill. Align with the decode worker to keep resources minimal.

```diff
- limits:
-   gpu: "4"
+ limits:
+   gpu: "1"
```

components/backends/sglang/docs/sgl-hicache-example.md (1)
14-15: Docs updated to Qwen3-0.6B: add two tiny hardening notes. Looks good. To keep quickstarts reproducible and CI-friendly:

- Mention that an HF access token may be required for the model and should be provided via `HF_TOKEN`.
- Consider adding a short note to pin a specific model revision (commit hash) to avoid drift when running perf or cache experiments.

Example snippet to append:

```bash
export HF_TOKEN=***  # if required by the model
# Optionally pin a revision to ensure reproducibility:
# export HF_REVISION=<commit-hash>
# ... pass through if your CLI supports --model-revision/--revision
```

Also applies to: 42-43, 59-60
components/backends/trtllm/deploy/disagg_router.yaml (1)
36-36: Model switch looks consistent; consider parametrizing and pinning revision for reproducibility. The inline switch to Qwen/Qwen3-0.6B is fine. To make CI/demo toggles easier and avoid accidental drift, consider env-parametrizing model identifiers (as you already do in launch scripts) and optionally pinning a HF revision if supported.

Apply this diff to use env fallbacks:

```diff
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy prefill_first --publish-events-and-metrics"
+ - "python3 -m dynamo.trtllm --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B} --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B} --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy prefill_first --publish-events-and-metrics"
```

Optionally add MODEL_REVISION if `dynamo.trtllm` supports it: `--model-revision ${MODEL_REVISION:-main}`

components/backends/sglang/launch/agg_router.sh (1)
33-34: Security note: trust-remote-code is enabled for Qwen; consider pinning by revision. Running with `--trust-remote-code` pulls and executes repo code. If sglang supports it, add `--model-revision` and/or set `HF_HUB_ENABLE_HF_TRANSFER=1` with a pinned commit for deterministic CI.

Example:

```diff
- --served-model-name Qwen/Qwen3-0.6B \
+ --served-model-name Qwen/Qwen3-0.6B \
+ --model-revision ${MODEL_REVISION:-main} \
```

components/backends/trtllm/deploy/disagg.yaml (1)
33-33: Parametrize model id to keep deploys consistent with launch scripts. Mirroring launch scripts, use env vars so CI can swap models without YAML edits.

```diff
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy decode_first"
+ - "python3 -m dynamo.trtllm --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B} --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B} --extra-engine-args engine_configs/prefill.yaml --disaggregation-mode prefill --disaggregation-strategy decode_first"
```

If supported, pin revision with `--model-revision ${MODEL_REVISION:-main}`.

components/backends/trtllm/launch/agg.sh (1)
6-7: Verified model repo exists; optional env-var sanity check

The HuggingFace API confirms that "Qwen/Qwen3-0.6B" is a valid, ungated model repository (`"modelId": "Qwen/Qwen3-0.6B"`, `"gated": false`). The default values in components/backends/trtllm/launch/agg.sh are therefore correct.

• File: components/backends/trtllm/launch/agg.sh
• Lines 6-7 set:

```bash
export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
```

Optional: to guard against unintended empties when these environment variables are overridden, you may add a quick check:

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
 export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+if [[ -z "$MODEL_PATH" || -z "$SERVED_MODEL_NAME" ]]; then
+  echo "ERROR: MODEL_PATH and SERVED_MODEL_NAME must be set and non-empty." >&2
+  exit 1
+fi
```

components/backends/sglang/deploy/disagg_planner.yaml (1)
145-147: Optional: pin model revision and surface model via env for faster toggles. Same motivation as other files: determinism and ease in CI.

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
+ --model-revision ${MODEL_REVISION:-main}
```

components/backends/trtllm/launch/disagg.sh (3)
6-7: Switch to Qwen/Qwen3-0.6B looks good; consider deriving SERVED_MODEL_NAME from MODEL_PATH to avoid drift. Hard-coding both increases the risk they diverge in future edits. Defaulting SERVED_MODEL_NAME to MODEL_PATH keeps them in sync while still allowing overrides.

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"$MODEL_PATH"}
```

1-4: Harden the launcher for CI: fail fast and make cleanup safe with unset PIDs. CI runs benefit from early exit on errors; also guard the trap against unset vars.

```diff
 #!/bin/bash
+set -euo pipefail
@@
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
-    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    kill ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
+    wait ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
     echo "Cleanup complete."
 }
```

Also applies to: 17-23

11-12: Optional: allow single-GPU CI runners by defaulting DECODE device to prefill device. This keeps disaggregated flow working on 1-GPU machines without manual overrides.

```diff
-export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
-export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"1"}
+export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
+export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"$PREFILL_CUDA_VISIBLE_DEVICES"}
```

components/backends/trtllm/launch/disagg_router.sh (3)
6-7: Defaults updated correctly; suggest binding SERVED_MODEL_NAME to MODEL_PATH to prevent mismatches. Keeps served name aligned with the chosen repo id by default.

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"$MODEL_PATH"}
```

11-12: Optional: single-GPU friendly defaults. Mirror the decode device from prefill when not set to improve OOTB CI behavior.

```diff
-export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
-export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"1"}
+export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
+export DECODE_CUDA_VISIBLE_DEVICES=${DECODE_CUDA_VISIBLE_DEVICES:-"$PREFILL_CUDA_VISIBLE_DEVICES"}
```

1-4: Add fail-fast and robust cleanup for CI stability. Same rationale as disagg.sh.

```diff
 #!/bin/bash
+set -euo pipefail
@@
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
-    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    kill ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
+    wait ${DYNAMO_PID:-} ${PREFILL_PID:-} 2>/dev/null || true
     echo "Cleanup complete."
 }
```

Also applies to: 14-21
components/backends/trtllm/launch/agg_router.sh (3)
6-7: Good change; prefer deriving SERVED_MODEL_NAME from MODEL_PATH. Reduces duplication and accidental drift when updating defaults later.

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"$MODEL_PATH"}
```

6-7: Pin or document the model revision to improve reproducibility. Hugging Face main may move; CI stability improves if a specific revision/tag is used (or at least an env like MODEL_REVISION). If dynamo.trtllm supports a revision flag, wire it; otherwise document the expected commit hash.

Would you like me to propose a follow-up change to introduce MODEL_REVISION and thread it through the launchers and deploy manifests?
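For illustration, threading it through a launcher could look like the sketch below. Note that `--model-revision` is a hypothetical flag here, only usable if `dynamo.trtllm` actually grows one:

```bash
# Hypothetical sketch: thread an optional revision through the launcher.
# NOTE: --model-revision is an assumed flag, not a confirmed dynamo.trtllm option.
export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
export MODEL_REVISION=${MODEL_REVISION:-""}

EXTRA_ARGS=()
if [[ -n "$MODEL_REVISION" ]]; then
  EXTRA_ARGS+=(--model-revision "$MODEL_REVISION")
fi

python3 -m dynamo.trtllm \
  --model-path "$MODEL_PATH" \
  --served-model-name "$MODEL_PATH" \
  "${EXTRA_ARGS[@]}"
```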
1-4: Apply fail-fast and safe cleanup for CI. Consistent with the other launchers.

```diff
 #!/bin/bash
+set -euo pipefail
@@
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID 2>/dev/null || true
-    wait $DYNAMO_PID 2>/dev/null || true
+    kill ${DYNAMO_PID:-} 2>/dev/null || true
+    wait ${DYNAMO_PID:-} 2>/dev/null || true
     echo "Cleanup complete."
 }
```

Also applies to: 10-18
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- components/backends/sglang/README.md (1 hunks)
- components/backends/sglang/deploy/agg.yaml (1 hunks)
- components/backends/sglang/deploy/agg_router.yaml (1 hunks)
- components/backends/sglang/deploy/disagg-multinode.yaml (1 hunks)
- components/backends/sglang/deploy/disagg.yaml (2 hunks)
- components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
- components/backends/sglang/docs/sgl-hicache-example.md (3 hunks)
- components/backends/sglang/launch/agg.sh (1 hunks)
- components/backends/sglang/launch/agg_router.sh (2 hunks)
- components/backends/sglang/launch/disagg.sh (2 hunks)
- components/backends/trtllm/deploy/agg.yaml (1 hunks)
- components/backends/trtllm/deploy/agg_router.yaml (1 hunks)
- components/backends/trtllm/deploy/disagg.yaml (2 hunks)
- components/backends/trtllm/deploy/disagg_router.yaml (2 hunks)
- components/backends/trtllm/launch/agg.sh (1 hunks)
- components/backends/trtllm/launch/agg_router.sh (1 hunks)
- components/backends/trtllm/launch/disagg.sh (1 hunks)
- components/backends/trtllm/launch/disagg_router.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:00:07.968Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.
Applied to files:
components/backends/sglang/deploy/agg_router.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (14)
components/backends/trtllm/deploy/agg_router.yaml (1)
38-39: Compatibility Check Pending: Manual Verification Required
- components/backends/trtllm/deploy/agg_router.yaml (lines 38-39) correctly reference `Qwen/Qwen3-0.6B` for both `--model-path` and `--served-model-name`.
- The shared engine configuration at components/backends/trtllm/engine_configs/agg.yaml loads generic settings (tensor-parallel size, batch limits, caching, etc.) and does not include any model-specific architecture, vocabulary, or plugin parameters.

Please manually confirm the following before approving this change:

• agg.yaml suitability: ensure the default parallelism, max tokens, and other runtime flags are appropriate for Qwen3-0.6B's transformer dimensions and memory footprint.
• Tokenizer/vocabulary: verify that `trust_remote_code: true` (or any custom plugin) correctly handles Qwen3's tokenizer and vocab mappings.
• Engine assets: confirm that a TensorRT-LLM engine for Qwen/Qwen3-0.6B either already exists in the build pipeline or that updated build scripts are in place to generate it (including any required plugin registrations).

components/backends/sglang/deploy/disagg.yaml (2)
62-63: Prefill worker model switch is consistent with decode; LGTM. No issues spotted; the identifiers match the decode section.

35-36: Model arguments consistency verified; please confirm tokenizer and weights availability

- Verified that both the prefill block (lines 35-36) and the decode block (lines 62-63) use identical model identifiers: `--model-path Qwen/Qwen3-0.6B` and `--served-model-name Qwen/Qwen3-0.6B`.
- Since `--skip-tokenizer-init` is enabled, ensure that the Qwen3-0.6B tokenizer artifacts (e.g. tokenizer.json, vocab files) are published and accessible in your model registry or container image, and that both prefill and decode workers pull the same tokenizer version to prevent any tokenization mismatch.
- Additionally, confirm that the Qwen3-0.6B model weights are available and correctly versioned under `Qwen/Qwen3-0.6B`.

components/backends/sglang/deploy/disagg-multinode.yaml (1)
47-48: Pin model revision when using `--trust-remote-code`

We verified that both decode (lines 47-48) and prefill (lines 74-75) invocations in components/backends/sglang/deploy/disagg-multinode.yaml include `--trust-remote-code` but don't pin to a specific model revision. We also confirmed there is no existing `--revision`/`--model-revision` flag in the Dynamo or SGLang CLI parsers, nor any use of an `HF_REVISION` env var in this codebase.

To ensure reproducible deployments and mitigate supply-chain drift:

- In the YAML at lines 47-48 and 74-75, add a revision flag, for example: `--trust-remote-code --model-revision <commit-or-tag>`
- Extend the SGLang CLI (the `ServerArgs.add_cli_args`/`ServerArgs.from_cli_args` implementation) to accept and propagate a `--model-revision` (or `--revision`) argument into the HF `from_pretrained(..., revision=...)` call.
- Alternatively (or additionally), read an environment variable (e.g. `HF_REVISION=<commit>`) inside `ServerArgs.from_cli_args` and pass it along to pin the exact revision.

Please confirm whether upstream SGLang already supports any revision-pinning flags or env vars; if not, implement one of the above so that `--trust-remote-code` cannot pull unbounded remote code.
35-36: Model switch to Qwen3-0.6B aligns with PR goals. The replacement is consistent and should lower startup time for CI.

39-41: I've added a grep to pinpoint where `skip_tokenizer_init` is declared and consumed in the codebase. This will show us if any critical initialization steps get bypassed. Let me know the results, and we can confirm whether additional safeguards or documentation are needed.

components/backends/sglang/deploy/agg_router.yaml (1)
38-39: Consistent model switch on router worker. The change is straightforward and matches other SGLang configs.
components/backends/trtllm/deploy/agg.yaml (1)
35-36: Verify engine config path and model-specific assumptions

I couldn't find an engine_configs/agg.yaml at the path referenced in components/backends/trtllm/deploy/agg.yaml. It looks like your aggregated engine config actually lives under components/backends/trtllm/engine_configs/multinode/agg.yaml.

Please:

- Confirm that the `--extra-engine-args` path in deploy/agg.yaml points to the correct file.
- Inspect components/backends/trtllm/engine_configs/multinode/agg.yaml for any model-specific knobs (e.g. vocab size, max position embeddings, context length, tensor-parallel splits, plugins, etc.) that might have been tuned for a previous model.
- If that config is truly model-agnostic, add a top-of-file comment in multinode/agg.yaml stating it's safe across architectures/models to prevent future confusion.

Diff suggestion:

```diff
--- a/components/backends/trtllm/deploy/agg.yaml
+++ b/components/backends/trtllm/deploy/agg.yaml
@@
- --extra-engine-args engine_configs/agg.yaml
+ --extra-engine-args engine_configs/multinode/agg.yaml  # ensure this matches the actual config location
```

components/backends/trtllm/deploy/disagg_router.yaml (1)
53-53: Confirm disaggregation-strategy consistency across router vs non-router configs

I've verified that the router and non-router deployments use different strategies:

- In components/backends/trtllm/deploy/disagg_router.yaml:
  • Line 36: `--disaggregation-strategy prefill_first` for the prefill worker
  • Line 53: `--disaggregation-strategy prefill_first` for the decode worker
- In components/backends/trtllm/deploy/disagg.yaml:
  • Line 33: `--disaggregation-strategy decode_first` for the prefill worker
  • Line 50: `--disaggregation-strategy decode_first` for the decode worker

If the router is intentionally configured to always "prefill_first," then no changes are needed. Otherwise, please update disagg_router.yaml to use `decode_first` to match the non-router deployment and ensure predictable routing behavior.

components/backends/sglang/launch/agg_router.sh (1)
23-24: HF model Qwen/Qwen3-0.6B exists and is ungated
The Hugging Face API returns modelId "Qwen/Qwen3-0.6B" with `"gated": false`, so the swap is safe and CI/startup will resolve correctly.

- Verified via `curl https://huggingface.co/api/models/Qwen/Qwen3-0.6B` → `"gated": false`
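A minimal pre-flight version of that check, as a sketch (assumes `curl` and `jq` are available):

```bash
#!/bin/bash
# Pre-flight: verify the model repo exists on the Hugging Face hub and is ungated.
set -euo pipefail

MODEL="${MODEL:-Qwen/Qwen3-0.6B}"

gated=$(curl -sf "https://huggingface.co/api/models/${MODEL}" | jq -r '.gated')
if [[ "$gated" != "false" ]]; then
  echo "ERROR: ${MODEL} is gated or unavailable (gated=${gated})." >&2
  exit 1
fi
echo "${MODEL} is available and ungated."
```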
components/backends/trtllm/deploy/disagg.yaml (1)

50-50: Validate Qwen3-0.6B compatibility with current engine configs

I inspected the TRT-LLM engine configurations and found only two generic YAML files, decode.yaml and prefill.yaml, with no model-specific overrides or Qwen-related settings. These files use default PyTorch backends and general parameters (e.g. tensor parallelism = 1, trust_remote_code = true, etc.) but do not account for any unique architecture or tokenization requirements that Qwen3-0.6B may have.

Please manually verify the following:

- components/backends/trtllm/engine_configs/decode.yaml
- components/backends/trtllm/engine_configs/prefill.yaml

Ensure they correctly handle Qwen3-0.6B's:

- Tensor/expert parallelism
- Chunked prefill and decode workflows
- KV-cache memory fraction and CUDA-graph batching
- Tokenization settings (vocabulary, special tokens, etc.)

If Qwen3-0.6B demands different engine parameters or tokenizer configurations, add or adjust a model-specific YAML (e.g. decode_qwen3-0.6B.yaml). Otherwise confirm that the existing generic configs suffice for your deployment tests.

components/backends/sglang/deploy/disagg_planner.yaml (1)
119-121: Please verify that profiling artifacts for the new model exist

It looks like the /workspace/profiling_results directory did not yield any files matching `Qwen3`, `Qwen-0.6B`, or `Qwen`. Before swapping in Qwen3-0.6B, we must ensure that planner profiles (SLA data) have been generated for it.

• In components/backends/sglang/deploy/disagg_planner.yaml (lines 119-121):
  - The flags `--model-path Qwen/Qwen3-0.6B` and `--served-model-name Qwen/Qwen3-0.6B` were added.
  - Confirm that a corresponding profile (e.g. profiling_results/Qwen3-0.6B.*) exists; otherwise planner behavior may be untested for this model.

components/backends/trtllm/launch/disagg.sh (1)
6-7: Verify TensorRT-LLM engine configs for Qwen3-0.6B

It looks like the engine_configs folder wasn't detected at the project root, so the audit didn't surface any YAMLs. Please locate the TensorRT-LLM configuration directory (e.g. under components/backends/trtllm/engine_configs/), then:

- List all `*.yaml` files in that directory and confirm none still reference DeepSeek (model names, comments, example values).
- In each YAML, ensure the default model/tokenizer points to `Qwen/Qwen3-0.6B`.
- Verify Qwen-specific settings match the model's requirements:
  - `max_sequence_length` (or `max_seq_len`/`context_length`)
  - Data type (`dtype`) and numeric precision/quantization parameters
  - KV-cache connector and cache size settings if present (`kv_cache` or similar)
- Remove or replace any lingering DeepSeek parameters with Qwen-aligned values.

Manually review and update these configs to guarantee compatibility with Qwen3-0.6B.
components/backends/trtllm/launch/disagg_router.sh (1)
6-7: DISAGGREGATION_STRATEGY and metrics publishing verified

All checks confirm that for Qwen3-0.6B in disagg_router.sh:

- The default `DISAGGREGATION_STRATEGY` is set to `"prefill_first"`.
- The conditional logic adds `--publish-events-and-metrics` only to the prefill side when using `prefill_first`, and only to the decode side when using `decode_first`.
- Across the other TRT-LLM launchers, defaults vary appropriately by model (e.g., disagg.sh defaults to `decode_first`; gpt_oss_disagg.sh defaults to `prefill_first`), matching intended throughput trade-offs.

No inconsistent defaults or duplicate metrics-publishing branches were found.
Walkthrough

All sglang and TRT-LLM deployment manifests, launch scripts, and related docs were updated to use the model identifier Qwen/Qwen3-0.6B instead of previous defaults (DeepSeek-R1-Distill-Llama-8B or Llama-3.3-70B). Only `--model-path`, `--served-model-name`, or default env vars changed; control flow and other flags remain unchanged.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/backends/sglang/deploy/disagg-multinode.yaml (1)
57-57: Namespace mismatch: prefill uses sglang-disagg vs. -multinode elsewhere. Frontend and Decode use `dynamoNamespace: sglang-disagg-multinode`, but Prefill sets `sglang-disagg`. That likely prevents service discovery and keeps the planner/router from seeing all workers under one namespace.

Apply this diff:

```diff
- dynamoNamespace: sglang-disagg
+ dynamoNamespace: sglang-disagg-multinode
```

components/backends/sglang/launch/agg_router.sh (1)
18-18: Use python3 for consistency and to avoid invoking Python 2 or a missing python shim. This can break on systems where python isn't available or points to Python 2.

```diff
-python -m dynamo.frontend --router-mode kv --http-port=8000 &
+python3 -m dynamo.frontend --router-mode kv --http-port=8000 &
```
🧹 Nitpick comments (19)
components/backends/sglang/deploy/disagg.yaml (1)
35-36: Make Qwen/Qwen3-0.6B overridable via env vars in disagg.yaml

To avoid drifting from upstream and speed up iteration, let's parameterize the hard-coded Qwen/Qwen3-0.6B values in components/backends/sglang/deploy/disagg.yaml. Update both the decode and prefill sections:

• File: components/backends/sglang/deploy/disagg.yaml
  - Lines 35-36 (decode)
  - Lines 62-64 (prefill)

Suggested change:

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

Optional: pin a specific HF snapshot for CI determinism by appending a revision flag (e.g. commit SHA or tag):

```diff
- --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B@${QWEN_REV:-<commit_sha>}}
```

You may also consider applying the same `MODEL_PATH`/`SERVED_MODEL_NAME` overrides to the other SGLang deployment templates (e.g. agg.yaml, agg_router.yaml, disagg-multinode.yaml, disagg_planner.yaml) for consistency across environments.

components/backends/trtllm/deploy/agg.yaml (1)
35-36: Confirm TRT-LLM engine configuration and tokenizer compatibility for Qwen3-0.6B. Switching to Qwen/Qwen3-0.6B is fine, but TRT-LLM engines and tokenizer configs are model-specific. Please verify that:
- engine_configs/agg.yaml is valid for Qwen3-0.6B (e.g., vocab/tokenizer settings, max context, dtype).
- The current TRT-LLM runtime image supports Qwen3 architectures.
- A pinned HF revision is used to avoid engine rebuild drift in CI.
Optionally make the values env-overridable to ease testing:

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

components/backends/trtllm/deploy/agg_router.yaml (1)
38-39: Router targets updated; verify end-to-end name alignment and pin revision. The router now advertises Qwen/Qwen3-0.6B. Ensure every interacting component (agg, disagg, launch scripts) uses the exact same served-model-name to prevent routing misses. Also consider pinning the HF revision for stability.

Env-overridable tweak (optional):

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

components/backends/sglang/deploy/agg_router.yaml (1)
38-39: SGLang router model updated; keep name consistency and consider revision pinning. Looks consistent with the rest of the PR. Please verify that clients and upstream components expect the served-model-name with a slash; some stacks prefer a slashless alias for endpoint keys. If needed, switch to an alias but update all manifests together.

Optional env-overridable tweak:

```diff
- --model-path Qwen/Qwen3-0.6B
- --served-model-name Qwen/Qwen3-0.6B
+ --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
```

If your tooling requires a slashless endpoint name, consider:

```diff
- --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}
+ --served-model-name ${SERVED_MODEL_NAME:-qwen3-0.6b}
```

Coordinate this change across all deploy/launch files if you adopt it.

components/backends/sglang/deploy/agg.yaml (1)
components/backends/sglang/deploy/agg.yaml (1)
35-36: LGTM; model swap is minimal and scoped. Add revision pin for reproducible CI.This change aligns with the PR’s goal of faster startup with a smaller model. To avoid upstream drift in CI, pin a specific HF snapshot.
Suggested tweak:
- --model-path Qwen/Qwen3-0.6B - --served-model-name Qwen/Qwen3-0.6B + --model-path ${MODEL_PATH:-Qwen/Qwen3-0.6B@${QWEN_REV:-<commit_sha>}} + --served-model-name ${SERVED_MODEL_NAME:-Qwen/Qwen3-0.6B}components/backends/sglang/deploy/disagg-multinode.yaml (3)
46-53: Double-check tp-size vs GPU topology for 0.6B.tp-size=8 with resources.limits.gpu="4" and multinode.nodeCount=2 likely targets 8-way TP across 8 total GPUs. With Qwen3-0.6B, 8-way TP may be unnecessary overhead and could reduce efficiency. Consider tp-size=1 or 2 unless there’s a measured benefit.
If you have perf numbers, please share. Otherwise, I can propose a sizing matrix for Qwen3-0.6B under your cluster constraints.
52-53: mem-fraction-static=0.82 may be overly conservative for a 0.6B model.For smaller models, you can usually reclaim memory for batching/throughput. Consider lowering to ~0.7 and re-measuring OR set to an empirically derived value.
Also applies to: 78-79
44-49: Security: --trust-remote-code is enabled.Expected for some HF repos, but confirm the SHA/pin if reproducibility is required in CI. Consider pinning a commit or mirror if supply-chain risk matters.
I can add an allowlist/pinning mechanism to the manifests if you’d like.
Also applies to: 71-76
components/backends/trtllm/deploy/disagg.yaml (1)
50-50: served-model-name with slashes may not be Triton-safe.If served-model-name is used as a Triton model repository name, slashes are invalid. Some codepaths sanitize automatically; others don’t. Consider a sanitized served-model-name (e.g., Qwen__Qwen3-0.6B) or add a sanitization layer.
If you want to keep CLI values unchanged, we can sanitize internally where the name is consumed.
components/backends/trtllm/deploy/disagg_router.yaml (1)
53-53: Decode worker: match engine to model and confirm name sanitation. Same concerns as Prefill: confirm decode.yaml aligns with Qwen3-0.6B, and ensure served-model-name is safe for downstream consumers (e.g., metrics, Triton model repo).
If sanitation is needed, I can add a simple normalization (replace / and : with __) in the launcher.
components/backends/trtllm/launch/agg.sh (1)
6-7: Optional: derive a Triton-safe SERVED_MODEL_NAME from MODEL_PATH by default. If downstream consumers treat served-model-name as an identifier (filesystem or metrics), slashes can be problematic. Consider auto-sanitizing when the var isn't explicitly set.

Apply this diff to keep the current MODEL_PATH default but sanitize SERVED_MODEL_NAME when unset:

```diff
 export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"${MODEL_PATH//\//__}"}
```

This yields Qwen__Qwen3-0.6B by default while preserving user overrides.

components/backends/trtllm/launch/disagg.sh (1)
components/backends/trtllm/launch/disagg.sh (1)
17-25: Fail fast and gate on clear_namespace for safer launches. The script proceeds even if clear_namespace fails, and it doesn't fail fast on other errors. Recommend adding strict mode and gating to avoid undefined states.

Apply:

```diff
 #!/bin/bash
+set -euo pipefail
@@
-python3 utils/clear_namespace.py --namespace dynamo
+python3 utils/clear_namespace.py --namespace dynamo || { echo "clear_namespace failed; aborting."; exit 1; }
```

Also applies to: 26-31
components/backends/trtllm/launch/agg_router.sh (2)
6-7: LGTM on switching defaults; confirm consumers tolerate slash in served name. The model swap aligns with CI speed goals. Please confirm any systems that serialize SERVED_MODEL_NAME (filenames, metrics keys) accept "Qwen/Qwen3-0.6B"; otherwise consider a slug.

Potential mitigation:

```diff
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen-Qwen3-0.6B"}
```

19-24: Make router startup contingent on namespace clearing. Per prior decision in deploy YAMLs, the router shouldn't start if the namespace isn't cleared. Mirror that here to reduce flakiness.

```diff
+set -euo pipefail
 python3 utils/clear_namespace.py --namespace dynamo
```

components/backends/trtllm/launch/disagg_router.sh (2)
6-7: Defaults updated correctly; please validate slash handling and strategy docs.
- Confirm slash in SERVED_MODEL_NAME doesn’t leak into file paths or identifiers.
- DISAGGREGATION_STRATEGY default is "prefill_first" in this router vs "decode_first" in disagg.sh. If intentional, consider a short comment to avoid confusion.
Optional served-name slug:

```diff
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
+export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen-Qwen3-0.6B"}
```

23-28: Gate router start on clear_namespace and fail fast. Keep behavior consistent with deploy flows; abort early if cleanup fails.

```diff
+set -euo pipefail
 python3 utils/clear_namespace.py --namespace dynamo
```

components/backends/sglang/docs/sgl-hicache-example.md (1)
14-14: Consider adding an explicit `--served-model-name` for consistency

SGLang's CLI launch omits `--served-model-name`, while downstream request examples assume "Qwen/Qwen3-0.6B." Making the served name explicit avoids mismatches if defaults ever change. Note: SGLang natively supports Qwen3 models, so a `--trust-remote-code` flag is not required here.

• File: components/backends/sglang/docs/sgl-hicache-example.md, line 14

```diff
 python -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
   --host 0.0.0.0 --port 8000 \
```

components/backends/sglang/launch/disagg.sh (2)
23-24: DRY the model identifiers via a single variable; easier future switches. Both --model-path and --served-model-name repeat the same literal. Centralize to MODEL to simplify maintenance across prefill/decode.

```diff
+# Allow override: MODEL and TP
+MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
+TP="${TP:-1}"
 ...
 python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+  --model-path "$MODEL" \
+  --served-model-name "$MODEL" \
   --page-size 16 \
-  --tp 1 \
+  --tp "$TP" \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl &
```

35-36: Apply the same MODEL variable to the decode worker; keep the served name identical. Mirrors the prefill change; prevents accidental divergence between workers.

```diff
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
+  --model-path "$MODEL" \
+  --served-model-name "$MODEL" \
   --page-size 16 \
-  --tp 1 \
+  --tp "$TP" \
   --trust-remote-code \
   --skip-tokenizer-init \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl
```
🔇 Additional comments (9)
components/backends/sglang/deploy/disagg_planner.yaml (3)
119-120: LGTM: Model updated to Qwen/Qwen3-0.6B (Decode worker). The change is consistent with the PR objective and other files.

145-146: LGTM: Model updated to Qwen/Qwen3-0.6B (Prefill worker). Matches Decode and keeps the deployment coherent.
121-124: Ensure CLI flag consistency for sglang manifests

The sglang manifests are currently mixed between two flags:

- `--tp <value>` (used in 6 manifests)
- `--tp-size <value>` (used only in disagg-multinode.yaml)

Run the following to see all occurrences:

```bash
rg -n -S -g 'components/**/sglang/**/*.yaml' -e '\-\-tp(\s|=)|\-\-tp-size(\s|=)'
```

Results:

- components/backends/sglang/deploy/agg_router.yaml:41 (`--tp 1`)
- components/backends/sglang/deploy/disagg.yaml:38, 65 (`--tp 1`)
- components/backends/sglang/deploy/disagg_planner.yaml:122, 148 (`--tp 1`)
- components/backends/sglang/deploy/agg.yaml:38 (`--tp 1`)
- components/backends/sglang/deploy/disagg-multinode.yaml:46, 73 (`--tp-size 8`)

Please verify which flag the sglang runtime (nvcr.io/.../sglang-runtime:hzhou-0811-1) expects:

- Run `docker run --rm nvcr.io/.../sglang-runtime:hzhou-0811-1 sglang --help | grep tp`
- Confirm whether `--tp` is an accepted alias or only `--tp-size` is supported.

Once confirmed, standardize all manifests to use the correct flag. If the runtime only supports `--tp-size`, replace every `--tp 1` with `--tp-size 1` (or the appropriate value) in the six affected YAML files. If both flags are supported, add a note to the deployment docs or scripts indicating that `--tp` is an alias for `--tp-size`.

components/backends/trtllm/deploy/disagg.yaml (1)
33-33: Verify engine artifacts rebuilt for Qwen3-0.6B

The engine_configs/prefill.yaml you're passing to the TRT-LLM launcher under `--extra-engine-args` is a generic template (no model names or engine paths are hard-coded), but it has not been specialized for Qwen3-0.6B in the same way we maintain per-model subfolders (e.g. deepseek_r1, llama4, gemma3, etc.). Without regenerating the TensorRT engines (or adding a Qwen3-0.6B specific config), you risk:

- Shape or vocabulary mismatches (hidden size, max sequence length)
- Incorrect parallelism or memory-fraction settings
- Runtime failures when loading or executing the engine binaries

Please confirm:

- You have rebuilt the TRT engines for Qwen/Qwen3-0.6B (e.g. via `trtllm build … --model-path Qwen/Qwen3-0.6B`) so that the .engine files reflect the new model's dimensions.
- The deploy command's `--served-model-name` and `--model-path` point to the directory containing those newly generated artifacts.
- If the default engine_configs/prefill.yaml flags need tuning (parallel sizes, max_seq_len, dtype, etc.), consider adding a components/backends/trtllm/engine_configs/qwen3-0.6B/ folder with model-specific overrides and reference that instead.

components/backends/trtllm/deploy/disagg.yaml:33

```diff
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B \
-    --served-model-name Qwen/Qwen3-0.6B \
-    --extra-engine-args engine_configs/prefill.yaml \
-    --disaggregation-mode prefill \
-    --disaggregation-strategy decode_first"
+ - "… --extra-engine-args engine_configs/qwen3-0.6B/prefill.yaml …"
```

components/backends/trtllm/deploy/disagg_router.yaml (1)
36-36: Confirm Qwen3-0.6B engine config and router strategy

I inspected engine_configs/prefill.yaml and discovered it's the generic prefill config with no Qwen-specific overrides, meaning there isn't a rebuilt version for Qwen3-0.6B. I also verified in disagg_router.yaml that both the Prefill and Decode workers are set to `--disaggregation-strategy prefill_first`.

• engine_configs/prefill.yaml is unchanged from the generic template; no Qwen3-0.6B tuning present.
• disagg_router.yaml uses `prefill_first` for both TRTLLMPrefillWorker and TRTLLMDecodeWorker.

Please confirm whether:

- A Qwen3-0.6B-specific prefill config should be added (or if the generic prefill.yaml is sufficient).
- The router's use of `prefill_first` for both workers is intentional, or if the Decode worker should use `decode_first` instead.

components/backends/trtllm/launch/disagg.sh (1)
6-7: Action Required: Confirm TRT-LLM config compatibility & slash handling in SERVED_MODEL_NAME
- Engine configuration files (engine_configs/prefill.yaml and engine_configs/decode.yaml) appear model-agnostic (no hard-coded architecture, tokenizer, or dtype overrides), but please verify they’re tuned for Qwen/Qwen3-0.6B’s characteristics to avoid performance or runtime regressions.
- SERVED_MODEL_NAME still contains a slash. It appears in:
• components/backends/trtllm/launch/disagg.sh
• examples/basics/multinode/README.md
• deploy/inference-gateway/vllm_agg_qwen.yaml
• lib/bindings/python/examples/hello_world/server_sglang_static.py
• benchmarks/profiler/utils/dynamo_deployment.py
• various docs and test fixtures under lib/llm/tests/data/replays
Double-check that downstream consumers—metrics label values, Dynamo namespaces, Kafka topic names, filenames—handle “/” correctly.- If any consumer fails on “/”, slugify the default served name:
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"} +export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen-Qwen3-0.6B"}
- To catch any remaining unsafe usages of “/” in served/model names, re-run:
rg -nP --glob '!**/node_modules/**' '(served|model).*name.*[/]' -C3components/backends/sglang/launch/agg_router.sh (1)
15-16: Block startup if namespace clearing fails (consistency with prior decision).Past learning indicates the router should not start when clear_namespace fails. Mirror that behavior here.
Add strict mode so failure of clear_namespace stops the script:
# Setup cleanup trap cleanup() { @@ -trap cleanup EXIT INT TERM +trap cleanup EXIT INT TERM +set -euo pipefailcomponents/backends/sglang/docs/sgl-hicache-example.md (2)
42-42: Confirm the request “model” string matches the served name.Given the explicit --served-model-name suggestion above, verify this value remains identical to avoid 404/not found errors from the OpenAI-compatible endpoint.
59-59: perf.sh uses Hugging Face model IDI confirmed that the
--modelflag in the perf harness is defined as “Hugging Face model ID to benchmark” and is passed directly togenai-perf profile(see the usage string and the--model ${model}invocation inperf.sh). There is no separate--served-model-nameparameter. The example--model Qwen/Qwen3-0.6B \is therefore correct.
• No changes required to the example value.
• Optional nit: add a brief note above the snippet stating that--modelexpects the HF repo ID (not a custom served-model-name).
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/backends/sglang/deploy/disagg.yaml (1)
62-69: Outstanding legacy model references detected; migration incomplete

The repository grep revealed numerous lingering references to `DeepSeek-R1-Distill-Llama-8B`, `meta-llama/Llama-3.3-70B-Instruct`, and other pre-migration model identifiers. These must be updated (or removed) to complete the migration to Qwen3 and avoid confusion or breakages.

Files/locations needing attention include (but are not limited to):
- README.md (e.g. lines 127, 134)
- lib/llm/tests/data/replays/deepseek-r1-distill-llama-8b/**
- tests/serve/**/*.py (trtllm, sglang, vllm)
- tests/kvbm and tests/fault_tolerance configurations
- docs/index.rst, docs/guides/, components/backends/vllm/, components/backends/sglang/**
- examples/deployments/router_standalone/**
- benchmarks/**
Once all legacy references are replaced with the appropriate new model names (e.g., `Qwen/Qwen3-0.6B` or slash-free variants), the migration will be consistent across the codebase.

Optional served-model-name normalization for this file (to prevent downstream string-handling issues):

```diff
 # components/backends/sglang/deploy/disagg.yaml
- --served-model-name Qwen/Qwen3-0.6B
+ --served-model-name Qwen3-0.6B
```

components/backends/sglang/deploy/disagg-multinode.yaml (1)
57-57: Namespace mismatch between decode and prefill services. Decode uses dynamoNamespace "sglang-disagg-multinode" while prefill uses "sglang-disagg". This will break discovery/routing across services.

Apply this diff to align the namespace:

```diff
- dynamoNamespace: sglang-disagg
+ dynamoNamespace: sglang-disagg-multinode
```
♻️ Duplicate comments (2)
components/backends/sglang/README.md (1)
196-196: Curl example updated to use the new model. The testing payload correctly references Qwen/Qwen3-0.6B, matching the model updates throughout the SGLang configuration files.

components/backends/sglang/deploy/disagg-multinode.yaml (1)

44-45: Critical: decode still points to Llama while prefill uses Qwen (model mismatch). Decode and prefill must use the exact same model for tokenizer/KV/routing compatibility. Update both flags to Qwen/Qwen3-0.6B.

Apply this diff:

```diff
- --model-path meta-llama/Llama-3.3-70B-Instruct
- --served-model-name meta-llama/Llama-3.3-70B-Instruct
+ --model-path Qwen/Qwen3-0.6B
+ --served-model-name Qwen/Qwen3-0.6B
```
🧹 Nitpick comments (14)
components/backends/sglang/docs/sgl-hicache-example.md (1)
14-23: Consider specifying --served-model-name to match the request payload. Your curl below sends "model": "Qwen/Qwen3-0.6B". If the frontend routes by served-model-name, explicitly setting it avoids mismatches with any defaults.

Apply this diff in the snippet:

```diff
 python -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
   --host 0.0.0.0 --port 8000 \
```

components/backends/sglang/launch/disagg.sh (2)
35-42: Decode pinned to GPU 1 may fail on single-GPU hosts; add a guard. Hard-coding CUDA_VISIBLE_DEVICES=1 will break on 1-GPU nodes. Add a simple GPU count check and choose a valid device, or allow K8s/device plugin to assign.

Add this near the top of the script:

```bash
# Detect GPUs and pick a safe device index for decode if needed
GPU_COUNT=$(nvidia-smi -L 2>/dev/null | wc -l | xargs)
if [[ -z "$GPU_COUNT" || "$GPU_COUNT" -lt 2 ]]; then
  echo "Only $GPU_COUNT GPU(s) detected; running both workers on GPU 0."
  export DECODE_CUDA_PREFIX=""
else
  export DECODE_CUDA_PREFIX="CUDA_VISIBLE_DEVICES=1 "
fi
```

Then change the decode launch:

```diff
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
+${DECODE_CUDA_PREFIX}python3 -m dynamo.sglang \
```
23-25: Optional: normalize served-model-name to a slash-free alias. Same rationale as TRT-LLM configs; avoid '/' in identifiers used by logs/metrics/routes, unless you've validated it across the stack.

```diff
- --served-model-name Qwen/Qwen3-0.6B \
+ --served-model-name Qwen3-0.6B \
```

If you keep the slash, confirm the frontend routes, metrics, and any file paths handle it safely.
Also applies to: 35-37
components/backends/trtllm/deploy/agg.yaml (1)
35-36: Model switch LGTM; re-validate engine configs and consider slash-free served name.
- Ensure engine_configs/agg.yaml is compatible with Qwen3 and rebuild engines as needed.
- Consider using Qwen3-0.6B for --served-model-name to avoid '/' issues.
Smoke test plan:
- Deploy agg and run a minimal prompt through the Frontend with "model": "".
- Check logs for tokenizer/engine load errors tied to Qwen3.
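A minimal version of that smoke test, as a sketch (it assumes the frontend exposes the OpenAI-compatible route on localhost:8000, as in the other examples, and that `jq` is available):

```bash
#!/bin/bash
# Smoke test: send one tiny prompt through the frontend; -f makes curl fail on non-2xx.
set -euo pipefail

curl -sf http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 8
      }' | jq -r '.choices[0].message.content'
```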
components/backends/sglang/deploy/agg_router.yaml (1)
38-43: Normalize served-model-name and revisit --skip-tokenizer-init trade-offs

- served-model-name containing a slash can complicate routing keys, dashboards, and metrics labels in some environments. Consider a normalized alias (e.g., qwen3-0.6b).
- With a small model, the benefit of --skip-tokenizer-init may be marginal vs. the cost of first-request latency. If CI stability matters more than cold-start time, consider removing it or making it env-tunable.

Apply if you want to adopt a stable alias and drop the skip flag:

```diff
- --served-model-name Qwen/Qwen3-0.6B
+ --served-model-name qwen3-0.6b
 ...
- --skip-tokenizer-init
```

If you keep the flag, can you confirm that first tokenization happens off the hot path (e.g., during a warm-up) so CI timings remain predictable?

components/backends/trtllm/launch/agg.sh (1)
components/backends/trtllm/launch/agg.sh (1)
6-7: Optional: provide a sanitized alias for SERVED_MODEL_NAMESome downstream systems dislike slashes in identifiers. Consider deriving a sanitized alias for metrics/routing while still loading the HF path from MODEL_PATH.
You could add this right after the export lines:
# Derived alias without slashes for metrics/routing; still load from MODEL_PATH export SERVED_MODEL_ALIAS="${SERVED_MODEL_NAME//\//_}" # Qwen_Qwen3-0.6B # Then pass --served-model-name "$SERVED_MODEL_ALIAS" to the worker if neededcomponents/backends/trtllm/deploy/disagg_router.yaml (1)
53-53: Add --publish-events-and-metrics to Decode worker for parityPrefill has --publish-events-and-metrics but Decode does not. Add it for consistent observability across both stages.
- - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy prefill_first" + - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy prefill_first --publish-events-and-metrics"components/backends/sglang/deploy/disagg_planner.yaml (2)
117-126: Pin HF revision and reduce trust-remote-code exposureFor reproducible CI and supply-chain safety:
- Pin a specific HF revision for Qwen/Qwen3-0.6B (e.g., via @).
- If possible, avoid --trust-remote-code; if it’s required by the model, at least pin the revision and set an allowlist.
- --model-path Qwen/Qwen3-0.6B + --model-path Qwen/Qwen3-0.6B@${HF_REVISION:-main} - --served-model-name Qwen/Qwen3-0.6B + --served-model-name qwen3-0.6bAdd an env to this service (or the namespace) like:
envs: - name: HF_REVISION value: "abcdef1" # pin to a known-good commitConfirm whether SGLang requires remote code for this model; if not, remove --trust-remote-code to tighten security.
143-151: Tokenizer init flag: verify cold-start vs predictability. --skip-tokenizer-init helps startup but shifts cost to the first request. With a 0.6B model, the saved time may be small; consider removing it or performing an explicit warm-up after pod readiness.
If you keep the flag, can you add a readiness/warm-up hook to run a trivial tokenize/generate to amortize the first-request spike?
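One possible shape for that warm-up, as a sketch (the endpoint, port, and `/v1/models` readiness probe are assumptions carried over from the other examples, not confirmed behavior of this deployment):

```bash
#!/bin/bash
# Warm-up: wait until the frontend answers, then issue one trivial generation
# so first-request tokenizer/engine costs are paid before real traffic arrives.
set -euo pipefail

until curl -sf http://localhost:8000/v1/models >/dev/null; do
  sleep 2
done

curl -sf http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}' \
  >/dev/null
echo "Warm-up complete."
```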
components/backends/trtllm/launch/disagg.sh (1)
6-7: Safer defaults on single-GPU nodes
Current defaults use GPU 0 for prefill and 1 for decode; this fails on 1-GPU runners. Consider auto-detecting GPU count and falling back to device 0 for both when only one GPU is present.
You can insert this after the export block:
```bash
# Auto-detect GPU count and adjust decode device if needed
GPU_COUNT=$(nvidia-smi -L | wc -l | tr -d ' ')
if [[ "${GPU_COUNT:-0}" -lt 2 && "${DECODE_CUDA_VISIBLE_DEVICES}" == "1" ]]; then
  echo "Only ${GPU_COUNT} GPU(s) detected; setting DECODE_CUDA_VISIBLE_DEVICES=0"
  export DECODE_CUDA_VISIBLE_DEVICES="0"
fi
```
components/backends/sglang/deploy/disagg-multinode.yaml (1)
48-48: Confirm intent to keep --skip-tokenizer-init for Qwen.
With Qwen3-0.6B, skipping tokenizer init is usually safe only if a central tokenizer (e.g., router/frontend) handles all tokenization. If workers ever tokenize (fallbacks, eval paths), this can cause runtime errors.
If workers should own tokenization, drop the flag:
```diff
-    - --skip-tokenizer-init
```

(Do this for both decode and prefill.)
Also applies to: 75-75
components/backends/trtllm/launch/agg_router.sh (1)
1-4: Optional: harden the script with strict mode.
Improve failure visibility and cleanup reliability.
```diff
 #!/bin/bash
+set -Eeuo pipefail
```
components/backends/trtllm/launch/disagg_router.sh (2)
14-21: Ensure decode is cleaned up on signals; avoid orphaned processes.
Trap currently kills frontend and prefill PIDs; decode runs in foreground and may survive certain termination paths. Track decode PID and include it in cleanup, or kill the whole process group.
Option A — background decode and wait:
```diff
 cleanup() {
     echo "Cleaning up background processes..."
-    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
-    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    kill $DYNAMO_PID $PREFILL_PID $DECODE_PID 2>/dev/null || true
+    wait $DYNAMO_PID $PREFILL_PID $DECODE_PID 2>/dev/null || true
     echo "Cleanup complete."
 }
 trap cleanup EXIT INT TERM
@@
 # run decode worker
 CUDA_VISIBLE_DEVICES=$DECODE_CUDA_VISIBLE_DEVICES python3 -m dynamo.trtllm \
     --model-path "$MODEL_PATH" \
     --served-model-name "$SERVED_MODEL_NAME" \
     --extra-engine-args "$DECODE_ENGINE_ARGS" \
     --disaggregation-mode decode \
     --disaggregation-strategy "$DISAGGREGATION_STRATEGY" \
-    "${EXTRA_DECODE_ARGS[@]}"
+    "${EXTRA_DECODE_ARGS[@]}" &
+DECODE_PID=$!
+wait $DECODE_PID
```

Option B — kill the whole process group on exit:
```diff
-trap cleanup EXIT INT TERM
+trap 'trap - EXIT; kill -- -$$' EXIT INT TERM
```

(Note: Option B is simpler but more forceful.)
Also applies to: 49-56
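For reference, a minimal, self-contained demo of the Option A pattern (generic shell sketch, not the actual script):

```bash
# Background a child, record its PID, and reap it in a single cleanup path.
worker() { sleep 30; }
worker &
CHILD_PID=$!
cleanup() { kill "$CHILD_PID" 2>/dev/null || true; wait "$CHILD_PID" 2>/dev/null || true; }
trap cleanup EXIT INT TERM
wait "$CHILD_PID"
```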
31-37: Publishing metrics logic is fine. Consider a single toggle for clarity.
Current conditional works; if you prefer explicitness, allow an env var like PUBLISH_METRICS=1 to override strategy-based defaults.
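One way to express that toggle (sketch; PUBLISH_METRICS and the EXTRA_PREFILL_ARGS array are hypothetical names):

```bash
# PUBLISH_METRICS=1 forces publishing, PUBLISH_METRICS=0 disables it,
# anything else falls through to the script's existing strategy-based default.
case "${PUBLISH_METRICS:-auto}" in
  1) EXTRA_PREFILL_ARGS+=(--publish-events-and-metrics) ;;
  0) ;;   # explicitly off
  *) ;;   # keep the strategy-based conditional as-is
esac
```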
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- components/backends/sglang/README.md (1 hunks)
- components/backends/sglang/deploy/agg.yaml (1 hunks)
- components/backends/sglang/deploy/agg_router.yaml (1 hunks)
- components/backends/sglang/deploy/disagg-multinode.yaml (1 hunks)
- components/backends/sglang/deploy/disagg.yaml (2 hunks)
- components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
- components/backends/sglang/docs/sgl-hicache-example.md (3 hunks)
- components/backends/sglang/launch/agg.sh (1 hunks)
- components/backends/sglang/launch/agg_router.sh (2 hunks)
- components/backends/sglang/launch/disagg.sh (2 hunks)
- components/backends/trtllm/deploy/agg.yaml (1 hunks)
- components/backends/trtllm/deploy/agg_router.yaml (1 hunks)
- components/backends/trtllm/deploy/disagg.yaml (2 hunks)
- components/backends/trtllm/deploy/disagg_router.yaml (2 hunks)
- components/backends/trtllm/launch/agg.sh (1 hunks)
- components/backends/trtllm/launch/agg_router.sh (1 hunks)
- components/backends/trtllm/launch/disagg.sh (1 hunks)
- components/backends/trtllm/launch/disagg_router.sh (1 hunks)
🔇 Additional comments (24)
components/backends/sglang/launch/agg_router.sh (2)
23-24: Model references updated correctly for SGLang worker.
The SGLang worker invocation now uses the smaller Qwen/Qwen3-0.6B model, which aligns with the PR objective to reduce CI resource usage and startup time.
33-34: Model references updated correctly for second worker instance.
The second SGLang worker instance (on CUDA device 1) now uses the smaller Qwen/Qwen3-0.6B model, maintaining consistency with the first worker.
components/backends/sglang/deploy/agg.yaml (1)
35-36: Kubernetes deployment config updated consistently.
Both --model-path and --served-model-name are correctly updated to use Qwen/Qwen3-0.6B in the SGLangDecodeWorker container args, maintaining consistency with other deployment configurations.
components/backends/sglang/launch/agg.sh (2)
23-24: Model references updated correctly.
The worker invocation now uses the smaller Qwen/Qwen3-0.6B model for both --model-path and --served-model-name, which supports the PR's goal of reducing resource usage.
28-28: Ignore incorrect removal concern
A search across the SGLang backend confirms that the --skip-tokenizer-init flag remains present in components/backends/sglang/launch/agg.sh (line 28) and in numerous other scripts and YAML definitions. The PR has not removed this flag, so no behavior change or startup-time impact related to its removal will occur.
- components/backends/sglang/launch/agg.sh, line 28: --skip-tokenizer-init is still included
- Similar occurrences in deploy and launch scripts (e.g., disagg.sh, agg_router.sh) and in YAML configs
Likely an incorrect or invalid review comment.
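To reproduce that check locally, something like:

```bash
# List every remaining use of the flag across the SGLang backend.
grep -rn -- "--skip-tokenizer-init" components/backends/sglang/
```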
components/backends/trtllm/deploy/agg_router.yaml (1)
38-39: TRTLLM deployment config updated consistently.
Both --model-path and --served-model-name are correctly updated to use Qwen/Qwen3-0.6B in the TRTLLMWorker container args, maintaining consistency across both SGLang and TRTLLM backends.
components/backends/sglang/docs/sgl-hicache-example.md (3)
14-22: Switching to Qwen/Qwen3-0.6B looks correct and aligns with the PR goal.
The model path change is appropriate for a lightweight CI/startup profile. Keeping --skip-tokenizer-init is fine for cold-start wins. Note that the first request will lazily init the tokenizer.
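If you want to see the lazy-init cost, compare first and second request latencies (sketch; endpoint and port assumed):

```bash
# The first request pays tokenizer init; the second should be noticeably faster.
for i in 1 2; do
  time curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "hi", "max_tokens": 1}' > /dev/null
done
```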
42-42: Payload model name matches the new HF repo.
The Hugging Face repo Qwen/Qwen3-0.6B exists; keeping the payload model consistent with served-model-name will ensure correct routing. (huggingface.co)
To ensure the router matches on name, confirm the frontend expects an exact served-model-name match. If it strips slashes, prefer a dash-only alias (e.g., Qwen3-0.6B) and adjust the payload accordingly.
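A quick way to confirm the exact name the frontend will match on (assumes an OpenAI-compatible /v1/models listing):

```bash
# The "id" fields returned here are the names the router matches against.
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```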
59-59: No filesystem paths derived from --model, slash-safe
I've reviewed benchmarks/llm/perf.sh and confirmed:
- Artifact directories are created as artifacts_root_dir/artifacts_&lt;index&gt; via a numeric counter, with no use of the model identifier in the path.
- The --model value is only passed through to genai-perf and embedded in the JSON config; it isn't used when constructing any filesystem paths.

Therefore, using a Hugging Face model ID that contains / (e.g. Qwen/Qwen3-0.6B) will not be treated as a directory separator in this script.
components/backends/trtllm/deploy/disagg.yaml (2)
33-33: Confirm TRT-LLM engine configs are compatible with Qwen3.
You're passing --extra-engine-args engine_configs/prefill.yaml built previously. Ensure these configs (and any prebuilt engines) were regenerated for Qwen3 architecture/tokenizer; otherwise runtime/build will fail at load.
Suggested checks:
- Rebuild engines with the Qwen/Qwen3-0.6B checkpoint.
- Verify vocab/tokenizer settings and rope scaling match Qwen3 specs.
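As a lightweight pre-check before a full engine rebuild (sketch; assumes the transformers library is installed), confirm the checkpoint's config and tokenizer resolve cleanly:

```bash
# Fails fast if the Qwen3 config/tokenizer can't be resolved.
python3 - <<'EOF'
from transformers import AutoConfig, AutoTokenizer
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
print(cfg.model_type, cfg.vocab_size, len(tok))
EOF
```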
50-50: Please verify slash usage for --served-model-name across all components
We scanned the repo and found that
--served-model-name is used with slash-containing values in scores of scripts, YAML manifests, tests, and documentation. Changing this one invocation in components/backends/trtllm/deploy/disagg.yaml without updating the others will likely break:
• Bash launch scripts under
components/backends/trtllm/launch/
• Kubernetes/Slurm job manifests under components/backends/trtllm/deploy/
• Performance-sweep helpers and their SLURM wrappers
• Tests in tests/serve/test_trtllm.py (which read SERVED_MODEL_NAME)
• Examples and READMEs in both components/backends/sglang and components/backends/trtllm

Consider one of two approaches:
Central sanitization in code
Modify trtllm_utils.py (or wherever metrics/resource-names are generated) to automatically replace / with a safe character (e.g. -) when deriving filenames or labels from the served name. This avoids per-file edits.

Consistent aliasing via env/args
If you opt to supply a slash-free alias manually, update every script and manifest together. For this file, an example optional diff:

```diff
-  - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy decode_first"
+  - "python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen3-0.6B --extra-engine-args engine_configs/decode.yaml --disaggregation-mode decode --disaggregation-strategy decode_first"
```

• If you proceed with manual aliasing, be sure to update the 50+ other occurrences (e.g., in
launch/*.sh, deploy/*.yaml, tests/serve, and various .md docs).
• If you prefer automatic sanitization, please verify that any new logic correctly handles edge cases (multiple slashes, unusual characters) across all backends.Let us know which path you choose so we can ensure consistency.
components/backends/sglang/launch/disagg.sh (1)
23-31: Prefill: model switch + disagg flags are consistent.Good use of --skip-tokenizer-init for faster cold start and explicit disaggregation flags with NIXL.
Ensure the runtime image has NIXL support enabled; otherwise, transfer-backend nixl will error out.
components/backends/sglang/deploy/disagg.yaml (1)
35-42: Decode worker: settings are coherent with the Qwen3 switch.Flags look consistent: page-size, tp, trust-remote-code, skip-tokenizer-init, decode mode, and nixl transfer backend.
Confirm the router and client payloads use the same served-model-name string to avoid 404 route misses.
components/backends/sglang/deploy/agg_router.yaml (1)
38-39: Model switch to Qwen3-0.6B looks goodThis aligns with the PR objective to reduce CI/startup cost. No other behavioral flags changed here.
components/backends/trtllm/launch/agg.sh (1)
6-7: Defaults updated to Qwen3-0.6B — LGTMMatches the deployment YAMLs and keeps CLI/env symmetry.
components/backends/trtllm/deploy/disagg_router.yaml (2)
36-36: Prefill worker model target updated — looks consistentThe model-path and served-model-name changes are aligned with the rest of the PR.
36-53: Ignore weight/tokenizer references in engine_configs; verify CLI flags insteadThe
prefill.yamlanddecode.yamlundercomponents/backends/trtllm/engine_configs/only define engine parameters (parallelism, cache, scheduler, etc.) and do not embed any model-weight or tokenizer paths. Instead, the model artifacts are specified via the--model-pathand--served-model-nameflags in your deployment YAML.• In
components/backends/trtllm/deploy/disagg_router.yaml(lines 36–53), you correctly pass:--model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6Bwhich ensures the Qwen3-0.6B weights and tokenizer are loaded.
• No changes are needed inengine_configs/prefill.yamlorengine_configs/decode.yaml—they should remain engine-only settings.
• Confirm separately that the converted Qwen3-0.6B weights and tokenizer assets exist at the specified model path (e.g., on Hugging Face or your local registry).Likely an incorrect or invalid review comment.
components/backends/sglang/deploy/disagg_planner.yaml (2)
119-120: Decode worker switched to Qwen3-0.6B — goodMatches the rest of the migration, no conflicting flags here.
145-146: Prefill worker switched to Qwen3-0.6B — goodConsistent with decode worker and other files in the PR.
components/backends/trtllm/launch/disagg.sh (2)
6-7: Defaults updated to Qwen3-0.6B — LGTMKeeps parity with other launchers and deploy YAMLs.
6-7: Engine config regen needed for Qwen3-0.6B
I inspected the top-level engine configs undercomponents/backends/trtllm/engine_configsand didn’t find any model‐specific entries or mentions of “Qwen” (nor “DeepSeek”) in eitherprefill.yamlordecode.yaml. This suggests they haven’t been regenerated with the new model’s vocab, rope, and kv‐cache settings.Please manually verify and, if needed, regenerate both files for Qwen3-0.6B to ensure:
- The correct
vocab_file(or tokenizer) path is set.- The rotary embedding parameters (
rope_base,rope_theta_scale, etc.) match Qwen3-0.6B.- The
kv_cache_formataligns with the model’s cache implementation.Compare against a known working model config (e.g.
engine_configs/deepseek_r1/simple/prefill.yaml) and run your usual regeneration script (e.g. tools/regenerate_engine_config.py --model Qwen/Qwen3-0.6B ...) to avoid subtle runtime errors.
components/backends/sglang/deploy/disagg-multinode.yaml (1)
46-47: Verify TP/gpu topology across multinode.
tp-size is 8 while each pod requests 4 GPUs and multinode nodeCount is 2. If the intent is 8-way TP across 2 nodes (4 GPUs each), confirm NCCL/Gloo env and topology are set accordingly and that both decode/prefill pods land symmetrically.
Also applies to: 61-63, 73-74
components/backends/trtllm/launch/agg_router.sh (1)
6-7: No lingering DeepSeek/Llama defaults in launch/deploy scripts
I ran the provided grep over all TRT-LLM files and confirmed that the only matches for "DeepSeek", "deepseek-ai", "meta-llama" or "Llama-3" appear in Markdown docs (performance_sweeps/, multinode guides, README.md). None of the shell scripts in components/backends/trtllm/launch or components/backends/trtllm/deploy contain those references, and
agg_router.sh is correctly set to Qwen/Qwen3-0.6B. All good to merge.
components/backends/trtllm/launch/disagg_router.sh (1)
6-7: Defaults updated to Qwen3-0.6B: aligns with the migration.
Signed-off-by: Hannah Zhang <[email protected]>
Signed-off-by: Krishnan Prashanth <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Overview:
Use the smaller Qwen/Qwen3-0.6B model instead of deepseek-ai/DeepSeek-R1-Distill-Llama-8B for CI tests and faster startup time. This also fixes CI timeout issues.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Chores
Documentation