docs: add Llama4 eagle3 one model example and configs #2087
Conversation
👋 Hi jhaotingc! Thank you for contributing to ai-dynamo/dynamo.
**Walkthrough**

This update introduces three new YAML configuration files for the Eagle One Model under the Llama4 backend in the TRTLLM engine, covering prefill, decode, and aggregate scenarios with Eagle3 one-model speculative decoding. Additionally, documentation is expanded to describe these configurations and provide example usage instructions.
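For orientation, a minimal sketch of the shape these configs take is shown below. Only `eagle3_one_model`, `pytorch_weights_path`, `max_num_tokens`, and `max_seq_len` (with their values) come from this review; the remaining keys are assumptions, not the actual file contents.

```yaml
# Hedged sketch of an Eagle3 one-model engine config (not the real file).
tensor_parallel_size: 8            # assumption: sized for one 8xH100 node
max_num_tokens: 2048               # cap mentioned in the decode-config comment
max_seq_len: 8704                  # 8192 ISL + 512 OSL, per the inline comment
speculative_config:
  decoding_type: Eagle             # assumption: TRT-LLM Eagle decoding type
  eagle3_one_model: true
  pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3
```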
**Estimated code review effort:** 🎯 1 (Trivial) | ⏱️ ~3 minutes
Actionable comments posted: 0
🔭 Outside diff range comments (1)
components/backends/trtllm/llama4_plus_eagle.md (1)
**65-78: Update paths to new one-model configs**

The examples still point to `.../eagle/eagle_*.yaml`, while the new files live under `eagle_one_model/` (and one has a `.yml` extension). Using these commands verbatim will raise a file-not-found error. Please sync the doc paths with the actual filenames.
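For example, if a launch snippet in the doc selects the engine config through a shell variable (variable name hypothetical, shown only to illustrate the path change), the fix would look like:

```diff
-ENGINE_CONFIG="engine_configs/llama4/eagle/eagle_decode.yaml"
+ENGINE_CONFIG="engine_configs/llama4/eagle_one_model/eagle_decode.yaml"
```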
🧹 Nitpick comments (6)
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (2)
**25-31: Unify boolean casing for `eagle3_one_model` across configs**

`eagle3_one_model` is set to `True` here, while the same flag is lowercase (`true`) in `eagle_agg.yml`. YAML parsers usually accept both, but the mixed style hurts readability and can trip up simple grep/templating tools.

```diff
- eagle3_one_model: True
+ eagle3_one_model: true
```
**1-35: Extension inconsistency may confuse users**

This file uses `.yaml`, while `eagle_agg.yml` uses `.yml`. Pick one extension for the whole tree to avoid broken glob patterns in deployment scripts.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1)
**1-41: File extension differs from sibling configs**

This file is `.yml`; the other two are `.yaml`. Align for consistency (rename to `.yaml` or vice versa), for instance as sketched below.
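Normalizing on `.yaml` would be a one-file rename (either direction works, as long as all references are updated too):

```diff
-components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
+components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yaml
```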
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (1)

**25-31: Match boolean style with other configs**

Same flag/style issue as the prefill config.

```diff
- eagle3_one_model: True
+ eagle3_one_model: true
```
**37-41: Fix typos & markdown-lint warnings**

- "congis" → "configs"
- Bullet list uses `*` but the rest of the doc uses `-`
- Grammar: "may got ran" → "might be run"

```diff
-* The congis in `engine_configs/llama4/eagle_one_model` are tested with 8xH100 cluster. Be sure to change the `NUM_GPUS_PER_NODE` accordingly or change TP/EP size in config. 1 8xH100 node for aggregated .yml file, 2 8xH100 for prefill/decode .yml file.
-* The current `./multinode/start_frontend_services.sh` may got ran `NUM_GPUS_PER_NODE` times depending on how srun/mpi is launched, beware that the frontend service only needs to be ran once.
+- The configs in `engine_configs/llama4/eagle_one_model` were validated on an 8×H100 cluster. Adjust `NUM_GPUS_PER_NODE` or the TP/EP sizes as needed (1 node for aggregated, 2 nodes for prefill/decode).
+- The current `./multinode/start_frontend_services.sh` might be run `NUM_GPUS_PER_NODE` times depending on how *srun* / MPI is launched; the frontend service only needs to start once.
```
**86-98: Add language tag to fenced code block**

Markdown-lint (MD040) flags the block; adding `bash` improves rendering and syntax highlighting.

````diff
-```
+```bash
 ...
````
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (1 hunks)
- components/backends/trtllm/llama4_plus_eagle.md (2 hunks)
🧰 Additional context used
🧠 Learnings (2)

Learnt from: ptarasiewiczNV
PR: #2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The --torch-backend=auto flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.

Learnt from: tanmayv25
PR: #1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.
🪛 LanguageTool
components/backends/trtllm/llama4_plus_eagle.md
[grammar] ~39-~39: Ensure spelling is correct
Context: ...TensorRT-LLM/commits/v1.0.0rc2/). * The congis in `engine_configs/llama4/eagle_one_mod...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.17.2)
components/backends/trtllm/llama4_plus_eagle.md
38-38: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
39-39: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
40-40: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
86-86: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (3)
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (1)
**16-24: Consider adding `cuda_graph_config` for parity**

The prefill engine is missing a `cuda_graph_config` block that is present in the decode/agg configs. If this omission is intentional (e.g., graphs offer no benefit for batch-size 1), leave a short comment so future maintainers don't assume it was forgotten.
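Either remedy would be a few lines in the prefill config; this sketch shows both, with the key names under `cuda_graph_config` being assumptions rather than values copied from the decode/agg files:

```yaml
# Option A: add the block for parity with decode/agg (values hypothetical).
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4]

# Option B: keep it omitted, but say why, e.g.:
# cuda_graph_config intentionally omitted: prefill handles large,
# variable-length batches where graph capture offers little benefit.
```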
25-31: Path & weight availability check
pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3assumes the model is public and the runtime has HF creds. If the weight is gated/private, add a comment or doc note; otherwise deployment will fail silently.components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (1)
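One way to make that assumption explicit in the config itself (the parent key and comment wording are illustrative):

```yaml
speculative_config:
  eagle3_one_model: true
  # Requires access to the HF repo below. If the repo is gated, export
  # HF_TOKEN first, or pre-download the weights and point this at a local path.
  pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3
```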
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (1)

**19-22: Validate `max_seq_len` comment**

The inline comment says `8704 = 8192 ISL + 512 OSL`; confirm the serving stack actually supports 512 OSL tokens when `max_num_tokens` is capped at 2048. If not, adjust `max_seq_len` or add an explanatory comment.
Overview:
Tested functionality on EOS (8xH100 cluster).
Summary by CodeRabbit

- **New Features**: Eagle3 one-model speculative-decoding configs for Llama4 on the TRTLLM backend, covering prefill, decode, and aggregate deployments.
- **Documentation**: Expanded guide describing the new configurations, with example usage instructions.