[None][doc] Add table with one-line deployment commands to quickstart #8173
base: main
Conversation
/bot run
📝 Walkthrough

Documentation reorganized to replace quick-starts with deployment guides, standardize YAML-based configuration references, and simplify launch commands. Added multiple example config YAMLs and updated examples to use them. Implemented a helper to auto-update version placeholders across all Markdown files. Minor textual fixes in a benchmark doc.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Dev as Developer
participant Helper as docs/source/helper.py
participant FS as Docs Markdown (*.md)
participant Ver as tensorrt_llm/version.py
Dev->>Helper: run update_version()
Helper->>Ver: load __version__
Ver-->>Helper: return version (e.g., x.y.z)
Helper->>FS: glob all Markdown files
loop For each Markdown file
Helper->>FS: read content
Helper->>Helper: replace version placeholders/URLs
Helper->>FS: write updated content
end
Helper-->>Dev: completed updates
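For readers skimming the walkthrough, a rough shell equivalent of what `docs/source/helper.py` is described as doing is sketched below. This is not the actual implementation; the literal `x.y.z` placeholder and the glob scope are assumptions made for illustration.

```shell
# Rough sketch only -- not the real docs/source/helper.py logic.
# Assumes the Markdown files use a literal "x.y.z" placeholder for the
# container version; the real placeholder/URL patterns may differ.
VERSION=$(python3 -c "import runpy; print(runpy.run_path('tensorrt_llm/version.py')['__version__'])")
find docs/source -name '*.md' -print0 | xargs -0 sed -i "s/x\.y\.z/${VERSION}/g"
```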
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (7)
docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)
8-8: Align “OpenAI-compatible” capitalization. Other occurrences in this doc (and surrounding docs) use “OpenAI-compatible” with a lowercase “c”. Keeping this bullet consistent avoids jitter in the rendered text.

Apply this diff:

- * Launch the OpenAI-Compatible Server with NGC container
+ * Launch the OpenAI-compatible server with NGC container

examples/configs/qwen3-disagg-prefill.yaml (1)
1-9: LGTM! Configuration appropriate for disaggregated prefill scenario. The configuration is valid. Note that this file shares ~90% of settings with `qwen3.yaml` (same max_batch_size, max_num_tokens, kv_cache settings, etc.), differing mainly in `trust_remote_code: true` and the absence of explicit CUDA graph batch sizes.

Consider whether these configs could leverage YAML anchors/aliases or a shared base config to reduce duplication while maintaining clarity. However, keeping them separate may be preferable for documentation clarity and ease of use.
examples/configs/llama-4-scout.yaml (1)
1-13: LGTM! Configuration is correct. This configuration is identical to `llama-3.3-70b.yaml`. While this duplication may be intentional for model-specific discoverability and ease of use, consider whether a shared base configuration or YAML anchors could reduce maintenance overhead (a rough sketch of the anchor idea follows below).

If Llama-4 Scout and Llama-3.3-70B share identical deployment characteristics, you could:

- Use a shared base config with model-specific overrides, or
- Add a comment explaining why the configs are identical despite different model names

However, keeping them separate may be preferable for user experience and documentation clarity.
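As a purely illustrative sketch of the anchor/alias idea raised in the two comments above: anchors only deduplicate settings within a single YAML document, so this is not a drop-in replacement for the separate per-model files, and every value shown is a placeholder rather than a tuned setting from the real configs.

```shell
# Hypothetical sketch: YAML anchors/aliases sharing a base block inside one file.
# Values are placeholders, not the actual tuned settings.
cat <<'EOF' > shared-config-sketch.yaml
shared: &shared
  kv_cache_config:
    dtype: fp8            # placeholder; the real configs carry more settings
llama-3.3-70b:
  <<: *shared             # merge the shared block, override per model as needed
llama-4-scout:
  <<: *shared
EOF
```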
docs/source/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md (1)
89-96: Clarify the YAML key name. The heading shows `backend pytorch` without a colon, but the YAML option is `backend: pytorch`. Please add the colon (or separate the value) so readers copy the correct key/value form.

docs/source/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md (1)
91-97: Add the colon to the `backend` key. Like the other guides, this heading should read `backend: pytorch` so the YAML syntax is accurate. Please update the inline code snippet accordingly.

docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md (1)
105-112: Fix the `backend` YAML notation. Please change the inline code to `backend: pytorch` so the example matches valid YAML syntax.

docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md (1)
102-110: Use proper YAML syntax for `backend`. The inline code should read `backend: pytorch`; without the colon it’s misleading. Please update the heading to show the correct key/value form.
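Since all four nitpicks above are about the same punctuation slip, here is a minimal copy-pasteable sketch of the corrected key/value form; the file name is made up for the example, and the model is borrowed from another snippet in this PR rather than taken from these guides.

```shell
# Minimal sketch of the corrected YAML form flagged above.
cat <<'EOF' > extra-llm-api-options.yaml
backend: pytorch
EOF
# Model name reused from another example in this PR purely for illustration.
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options extra-llm-api-options.yaml
```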
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1 hunks)
- docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md (3 hunks)
- docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md (3 hunks)
- docs/source/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md (3 hunks)
- docs/source/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md (3 hunks)
- docs/source/deployment-guide/index.rst (1 hunks)
- docs/source/helper.py (1 hunks)
- docs/source/quick-start-guide.md (4 hunks)
- examples/configs/README.md (1 hunks)
- examples/configs/deepseek-r1-deepgemm.yaml (1 hunks)
- examples/configs/deepseek-r1-latency.yaml (1 hunks)
- examples/configs/deepseek-r1-throughput.yaml (1 hunks)
- examples/configs/gpt-oss-120b-latency.yaml (1 hunks)
- examples/configs/gpt-oss-120b-throughput.yaml (1 hunks)
- examples/configs/llama-3.3-70b.yaml (1 hunks)
- examples/configs/llama-4-scout.yaml (1 hunks)
- examples/configs/qwen3-disagg-prefill.yaml (1 hunks)
- examples/configs/qwen3.yaml (1 hunks)
- examples/models/core/qwen/README.md (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
docs/source/helper.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py:
Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
docs/source/helper.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
docs/source/helper.py
🔇 Additional comments (9)
examples/configs/deepseek-r1-latency.yaml (1)
1-15: Latency profile config looks solid. Coverage of TP, EP, KV cache, and relaxed MTP params matches the intended low-latency tuning. Thanks for keeping it compact.
examples/configs/deepseek-r1-throughput.yaml (1)
1-17: Throughput configuration LGTM. High-batch capture, fp8 KV cache, and MTP params are coherently set for throughput workloads.
examples/configs/deepseek-r1-deepgemm.yaml (1)
1-20: DeepGEMM profile matches expectations. Extends the throughput template with the DEEPGEMM MOE backend while keeping capture/KV/MTP settings aligned; looks good end to end.
examples/configs/llama-3.3-70b.yaml (1)
1-13: LGTM! Configuration looks appropriate for Llama-3.3-70b. The settings are well-balanced for a 70B model deployment with single-GPU configuration (TP=1) and fp8 KV cache optimization.
examples/configs/README.md (1)
1-5: LGTM! Clear and helpful documentation. The README effectively explains the purpose of the config files and how to use them with `trtllm-serve`.

docs/source/deployment-guide/index.rst (1)
9-12: LGTM! Straightforward rename aligning with the new documentation structure. The change from "quick-start-recipe" to "deployment-guide" improves clarity and consistency.
examples/configs/qwen3.yaml (1)
1-21: LGTM! Well-configured for Qwen3 with granular CUDA graph batch sizes. The explicit batch_sizes list in `cuda_graph_config` enables optimized graph caching for common batch sizes.

examples/configs/gpt-oss-120b-latency.yaml (1)
1-15: LGTM! Well-tuned latency configuration for GPT-OSS 120B. The high parallelism settings (TP=8, EP=8) and latency-focused parameters (`stream_interval: 20`, `num_postprocess_workers: 4`) are appropriate for a large MoE model deployment.
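For reference, the settings called out in this comment roughly amount to the following sketch; treat it as an approximation, not a copy of `examples/configs/gpt-oss-120b-latency.yaml`, which may contain additional options.

```shell
# Approximate reconstruction of the latency-oriented settings mentioned above.
cat <<'EOF' > gpt-oss-120b-latency-sketch.yaml
tensor_parallel_size: 8
moe_expert_parallel_size: 8
stream_interval: 20
num_postprocess_workers: 4
EOF
```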
docs/source/helper.py (1)

348-370: Confirm targeted version replacement. 6 of 120 Markdown files contain the placeholder and will be updated; scanning all .md files has negligible performance impact. Verify that no additional files need this update and that no exclusions are necessary.
| [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | Any | Max Throughput | [gpt-oss-120b-throughput.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-throughput.yaml) | `trtllm-serve openai/gpt-oss-120b --extra_llm_api_options /app/tensorrt_llm/examples/configs/gpt-oss-120b-throughput.yaml` |
| [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | Any | Min Latency | [gpt-oss-120b-latency.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-latency.yaml) | `trtllm-serve openai/gpt-oss-120b --extra_llm_api_options /app/tensorrt_llm/examples/configs/gpt-oss-120b-latency.yaml` |
Correct GPU requirements for GPT-OSS 120B.

Labeling the GPU requirement as “Any” is incorrect. The referenced config (`gpt-oss-120b-throughput.yaml`) sets `tensor_parallel_size: 8` and `moe_expert_parallel_size: 8`, which assumes clustered Blackwell-class GPUs (e.g., B200/GB200) with sufficient memory and interconnect. Please update the table to list the actual supported GPU SKUs and parallelism expectations so users don’t attempt this on unsupported hardware.
🤖 Prompt for AI Agents
docs/source/quick-start-guide.md around lines 111-112: the table incorrectly
lists the GPU requirement for gpt-oss-120b as "Any"; update the GPU column and
optionally an adjacent notes column to reflect the actual supported SKUs and
required parallelism (e.g., "Blackwell-class GPUs (B200/GB200) with
NVLink/High-speed interconnect; tensor_parallel_size: 8,
moe_expert_parallel_size: 8") and ensure the throughput/latency rows reference
these requirements so users know the memory and networking expectations before
attempting to run the provided YAML configs.
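For anyone trying one of the table rows end to end, a hedged sketch is below; the default host/port are assumptions (port 8000 matches the serve examples elsewhere in this PR), and the hardware caveats from this comment still apply.

```shell
# Launch one of the table's one-line commands, then query the OpenAI-compatible API.
# Adjust the model name, host, and port for your deployment.
trtllm-serve openai/gpt-oss-120b \
  --extra_llm_api_options /app/tensorrt_llm/examples/configs/gpt-oss-120b-throughput.yaml &

# Once the server reports it is ready:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello!"}]}'
```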
backend: DEEPGEMM
max_num_tokens: 3200
EOF
EXTRA_LLM_API_FILE=/app/tensorrt_llm/examples/configs/deepseek-r1-deepgemm.yaml
Maybe better to parametrize as `TRTLLM_ROOT` or `CODE_DIR` instead of `/app/tensorrt_llm`? The exact root is user/container-specific, right? I know that pulling the official docker scripts will mount tensorrt_llm in `/code/tensorrt_llm`, but I personally put it in `$HOME/tensorrt_llm` in my dev workflow, etc.
Just to confirm, TRTLLM_ROOT / CODE_DIR would still have to be defined manually right?
yepyep
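A small sketch of the parametrization discussed in this thread; `TRTLLM_ROOT` is just the hypothetical variable name proposed above, defaulting to the container path used in the guide.

```shell
# Hypothetical parametrization: default to the container path, but let developers
# point at their own checkout (e.g. /code/tensorrt_llm or $HOME/tensorrt_llm).
TRTLLM_ROOT="${TRTLLM_ROOT:-/app/tensorrt_llm}"
EXTRA_LLM_API_FILE="${TRTLLM_ROOT}/examples/configs/deepseek-r1-deepgemm.yaml"
```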
@anish-shanbhag Thank you for elevating the other parts of the documentation by fixing typos and issues even if you didn't have to 😄
…kstart Signed-off-by: Anish Shanbhag <[email protected]>
93947ce to db38972 (Compare)
We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.

```shell
EXTRA_LLM_API_FILE=/app/tensorrt_llm/examples/configs/qwen3.yaml
```
@nv-guomingz in the TRT-LLM NGC container, will the newly added .yaml files be part of it?
yes, those newly added .yaml files will be part of it, please refer to this.
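One quick way to confirm that locally; the image path and tag below are assumptions, so substitute whichever TensorRT LLM NGC release container you actually pull.

```shell
# Assumed NGC image path/tag -- replace with the container you use.
TRTLLM_IMAGE=nvcr.io/nvidia/tensorrt-llm/release:latest
docker run --rm "${TRTLLM_IMAGE}" ls /app/tensorrt_llm/examples/configs
```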
Hi Anish, thanks for submitting the PR. @litaotju @nv-guomingz Hi Tao/Guoming, this PR touches files you are familiar with; can you also help review it? Thanks
## Launch Docker on a node with NVIDIA GPUs deployed
## Launch Docker Container
I believe the original version is much simpler. The principle for the quick start guide is to make it as simple as possible.
You can also directly load pre-quantized models [quantized checkpoints on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) in the LLM constructor.
To learn more about the LLM API, check out the [](llm-api/index) and [](examples/llm_api_examples).

## Quick Start for Popular Models
I prefer to move this part into the deployment-guide section.
@laikhtewari could you please comment on this change to quick-start-guide.md?
os.path.join(os.path.dirname(__file__),
"../../tensorrt_llm/version.py"))
"""Replace the placeholder container version in all docs source files."""
version_path = (Path(__file__).parent.parent.parent / "tensorrt_llm" /
@litaotju, this line change will update the deployment guide folder from the fixed version 1.0.0rc6 to the latest release version, e.g., 1.20.0rc1. Is that the expected behavior?
--ep_size 1 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE}
Description

This is a first step in making it easier for users to leverage known best LLM API configs for popular models. The PR makes a few main changes:

- The deployment guides previously defined their recommended YAML configs inline before launching `trtllm-serve`. This change moves all of these configs into a dedicated `examples/configs` directory which is available automatically in the TRTLLM container.
- The guides previously mixed `trtllm-serve` CLI options and LLM API options for configuration; this change aims to standardize around keeping all options within the config files.
- Adds a "Quick Start for Popular Models" table within the Quick Start Guide that contains one-line `trtllm-serve` commands to deploy popular models including DSR1, gpt-oss, etc.

Subsequent changes will aim to streamline this even further, including:

- Updating `trtllm-serve` to automatically leverage these configs when possible

The table looks like this in the rendered docs:
Summary by CodeRabbit

- New Features
- Documentation
- Chores

Test Coverage

N/A
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.