
Conversation


@anish-shanbhag anish-shanbhag commented Oct 7, 2025

Description

This is a first step in making it easier for users to leverage known best LLM API configs for popular models. The PR makes a few main changes:

  1. Currently, the model deployment guides instruct users to write LLM API options to a file and then pass it to trtllm-serve. This change moves all of these configs into a dedicated examples/configs directory, which is available automatically in the TRTLLM container.
  2. The deployment guides used a mix of trtllm-serve CLI options and LLM API options for configuration; this change standardizes on keeping all options within the config files.
  3. The main change is the addition of a Quick Start for Popular Models table within the Quick Start Guide that contains one-line trtllm-serve commands to deploy popular models, including DeepSeek-R1 (DSR1), gpt-oss, and others.
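
For example, the table's gpt-oss-120b (max-throughput) entry boils down to this single command, taken verbatim from the quick start table added in this PR:

```shell
trtllm-serve openai/gpt-oss-120b --extra_llm_api_options /app/tensorrt_llm/examples/configs/gpt-oss-120b-throughput.yaml
```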

Subsequent changes will aim to streamline this even further, including:

  1. Improving the configs to account for a broader range of ISL/OSL scenarios; and
  2. Adding new logic to trtllm-serve to automatically leverage these configs when possible

The table looks like this in the rendered docs:

(Screenshot of the rendered “Quick Start for Popular Models” table.)

Summary by CodeRabbit

  • New Features

    • Added ready-to-use YAML configs for DeepSeek-R1 (latency/throughput/DeepGEMM), GPT-OSS 120B (latency/throughput), Llama 3.3 70B, Llama4 Scout, and Qwen3 (incl. disaggregated prefill).
    • Introduced examples/configs README and a “Quick Start for Popular Models” table in the quick-start guide.
  • Documentation

    • Migrated “Quick Start Recipes” to “Deployment Guides” with YAML-based configuration, simplified launch commands, updated paths/tags, and clarified option names.
    • Minor copy edits and terminology fixes.
  • Chores

    • Automated version placeholder updates across all Markdown docs.

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline, ensuring that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
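
For example, a run that targets a single stage, disables fail-fast, and flushes full logs (stage name taken from the examples above, purely illustrative) could be issued as:

```
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast --detailed-log
```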

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@anish-shanbhag
Author

/bot run

@anish-shanbhag anish-shanbhag marked this pull request as ready for review October 8, 2025 22:13
@anish-shanbhag anish-shanbhag requested review from a team as code owners October 8, 2025 22:13
Contributor

coderabbitai bot commented Oct 8, 2025

📝 Walkthrough

Documentation reorganized to replace quick-starts with deployment guides, standardize YAML-based configuration references, and simplify launch commands. Added multiple example config YAMLs and updated examples to use them. Implemented a helper to auto-update version placeholders across all Markdown files. Minor textual fixes in a benchmark doc.

Changes

| Cohort / File(s) | Summary of changes |
|---|---|
| **Deployment guides restructuring**<br>`docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md`, `.../deployment-guide-for-gpt-oss-on-trtllm.md`, `.../deployment-guide-for-llama3.3-70b-on-trtllm.md`, `.../deployment-guide-for-llama4-scout-on-trtllm.md` | Converted “Quick Start” docs into “Deployment Guide” format, switched to YAML-based LLM API options, updated image tags to placeholder `x.y.z`, simplified launch commands, renamed options (e.g., `tp`/`ep` -> `tensor_parallel_size`/`moe_expert_parallel_size`), added “Recommended Performance Settings.” |
| **Deployment index updates**<br>`docs/source/deployment-guide/index.rst` | Replaced quick-start toctree entries with new deployment-guide references. |
| **Quick start overhaul**<br>`docs/source/quick-start-guide.md` | Renamed sections, clarified Docker usage, added “Quick Start for Popular Models” with model table and commands, updated examples to reference prebuilt YAML configs. |
| **Serve benchmark doc tweaks**<br>`docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md` | Minor wording, hyphenation, and capitalization fixes; no functional changes. |
| **Qwen examples: config externalization**<br>`examples/models/core/qwen/README.md` | Replaced inline here-doc configs with references to YAML files via `--extra_llm_api_options`; added “Recommended Performance Settings” and environment variable usage. |
| **New model config YAMLs (DeepSeek R1)**<br>`examples/configs/deepseek-r1-deepgemm.yaml`, `.../deepseek-r1-latency.yaml`, `.../deepseek-r1-throughput.yaml` | Added backend, batching, KV cache, parallelism, CUDA graph, speculative decoding, and MOE settings for DeepSeek R1 profiles. |
| **New model config YAMLs (GPT-OSS 120B)**<br>`examples/configs/gpt-oss-120b-latency.yaml`, `.../gpt-oss-120b-throughput.yaml` | Added latency/throughput profiles with TP/EP, KV cache fraction, CUDA graph, MOE backend, attention DP tuning, streaming and workers settings. |
| **New model config YAMLs (Llama)**<br>`examples/configs/llama-3.3-70b.yaml`, `.../llama-4-scout.yaml` | Added configs with backend, batch/token limits, KV cache fp8, TP/EP sizing, CUDA graph padding. |
| **New model config YAMLs (Qwen3)**<br>`examples/configs/qwen3.yaml`, `.../qwen3-disagg-prefill.yaml` | Added PyTorch configs with batch/token limits, KV cache fraction, attention DP, logging, and CUDA graph batch sizes (qwen3). |
| **Configs README**<br>`examples/configs/README.md` | New README describing recommended LLM API configs and how to use them with trtllm-serve. |
| **Docs helper enhancement**<br>`docs/source/helper.py` | `update_version` now discovers all Markdown files, reads `tensorrt_llm/__version__`, and replaces placeholders across docs via glob iteration; added docstring and Path-based paths. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Dev as Developer
  participant Helper as docs/source/helper.py
  participant FS as Docs Markdown (*.md)
  participant Ver as tensorrt_llm/version.py

  Dev->>Helper: run update_version()
  Helper->>Ver: load __version__
  Ver-->>Helper: return version (e.g., x.y.z)
  Helper->>FS: glob all Markdown files
  loop For each Markdown file
    Helper->>FS: read content
    Helper->>Helper: replace version placeholders/URLs
    Helper->>FS: write updated content
  end
  Helper-->>Dev: completed updates
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title clearly follows the required ticket and type format and succinctly summarizes the primary change of adding a one-line deployment commands table to the Quick Start guide. |
| Description Check | ✅ Passed | The pull request description includes distinct Description, Test Coverage, and PR Checklist sections consistent with the repository template, and it provides clear explanations of the changes, test considerations, and checklist items. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (7)
docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)

8-8: Align “OpenAI-compatible” capitalization.

Other occurrences in this doc (and surrounding docs) use “OpenAI-compatible” with a lowercase “c”. Keeping this bullet consistent avoids jitter in the rendered text.

Apply this diff:

```diff
- * Launch the OpenAI-Compatible Server with NGC container
+ * Launch the OpenAI-compatible server with NGC container
```
examples/configs/qwen3-disagg-prefill.yaml (1)

1-9: LGTM! Configuration appropriate for disaggregated prefill scenario.

The configuration is valid. Note that this file shares ~90% of settings with qwen3.yaml (same max_batch_size, max_num_tokens, kv_cache settings, etc.), differing mainly in trust_remote_code: true and the absence of explicit CUDA graph batch sizes.

Consider whether these configs could leverage YAML anchors/aliases or a shared base config to reduce duplication while maintaining clarity. However, keeping them separate may be preferable for documentation clarity and ease of use.

examples/configs/llama-4-scout.yaml (1)

1-13: LGTM! Configuration is correct.

This configuration is identical to llama-3.3-70b.yaml. While this duplication may be intentional for model-specific discoverability and ease of use, consider whether a shared base configuration or YAML anchors could reduce maintenance overhead.

If Llama-4 Scout and Llama-3.3-70B share identical deployment characteristics, you could:

  • Use a shared base config with model-specific overrides, or
  • Add a comment explaining why the configs are identical despite different model names

However, keeping them separate may be preferable for user experience and documentation clarity.
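
As a rough sketch of the anchors idea (key names and values below are illustrative, anchors only resolve within a single YAML document, and it is not confirmed that the trtllm-serve config loader honors merge keys):

```yaml
# Hypothetical combined file; not the layout actually used in examples/configs.
defaults: &defaults
  backend: pytorch
  max_batch_size: 256          # illustrative value
  max_num_tokens: 8192         # illustrative value

llama_3_3_70b:
  <<: *defaults                # YAML merge key pulls in the shared settings

llama_4_scout:
  <<: *defaults
  # model-specific overrides would go here
```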

docs/source/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md (1)

89-96: Clarify the YAML key name.

The heading shows backend pytorch without a colon, but the YAML option is backend: pytorch. Please add the colon (or separate the value) so readers copy the correct key/value form.

docs/source/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md (1)

91-97: Add the colon to the backend key.

Like the other guides, this heading should read backend: pytorch so the YAML syntax is accurate. Please update the inline code snippet accordingly.

docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md (1)

105-112: Fix the backend YAML notation.

Please change the inline code to backend: pytorch so the example matches valid YAML syntax.

docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md (1)

102-110: Use proper YAML syntax for backend.

The inline code should read backend: pytorch; without the colon it’s misleading. Please update the heading to show the correct key/value form.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9298f1b and db56451.

📒 Files selected for processing (19)
  • docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1 hunks)
  • docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md (3 hunks)
  • docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md (3 hunks)
  • docs/source/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md (3 hunks)
  • docs/source/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md (3 hunks)
  • docs/source/deployment-guide/index.rst (1 hunks)
  • docs/source/helper.py (1 hunks)
  • docs/source/quick-start-guide.md (4 hunks)
  • examples/configs/README.md (1 hunks)
  • examples/configs/deepseek-r1-deepgemm.yaml (1 hunks)
  • examples/configs/deepseek-r1-latency.yaml (1 hunks)
  • examples/configs/deepseek-r1-throughput.yaml (1 hunks)
  • examples/configs/gpt-oss-120b-latency.yaml (1 hunks)
  • examples/configs/gpt-oss-120b-throughput.yaml (1 hunks)
  • examples/configs/llama-3.3-70b.yaml (1 hunks)
  • examples/configs/llama-4-scout.yaml (1 hunks)
  • examples/configs/qwen3-disagg-prefill.yaml (1 hunks)
  • examples/configs/qwen3.yaml (1 hunks)
  • examples/models/core/qwen/README.md (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • docs/source/helper.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • docs/source/helper.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • docs/source/helper.py
🔇 Additional comments (9)
examples/configs/deepseek-r1-latency.yaml (1)

1-15: Latency profile config looks solid.

Coverage of TP, EP, KV cache, and relaxed MTP params matches the intended low-latency tuning. Thanks for keeping it compact.

examples/configs/deepseek-r1-throughput.yaml (1)

1-17: Throughput configuration LGTM.

High-batch capture, fp8 KV cache, and MTP params are coherently set for throughput workloads.

examples/configs/deepseek-r1-deepgemm.yaml (1)

1-20: DeepGEMM profile matches expectations.

Extends the throughput template with the DEEPGEMM MOE backend while keeping capture/KV/MTP settings aligned—looks good end to end.

examples/configs/llama-3.3-70b.yaml (1)

1-13: LGTM! Configuration looks appropriate for Llama-3.3-70b.

The settings are well-balanced for a 70B model deployment with single-GPU configuration (TP=1) and fp8 KV cache optimization.

examples/configs/README.md (1)

1-5: LGTM! Clear and helpful documentation.

The README effectively explains the purpose of the config files and how to use them with trtllm-serve.

docs/source/deployment-guide/index.rst (1)

9-12: LGTM! Straightforward rename aligning with the new documentation structure.

The change from "quick-start-recipe" to "deployment-guide" improves clarity and consistency.

examples/configs/qwen3.yaml (1)

1-21: LGTM! Well-configured for Qwen3 with granular CUDA graph batch sizes.

The explicit batch_sizes list in cuda_graph_config enables optimized graph caching for common batch sizes.

examples/configs/gpt-oss-120b-latency.yaml (1)

1-15: LGTM! Well-tuned latency configuration for GPT-OSS 120B.

The high parallelism settings (TP=8, EP=8) and latency-focused parameters (stream_interval: 20, num_postprocess_workers: 4) are appropriate for a large MoE model deployment.

docs/source/helper.py (1)

348-370: Confirm targeted version replacement. 6 of 120 Markdown files contain the placeholder and will be updated; scanning all .md files has negligible performance impact. Verify that no additional files need this update and that no exclusions are necessary.

Comment on lines +111 to +112
| [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | Any | Max Throughput | [gpt-oss-120b-throughput.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-throughput.yaml) | `trtllm-serve openai/gpt-oss-120b --extra_llm_api_options /app/tensorrt_llm/examples/configs/gpt-oss-120b-throughput.yaml` |
| [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | Any | Min Latency | [gpt-oss-120b-latency.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-latency.yaml) | `trtllm-serve openai/gpt-oss-120b --extra_llm_api_options /app/tensorrt_llm/examples/configs/gpt-oss-120b-latency.yaml` |
Contributor


⚠️ Potential issue | 🟠 Major

Correct GPU requirements for GPT-OSS 120B.

Labeling the GPU requirement as “Any” is incorrect. The referenced config (gpt-oss-120b-throughput.yaml) sets tensor_parallel_size: 8 and moe_expert_parallel_size: 8, which assumes clustered Blackwell-class GPUs (e.g., B200/GB200) with sufficient memory and interconnect. Please update the table to list the actual supported GPU SKUs and parallelism expectations so users don’t attempt this on unsupported hardware.

🤖 Prompt for AI Agents
docs/source/quick-start-guide.md around lines 111-112: the table incorrectly
lists the GPU requirement for gpt-oss-120b as "Any"; update the GPU column and
optionally an adjacent notes column to reflect the actual supported SKUs and
required parallelism (e.g., "Blackwell-class GPUs (B200/GB200) with
NVLink/High-speed interconnect; tensor_parallel_size: 8,
moe_expert_parallel_size: 8") and ensure the throughput/latency rows reference
these requirements so users know the memory and networking expectations before
attempting to run the provided YAML configs.

```shell
backend: DEEPGEMM
max_num_tokens: 3200
EOF
EXTRA_LLM_API_FILE=/app/tensorrt_llm/examples/configs/deepseek-r1-deepgemm.yaml
```
Collaborator

@venkywonka venkywonka Oct 9, 2025


Maybe better to parametrize as TRTLLM_ROOT or CODE_DIR instead of /app/tensorrt_llm? The exact root is user/container-specific, right?
I know that pulling the official docker scripts will mount tensorrt_llm in /code/tensorrt_llm, but I personally put it in $HOME/tensorrt_llm in my dev workflow.

Author


Just to confirm, TRTLLM_ROOT / CODE_DIR would still have to be defined manually right?

Collaborator


yepyep
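
A minimal sketch of that suggestion, assuming the user manually exports the variable to match wherever the repo is mounted (the config path is the one shipped in the container; the variable name follows the comment above):

```shell
# TRTLLM_ROOT must be set manually to your mount point,
# e.g. /app/tensorrt_llm, /code/tensorrt_llm, or $HOME/tensorrt_llm
export TRTLLM_ROOT=/app/tensorrt_llm
EXTRA_LLM_API_FILE=${TRTLLM_ROOT}/examples/configs/deepseek-r1-deepgemm.yaml
```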

@venkywonka
Collaborator

@anish-shanbhag Thank you for elevating the other parts of the documentation by fixing typos and issues even if you didn't have to 😄

We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out of the box or adjust them to your specific use case.

```shell
EXTRA_LLM_API_FILE=/app/tensorrt_llm/examples/configs/qwen3.yaml
```
Collaborator


@nv-guomingz in the TRT-LLM NGC container, will the newly added .yaml files be part of it?

Collaborator


yes, those newly added .yaml files will be part of it, please refer to this.

@juney-nvidia juney-nvidia requested review from litaotju and nv-guomingz and removed request for Wanli-Jiang and Shixiaowei02 October 10, 2025 00:33
@juney-nvidia
Collaborator

@anish-shanbhag

Hi Anish, thanks for submitting the PR.

@litaotju @nv-guomingz Hi Tao/Guoming, this PR touches files that you are familiar with; can you also help review it?

Thanks
June



## Launch Docker on a node with NVIDIA GPUs deployed
## Launch Docker Container
Collaborator


I believe the original version is much simpler.

The principle for the quick start guide is to make it as simple as possible.

You can also directly load pre-quantized models [quantized checkpoints on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) in the LLM constructor.
To learn more about the LLM API, check out the [](llm-api/index) and [](examples/llm_api_examples).

## Quick Start for Popular Models
Collaborator


I prefer to move this part into the deployment-guide section.

Collaborator


@laikhtewari could you please comment on this change to quick-start-guide.md?

```python
os.path.join(os.path.dirname(__file__),
"../../tensorrt_llm/version.py"))
"""Replace the placeholder container version in all docs source files."""
version_path = (Path(__file__).parent.parent.parent / "tensorrt_llm" /
```
Collaborator


  • @litaotju, this line change will update the deployment guide folder from the fixed version 1.0.0rc6 to the latest release version, e.g., 1.20.0rc1. Is this the expected behavior?
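
For context, a minimal sketch of what a glob-based update_version could look like (the placeholder string, file layout, and version parsing are assumptions; the actual docs/source/helper.py may differ):

```python
import re
from pathlib import Path


def update_version():
    """Replace the placeholder container version in all docs source files."""
    repo_root = Path(__file__).parent.parent.parent          # docs/source/ -> repo root
    version_file = repo_root / "tensorrt_llm" / "version.py"
    version = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']',
                        version_file.read_text()).group(1)
    for md_file in (repo_root / "docs" / "source").glob("**/*.md"):
        text = md_file.read_text()
        updated = text.replace("x.y.z", version)              # assumed placeholder
        if updated != text:
            md_file.write_text(updated)
```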

```shell
--ep_size 1 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
Collaborator

@nv-guomingz nv-guomingz Oct 10, 2025

