
Conversation

@mikeiovine (Collaborator) commented Dec 12, 2025

Description

PRs excluded:

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Added support for new hardware: NVIDIA Blackwell SM103, B300, and GB300.
    • New model support: GPT-OSS, Hunyuan-Dense, Hunyuan-MoE, Seed-OSS, and Qwen3 variants.
    • Disaggregated serving improvements and speculative decoding enhancements.
  • Documentation

    • Comprehensive benchmarking guides reorganized and expanded.
    • Performance overview updated with new benchmark results and model support tables.
    • Release notes for v1.1 added with features, API changes, and infrastructure updates.
  • Infrastructure

    • Updated dependency management and Docker build configurations.


@mikeiovine (Collaborator, Author)

/bot run --disable-fail-fast

@mikeiovine mikeiovine changed the title [None][chore] Finale mass integration of release/1.1 [None][chore] Final mass integration of release/1.1 Dec 12, 2025
@coderabbitai bot (Contributor) commented Dec 12, 2025

📝 Walkthrough

Adds dependency installation from constraints.txt to Docker build stages, updates benchmarking and performance documentation with new sections and reorganization, adds Blackwell GPU support details across documentation, introduces TensorRT-LLM Release 1.1 notes, modifies MOE post-quantization logic gating, removes redundant logging, and updates test skip configuration.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Docker Build Stages**<br>`docker/Dockerfile.multi` | Adds steps to copy and install dependencies from constraints.txt at two points in the devel build stage: before triton and after OSS attribution, then removes the file. |
| **Benchmarking Documentation**<br>`docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md`, `docs/source/developer-guide/perf-benchmarking.md` | Reorganizes with a new Table of Contents, restructures sections with heading-level adjustments, adds an Online Serving Benchmarking subsection, and updates cross-references and command naming conventions. |
| **Performance & Model Documentation**<br>`docs/source/developer-guide/perf-overview.md`, `docs/source/features/quantization.md`, `docs/source/legacy/reference/support-matrix.md`, `docs/source/models/supported-models.md`, `docs/source/overview.md`, `docs/source/examples/dynamo_k8s_example.rst` | Updates hardware support matrices to include Blackwell SM100/SM103 variants, adds a GB300 NVL72 entry, corrects formatting in GPU listings, simplifies the Dynamo Cloud deployment description, and updates footnote references for feature support. |
| **Release Notes**<br>`docs/source/release-notes.md` | Adds comprehensive TensorRT-LLM Release 1.1 content covering model support, features, benchmarks, infrastructure changes, API changes, and bug fixes; makes minor formatting adjustments to the existing 1.0 notes. |
| **MOE Quantization Logic**<br>`tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` | Introduces a `run_post_quant_allgather` flag, computed from `self.use_dp` and `self.parallel_size > 1`, to gate routing, quantization, and padding behavior in the FP8 and W4A16 MXFP4 paths (see the illustrative sketch after this table). |
| **Executor Logging Cleanup**<br>`tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` | Removes a redundant debug log line that duplicated LLM arguments logging. |
| **Test Configuration**<br>`tests/integration/test_lists/waives.txt` | Updates the test skip list to replace the DeepSeekR1 throughput_mtp_trtllm variant with the throughput_mtp variant. |
| **Example Documentation**<br>`examples/auto_deploy/README.md` | Updates the Mixed-precision Quantization section to reference "TensorRT Model Optimizer" instead of "Model Optimizer". |
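
To make the MOE gating row above concrete, here is a minimal, self-contained sketch. It is an illustration only, not the TensorRT-LLM implementation: `MoeParallelState` and `should_run_post_quant_allgather` are invented names; in the real module the flag is derived from `self.use_dp` and `self.parallel_size` and gates the FP8 and W4A16 MXFP4 paths.

```python
# Illustrative sketch of the gating described above -- NOT the actual
# fused_moe_trtllm_gen.py implementation; the names below are invented.
from dataclasses import dataclass


@dataclass
class MoeParallelState:
    use_dp: bool        # attention data parallelism enabled
    parallel_size: int  # number of ranks participating in the MoE


def should_run_post_quant_allgather(state: MoeParallelState) -> bool:
    """Gather quantized activations and routing inputs across ranks only when
    data parallelism is enabled and there is more than one rank to gather from."""
    return state.use_dp and state.parallel_size > 1


if __name__ == "__main__":
    for state in (MoeParallelState(use_dp=True, parallel_size=4),
                  MoeParallelState(use_dp=True, parallel_size=1),
                  MoeParallelState(use_dp=False, parallel_size=4)):
        print(state, "->", should_run_post_quant_allgather(state))
```

When such a flag is true, routing logits and the locally quantized (and padded) activations would be gathered across ranks before the MoE kernel runs; otherwise everything stays local to the rank.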

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Specific areas requiring attention:
    • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py: Logic changes introducing post-quantization allgather gating require verification of correctness across FP8 and W4A16 paths
    • docs/source/release-notes.md: Large documentation addition (Release 1.1) should be reviewed for accuracy and completeness of listed features, API changes, and known issues
    • docs/source/developer-guide/perf-overview.md: Extensive reorganization with updated benchmark commands and model listings should be validated for consistency

Possibly related PRs

Suggested reviewers

  • Shixiaowei02

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | PR description lacks a clear explanation of what changes are included and why; it only lists excluded PRs without describing the actual integration scope. | Add a detailed description of the key changes being integrated from release/1.1 into main, including categories of changes (features, fixes, docs, infrastructure) and the rationale for the mass integration. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title '[None][chore] Final mass integration of release/1.1' clearly describes the main purpose of this PR: integrating release/1.1 changes into main. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd (Collaborator)

PR_Github #28065 [ run ] triggered by Bot. Commit: 81f6493

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 11

🧹 Nitpick comments (3)
docker/Dockerfile.multi (1)

94-97: Duplicate dependency install: please justify or dedupe.
Running the same pip3 install -r /tmp/constraints.txt twice likely adds build time and risk; if the second block is required, add a brief comment explaining why, otherwise drop it.

docs/source/developer-guide/perf-benchmarking.md (1)

19-37: ToC addition is helpful; consider trimming now-redundant guidance line if you touch this again.
No blocker—just note the doc now points to the serve benchmarking guide multiple times (fine, but a bit repetitive).

docs/source/release-notes.md (1)

7-60: Please align tense/labels and tighten wording for release-note readability/consistency

A few small consistency nits in the new 1.1 section:

  • Mixed tense: “Add … support” vs “Added … support” (the rest of the file mostly uses past tense).
  • Heading label: - **Benchmark** vs elsewhere - Benchmark: / “Benchmarks”.

Suggested small editorial diff (example—apply consistently across the 1.1 bullets):

```diff
-### Key Features and Enhancements
+### Key Features and Enhancements

 - **Model Support**
-  - Add GPT-OSS model support.
-  - Add Hunyuan-Dense model support. Thanks to the contribution from @sorenwu.
-  - Add Hunyuan-MoE model support. Thanks to the contribution from @qianbiaoxiang.
-  - Add Seed-OSS model support. Thanks to the contribution from @Nekofish-L.
+  - Added GPT-OSS model support.
+  - Added Hunyuan-Dense model support. Thanks to the contribution from @sorenwu.
+  - Added Hunyuan-MoE model support. Thanks to the contribution from @qianbiaoxiang.
+  - Added Seed-OSS model support. Thanks to the contribution from @Nekofish-L.

-- **Benchmark**
+- **Benchmarks**
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 246a877 and 81f6493.

📒 Files selected for processing (14)
  • docker/Dockerfile.multi (1 hunks)
  • docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (3 hunks)
  • docs/source/developer-guide/perf-benchmarking.md (3 hunks)
  • docs/source/developer-guide/perf-overview.md (6 hunks)
  • docs/source/examples/dynamo_k8s_example.rst (1 hunks)
  • docs/source/features/quantization.md (1 hunks)
  • docs/source/legacy/reference/support-matrix.md (2 hunks)
  • docs/source/models/supported-models.md (1 hunks)
  • docs/source/overview.md (1 hunks)
  • docs/source/release-notes.md (7 hunks)
  • examples/auto_deploy/README.md (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (0 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tests/integration/test_lists/waives.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
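
A minimal sketch illustrating several of the conventions above (invented names, not TensorRT-LLM code):

```python
# Tiny, self-contained example of the guidelines: namespaced imports,
# PascalCase class names, snake_case functions, Google-style docstrings,
# inline attribute documentation, and a small try/else block for duck-typing.
from collections import abc


class TokenCounter:
    """Counts whitespace-separated tokens from strings or iterables of strings.

    Attributes:
        total_tokens: Number of tokens counted so far.
    """

    def __init__(self) -> None:
        self.total_tokens = 0
        """int: Number of tokens counted so far."""

    def count_tokens(self, data) -> int:
        """Counts tokens in ``data``.

        Args:
            data: A string, or an iterable of strings (duck-typed).

        Returns:
            The number of tokens found in ``data``.
        """
        try:
            split = data.split  # keep the try body as small as possible
        except AttributeError:
            tokens = [tok for chunk in data for tok in chunk.split()]
        else:
            tokens = split()  # the actual logic lives in the else block
        self.total_tokens += len(tokens)
        return len(tokens)


def is_iterable_input(data) -> bool:
    """Checks iterability via the namespaced import (``abc.Iterable``)."""
    return isinstance(data, abc.Iterable)


if __name__ == "__main__":
    counter = TokenCounter()
    print(counter.count_tokens("a b c"), counter.count_tokens(["d e", "f"]))
    print(is_iterable_input("abc"), is_iterable_input(42))
```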

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
🧠 Learnings (38)
📓 Common learnings
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
  • docs/source/features/quantization.md
  • docs/source/overview.md
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
  • docs/source/overview.md
  • docs/source/developer-guide/perf-overview.md
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
  • docs/source/overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
  • examples/auto_deploy/README.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md
  • docs/source/overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
  • docs/source/features/quantization.md
  • docs/source/overview.md
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • docs/source/legacy/reference/support-matrix.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md
  • docs/source/developer-guide/perf-overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md
  • docs/source/developer-guide/perf-overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/developer-guide/perf-overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-11-27T09:23:18.742Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 9511
File: tests/integration/defs/examples/serve/test_serve.py:136-186
Timestamp: 2025-11-27T09:23:18.742Z
Learning: In TensorRT-LLM testing, when adding test cases based on RCCA commands, the command format should be copied exactly as it appears in the RCCA case, even if it differs from existing tests. For example, some RCCA commands for trtllm-serve may omit the "serve" subcommand while others include it.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md
  • docs/source/developer-guide/perf-overview.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/developer-guide/perf-overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/developer-guide/perf-overview.md
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/developer-guide/perf-overview.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/developer-guide/perf-overview.md
  • docs/source/release-notes.md
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/developer-guide/perf-overview.md
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Applied to files:

  • docs/source/developer-guide/perf-benchmarking.md
📚 Learning: 2025-08-20T15:04:42.885Z
Learnt from: dbari
Repo: NVIDIA/TensorRT-LLM PR: 7095
File: docker/Dockerfile.multi:168-168
Timestamp: 2025-08-20T15:04:42.885Z
Learning: In docker/Dockerfile.multi, wildcard COPY for benchmarks (${CPP_BUILD_DIR}/benchmarks/*Benchmark) is intentionally used instead of directory copy because the benchmarks directory contains various other build artifacts during C++ builds, and only specific benchmark executables should be copied to the final image.

Applied to files:

  • docker/Dockerfile.multi
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • docs/source/release-notes.md
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • docs/source/release-notes.md
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-09-18T05:41:45.847Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7120
File: tensorrt_llm/llmapi/llm.py:690-697
Timestamp: 2025-09-18T05:41:45.847Z
Learning: Kimi model support is currently focused on the PyTorch backend path, with TRT path support potentially coming later.

Applied to files:

  • docs/source/overview.md
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+

Applied to files:

  • docs/source/overview.md
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • docs/source/developer-guide/perf-overview.md
📚 Learning: 2025-09-09T18:31:44.336Z
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 7658
File: .github/CODEOWNERS:160-164
Timestamp: 2025-09-09T18:31:44.336Z
Learning: The ruleset for `release/**` branch patterns in the NVIDIA/TensorRT-LLM repository covers NIM-specific release branches like `release/1.0.1-NIM`, ensuring proper code ownership enforcement.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-10-17T13:21:31.724Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • docs/source/release-notes.md
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (3)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (681-684)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2)
  • fp8_block_scale_moe_runner (677-756)
  • bf16_mxe2m1_block_scale_moe_runner (1517-1610)
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py (1)
  • w3_w1_weight (1103-1105)
🪛 LanguageTool
docs/source/developer-guide/perf-benchmarking.md

[grammar] ~12-~12: Use a hyphen to join words.
Context: ...m-serve` command, which starts an OpenAI compatible server that supports the foll...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md

23-23: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)


24-24: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)


25-25: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)


26-26: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)

docs/source/developer-guide/perf-overview.md

156-156: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


207-207: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


241-241: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (13)
examples/auto_deploy/README.md (1)

93-99: Terminology update looks good.

The change from generic "Model Optimizer" to "TensorRT Model Optimizer" aligns with the PR's documentation standardization effort. The hyperlink remains valid and the AutoQuantize explanation is accurate and clear.

docs/source/legacy/reference/support-matrix.md (2)

133-142: Verify GB300 NVL72 naming + link stability (avoid stale marketing URLs).
Since this is a legacy support matrix, please double-check the “GB300 NVL72” product name and that the URL is the intended long-lived reference (marketing URLs sometimes move).


159-165: Confirm SM103 inclusion and scope for Blackwell precision list.
The addition of SM103 expands the documented precision support surface; please verify it matches the actual supported SMs for Blackwell in TensorRT-LLM 1.1 (and aligns with other docs in this PR).

docs/source/models/supported-models.md (1)

44-45: Please validate the SM eligibility notes match actual feature gating.
These footnotes expand hardware eligibility (SM103); make sure this matches the implementation (and any existing constraints in release/1.1).

docs/source/examples/dynamo_k8s_example.rst (1)

4-7: Verify the Dynamo Quick Start URL is a stable permalink.
Given docs paths can change, please confirm this exact URL is correct/current and prefer the most stable entrypoint if NVIDIA docs provide one.

docker/Dockerfile.multi (1)

74-77: constraints.txt installed via -r: confirm this is intended semantics.
If this file is truly a constraints file, consider switching to pip install -c /tmp/constraints.txt -r requirements.txt (or rename to requirements-*.txt if it’s the definitive pinned set).

docs/source/overview.md (1)

56-60: Please verify product naming + FP4 optimization claim scope.
Ensure the Blackwell bullet (B300/GB300/RTX Pro 6000 SE FP4) matches what TRT-LLM 1.1 actually supports and what you’re comfortable claiming publicly.

docs/source/features/quantization.md (2)

121-130: Confirm SM103 quantization capability claims in the hardware matrix.
Adding Blackwell(sm100/103) broadens stated support—please verify it matches actual kernel availability and any internal gating.


131-133: Verify the MXFP8 recipe note is technically precise for sm100/103.
This is a low-level statement about recipes/scales; worth double-checking against the implementation/docs you’re sourcing from.

docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)

38-41: Doc restructure reads well; please keep container tags pointing to published NGC images.
The new “Preparation / Start / Benchmark” layout is clearer. Just ensure any non-placeholder container tags used elsewhere in this doc (or adjacent docs) only reference actually published NGC tags. Based on learnings, container tag versions in docs can get ahead of what’s published.

Also applies to: 51-56, 101-104

docs/source/developer-guide/perf-benchmarking.md (1)

444-445: Cross-linking to ModelOpt + serve benchmarking guide looks good.
The updated ModelOpt link and the new Online Serving Benchmarking pointer improve discoverability.

Also applies to: 489-493

docs/source/release-notes.md (2)

119-120: Spot-check these single-line edits for unintended doc drift vs release/1.1

These look like small touch-ups, but given this is a mass-integration PR, please sanity-check they exactly match release/1.1 intent (no accidental cherry-pick drift):

  • Line 119-120 (LoRA bullet)
  • Line 158-159 (KV cache reuse for multimodal)
  • Line 217-218 (API removal bullets)
  • Line 276-277 (known issue)

Suggested quick “compare to release branch” action: confirm these exact lines match release/1.1 (or are intentional main-only deltas).

Also applies to: 158-159, 217-218, 276-277


54-61: NGC tags and dependency versions are accurate and published.

Verification confirms both NGC base images (nvcr.io/nvidia/pytorch:25.10-py3 and nvcr.io/nvidia/tritonserver:25.10-py3) are published. All Python dependencies match requirements.txt: PyTorch 2.9.0, ModelOpt 0.37, transformers 4.56.0, and xgrammar 0.1.25 are correctly pinned. NIXL 0.5.0 is a system-level dependency handled via build configuration rather than Python requirements.

@tensorrt-cicd (Collaborator)

PR_Github #28065 [ run ] completed with state SUCCESS. Commit: 81f6493
/LLM/main/L0_MergeRequest_PR pipeline #21437 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author)

/bot run --disable-fail-fast

@mikeiovine (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28113 [ run ] triggered by Bot. Commit: dbbd89f

@tensorrt-cicd (Collaborator)

PR_Github #28114 [ run ] triggered by Bot. Commit: dbbd89f

@tensorrt-cicd (Collaborator)

PR_Github #28113 [ run ] completed with state ABORTED. Commit: dbbd89f

@tensorrt-cicd (Collaborator)

PR_Github #28114 [ run ] completed with state FAILURE. Commit: dbbd89f
/LLM/main/L0_MergeRequest_PR pipeline #21473 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28419 [ run ] triggered by Bot. Commit: a7a2d80

@tensorrt-cicd (Collaborator)

PR_Github #28419 [ run ] completed with state SUCCESS. Commit: a7a2d80
/LLM/main/L0_MergeRequest_PR pipeline #21749 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28444 [ run ] triggered by Bot. Commit: a7a2d80

@tensorrt-cicd (Collaborator)

PR_Github #28444 [ run ] completed with state SUCCESS. Commit: a7a2d80
/LLM/main/L0_MergeRequest_PR pipeline #21773 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28453 [ run ] triggered by Bot. Commit: 8866243

@tensorrt-cicd (Collaborator)

PR_Github #28453 [ run ] completed with state FAILURE. Commit: 8866243
/LLM/main/L0_MergeRequest_PR pipeline #21782 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28508 [ run ] triggered by Bot. Commit: a92d96e

@tensorrt-cicd (Collaborator)

PR_Github #28508 [ run ] completed with state SUCCESS. Commit: a92d96e
/LLM/main/L0_MergeRequest_PR pipeline #21832 completed with status: 'FAILURE'

kaiyux and others added 8 commits December 16, 2025 10:47
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
…VIDIA#9206)

Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
…Release 1.1 (NVIDIA#9723)

Signed-off-by: Zachary Patel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Laikh Tewari <laikhtewari1@gmail.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
…9887)

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
@mikeiovine (Collaborator, Author)

/bot skip --comment "Docs only"

@mikeiovine (Collaborator, Author) commented Dec 16, 2025

There have been a few CI stability issues this week. In the interest of getting this in before I leave for the holidays, I've excluded the only code change (#8322) and asked the author @ChristinaZ to cherry-pick this into main separately.

Looks like it was already merged to main: #8008

@tensorrt-cicd (Collaborator)

PR_Github #28597 [ skip ] triggered by Bot. Commit: f865ed7

@tensorrt-cicd (Collaborator)

PR_Github #28597 [ skip ] completed with state SUCCESS. Commit: f865ed7
Skipping testing for commit f865ed7

@mikeiovine mikeiovine merged commit dba9036 into NVIDIA:main Dec 16, 2025
7 checks passed
@mikeiovine mikeiovine deleted the mass-integrate-1.1 branch December 16, 2025 18:33