4 changes: 2 additions & 2 deletions ATTRIBUTIONS-Python.md
@@ -25486,7 +25486,7 @@ limitations under the License.
```

### URLs
-- `Homepage`: https://github.com/NVIDIA/TensorRT-Model-Optimizer
+- `Homepage`: https://github.com/NVIDIA/Model-Optimizer


## nvidia-modelopt-core (0.33.1)
@@ -25513,7 +25513,7 @@ limitations under the License.
```

### URLs
-- `Homepage`: https://github.com/NVIDIA/TensorRT-Model-Optimizer
+- `Homepage`: https://github.com/NVIDIA/Model-Optimizer


## nvidia-nccl-cu12 (2.27.3)
4 changes: 2 additions & 2 deletions README.md
@@ -164,7 +164,7 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
[➡️ link](https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml)


-* [2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12
+* [2024/08/20] 🏎️SDXL with #Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12
[➡️ link](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/)

* [2024/08/13] 🐍 DIY Code Completion with #Mamba ⚡ #TensorRT #LLM for speed 🤖 NIM for ease ☁️ deploy anywhere
@@ -209,7 +209,7 @@ Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights
* [2024/05/21] ✨@modal_labs has the codes for serverless @AIatMeta Llama 3 on #TensorRT #LLM ✨👀 📚 Marvelous Modal Manual:
Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.com/docs/examples/trtllm_llama)

-* [2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
+* [2024/05/08] NVIDIA Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)

* [2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫Meta Llama 3 takes off with #TensorRT #LLM 📚[➡️ link](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/)

@@ -46,7 +46,7 @@ In this third blog of our scaling Expert Parallelism (EP) series, we push the pe

The wo GEMM is the final linear layer within the multi-head attention block that produces the attention output. While DeepSeek R1's MLA modifies the initial projections for keys and values, the wo GEMM operator remains a critical and standard component for finalizing the attention computation. In this term, "wo" abbreviates the weight matrix of the output projection.

-We've evaluated that quantizing the wo GEMM to FP4 still satisfies the accuracy requirements, maintaining a similar MTP accept rate (AR) while improving end-to-end performance. The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) team has published checkpoints that additionally quantize the wo module in attention layers to FP4 on HuggingFace:
+We've evaluated that quantizing the wo GEMM to FP4 still satisfies the accuracy requirements, maintaining a similar MTP accept rate (AR) while improving end-to-end performance. The [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) team has published checkpoints that additionally quantize the wo module in attention layers to FP4 on HuggingFace:
* https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2
* https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4-v2
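
A minimal sketch of loading one of these pre-quantized checkpoints through the TensorRT LLM `LLM` API, assuming a suitably sized multi-GPU deployment; the `tensor_parallel_size` value below is a placeholder, not a recommendation from this page:

```python
from tensorrt_llm import LLM

# Sketch: load the FP4 checkpoint that additionally quantizes the attention wo module.
# tensor_parallel_size is a placeholder; size it to the actual GPU deployment.
llm = LLM(model="nvidia/DeepSeek-R1-FP4-v2", tensor_parallel_size=8)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```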

@@ -67,7 +67,7 @@ We have explored a mixed precision recipe, which provides a better tradeoff betw

*TensorRT LLM already supports [FP8 Attention](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#fp8-kv-cache-and-mla); however, for this latency scenario low-precision attention computation doesn't help with performance, so we choose bf16 precision for the attention modules.

-** nvfp4 model checkpoint is generated by the [NVIDIA TensorRT Model Optimizer toolkit](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
+** nvfp4 model checkpoint is generated by the [NVIDIA Model Optimizer toolkit](https://github.com/NVIDIA/Model-Optimizer).

*** RouterGEMM uses bf16 inputs/weights with fp32 outputs for numerical stability

@@ -29,7 +29,7 @@ The mixed precision recipe for DeepSeek R1 throughput scenario is almost the sam
* FP8 KV cache and FP8 attention, rather than BF16 precision.
* FP4 Allgather for better communication bandwidth utilization.

-The checkpoint used in this blog is hosted in [nvidia/DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), generated by [NVIDIA Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). The accuracy score of common dataset on this FP4 checkpoint and TensorRT LLM implementations are:
+The checkpoint used in this blog is hosted in [nvidia/DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), generated by [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer). The accuracy score of common dataset on this FP4 checkpoint and TensorRT LLM implementations are:

| Precision | GPQA Diamond | MATH-500 |
| :-- | :-- | :-- |
4 changes: 2 additions & 2 deletions docs/source/developer-guide/perf-benchmarking.md
@@ -423,10 +423,10 @@ checkpoint. For the Llama-3.1 models, TensorRT LLM provides the following checkp
- [`nvidia/Llama-3.1-70B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8)
- [`nvidia/Llama-3.1-405B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8)

-To understand more about how to quantize your own checkpoints, refer to ModelOpt [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/deployment/1_tensorrt_llm.html).
+To understand more about how to quantize your own checkpoints, refer to ModelOpt [documentation](https://nvidia.github.io/Model-Optimizer/deployment/1_tensorrt_llm.html).

`trtllm-bench` utilizes the `hf_quant_config.json` file present in the pre-quantized checkpoints above. The configuration
-file is present in checkpoints quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
+file is present in checkpoints quantized with [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)
and describes the compute and KV cache quantization that checkpoint was compiled with. For example, from the checkpoints
above:

2 changes: 1 addition & 1 deletion docs/source/developer-guide/perf-overview.md
@@ -21,7 +21,7 @@ and shows the throughput scenario under maximum load. The reported metric is `To

The performance numbers below were collected using the steps described in this document.

-Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).
+Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).

*(NEW for v1.0) RTX 6000 Pro Blackwell Server Edition Benchmarks:*

2 changes: 1 addition & 1 deletion docs/source/features/auto_deploy/support_matrix.md
@@ -120,7 +120,7 @@ Optimize attention operations with different attention kernel implementations:

### Precision Support

-AutoDeploy supports models with various precision formats, including quantized checkpoints generated by [`TensorRT-Model-Optimizer`](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
+AutoDeploy supports models with various precision formats, including quantized checkpoints generated by [`Model-Optimizer`](https://github.com/NVIDIA/Model-Optimizer).

**Supported precision types include:**

8 changes: 4 additions & 4 deletions docs/source/features/quantization.md
@@ -23,7 +23,7 @@ The default PyTorch backend supports FP4 and FP8 quantization on the latest Blac

### Running Pre-quantized Models

-TensorRT LLM can directly run [pre-quantized models](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) generated with the [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
+TensorRT LLM can directly run [pre-quantized models](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) generated with the [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer).

```python
from tensorrt_llm import LLM
@@ -54,8 +54,8 @@ If a pre-quantized model is not available on the [Hugging Face Hub](https://hugg
Follow this step-by-step guide to quantize a model:

```bash
-git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
-cd TensorRT-Model-Optimizer/examples/llm_ptq
+git clone https://github.com/NVIDIA/Model-Optimizer.git
+cd Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
```
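
A minimal follow-up sketch, assuming the script above has exported an HF-format FP8 checkpoint to a local directory (the path below is a placeholder), which can then be loaded with the same `LLM` API:

```python
from tensorrt_llm import LLM

# Placeholder path: point this at the directory exported by huggingface_example.sh.
llm = LLM(model="./exported-fp8-checkpoint")

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```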

@@ -108,4 +108,4 @@ FP8 block wise scaling GEMM kernels for sm100 are using MXFP8 recipe (E4M3 act/w
## Quick Links

- [Pre-quantized Models by ModelOpt](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)
-- [ModelOpt Support Matrix](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/0_support_matrix.html)
+- [ModelOpt Support Matrix](https://nvidia.github.io/Model-Optimizer/guides/0_support_matrix.html)
2 changes: 1 addition & 1 deletion docs/source/legacy/performance/perf-benchmarking.md
@@ -662,7 +662,7 @@ checkpoint. For the Llama-3.1 models, TensorRT-LLM provides the following checkp
- [`nvidia/Llama-3.1-405B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8)

`trtllm-bench` utilizes the `hf_quant_config.json` file present in the pre-quantized checkpoints above. The configuration
-file is present in checkpoints quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
+file is present in checkpoints quantized with [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)
and describes the compute and KV cache quantization that checkpoint was compiled with. For example, from the checkpoints
above:

2 changes: 1 addition & 1 deletion docs/source/torch/auto_deploy/support_matrix.md
@@ -118,7 +118,7 @@ Optimize attention operations with different attention kernel implementations:

### Precision Support

-AutoDeploy supports models with various precision formats, including quantized checkpoints generated by [`TensorRT-Model-Optimizer`](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
+AutoDeploy supports models with various precision formats, including quantized checkpoints generated by [`Model-Optimizer`](https://github.com/NVIDIA/Model-Optimizer).

**Supported precision types include:**

6 changes: 3 additions & 3 deletions docs/source/torch/features/quantization.md
@@ -1,7 +1,7 @@
# Quantization

The PyTorch backend supports FP8 and NVFP4 quantization. You can pass quantized models hosted on the HF model hub,
-which are generated by [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
+which are generated by [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer).

```python
from tensorrt_llm._torch import LLM
@@ -12,7 +12,7 @@ llm.generate("Hello, my name is")
Or you can try the following commands to get a quantized model by yourself:

```bash
-git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
-cd TensorRT-Model-Optimizer/examples/llm_ptq
+git clone https://github.com/NVIDIA/Model-Optimizer.git
+cd Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
```
8 changes: 4 additions & 4 deletions examples/auto_deploy/README.md
@@ -90,16 +90,16 @@ python lm_eval_ad.py \
--model autodeploy --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,world_size=2 --tasks mmlu
```

-### Mixed-precision Quantization using TensorRT Model Optimizer
+### Mixed-precision Quantization using Model Optimizer

-TensorRT Model Optimizer [AutoQuantize](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) algorithm is a PTQ algorithm from ModelOpt which quantizes a model by searching for the best quantization format per-layer while meeting the performance constraint specified by the user. This way, `AutoQuantize` enables to trade-off model accuracy for performance.
+Model Optimizer [AutoQuantize](https://nvidia.github.io/Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) algorithm is a PTQ algorithm from ModelOpt which quantizes a model by searching for the best quantization format per-layer while meeting the performance constraint specified by the user. This way, `AutoQuantize` enables to trade-off model accuracy for performance.

Currently `AutoQuantize` supports only `effective_bits` as the performance constraint (for both weight-only quantization and weight & activation quantization). See
-[AutoQuantize documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) for more details.
+[AutoQuantize documentation](https://nvidia.github.io/Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) for more details.

#### 1. Quantize a model with ModelOpt

-Refer to [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_autodeploy/README.md) for generating quantized model checkpoint.
+Refer to [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_autodeploy/README.md) for generating quantized model checkpoint.

#### 2. Deploy the quantized model with AutoDeploy

2 changes: 1 addition & 1 deletion examples/disaggregated/README.md
@@ -212,7 +212,7 @@ In disaggregated serving, the context workers and generation workers have differ
### Prerequisites

To enable mixed precision serving, you will need:
-1. A quantized checkpoint created with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
+1. A quantized checkpoint created with [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)
2. The original unquantized checkpoint (Can also be quantized)
3. Both checkpoints must use the same KV cache dtype to ensure compatibility during transfer

4 changes: 2 additions & 2 deletions examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py
@@ -29,7 +29,7 @@ def run_medusa_decoding(use_modelopt_ckpt=False, model_dir=None):
llm_kwargs = {}

if use_modelopt_ckpt:
-# This is a Llama-3.1-8B combined with Medusa heads provided by TensorRT Model Optimizer.
+# This is a Llama-3.1-8B combined with Medusa heads provided by Model Optimizer.
# Both the base model (except lm_head) and Medusa heads have been quantized in FP8.
model = model_dir or "nvidia/Llama-3.1-8B-Medusa-FP8"

@@ -85,7 +85,7 @@ def run_medusa_decoding(use_modelopt_ckpt=False, model_dir=None):
parser.add_argument(
'--use_modelopt_ckpt',
action='store_true',
help="Use FP8-quantized checkpoint from TensorRT Model Optimizer.")
help="Use FP8-quantized checkpoint from Model Optimizer.")
# TODO: remove this arg after ModelOpt ckpt is public on HF
parser.add_argument('--model_dir', type=Path, default=None)
args = parser.parse_args()
2 changes: 1 addition & 1 deletion examples/llm-api/_tensorrt_engine/quickstart_example.py
@@ -9,7 +9,7 @@ def main():
build_config.max_num_tokens = 1024

# Model could accept HF model name, a path to local HF model,
-# or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
+# or Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
build_config=build_config)

2 changes: 1 addition & 1 deletion examples/llm-api/llm_inference.py
@@ -7,7 +7,7 @@
def main():

# Model could accept HF model name, a path to local HF model,
-# or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
+# or Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Sample prompts.
2 changes: 1 addition & 1 deletion examples/llm-api/quickstart_example.py
@@ -4,7 +4,7 @@
def main():

# Model could accept HF model name, a path to local HF model,
-# or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
+# or Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Sample prompts.
2 changes: 1 addition & 1 deletion examples/medusa/README.md
@@ -19,7 +19,7 @@ For more info about Medusa visit [speculative decoding documentation](https://nv
The TensorRT LLM Medusa example code is located in [`examples/medusa`](./). There is one [`convert_checkpoint.py`](./convert_checkpoint.py) file to convert and build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run models with Medusa decoding support.
In this example, we demonstrate the usage of two models:
1. The Vicuna 7B model from Hugging Face [`FasterDecoding/medusa-vicuna-7b-v1.3`](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3) with its Medusa heads [`medusa-vicuna-7b-v1.3`](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3).
-2. The quantized checkpoint [`nvidia/Llama-3.1-8B-Medusa-FP8`](https://huggingface.co/nvidia/Llama-3.1-8B-Medusa-FP8) on Hugging Face by [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt). This model is based on [Llama-3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and enhanced with Medusa heads, with both the base model (except lm_head) and Medusa heads already quantized in FP8.
+2. The quantized checkpoint [`nvidia/Llama-3.1-8B-Medusa-FP8`](https://huggingface.co/nvidia/Llama-3.1-8B-Medusa-FP8) on Hugging Face by [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) (ModelOpt). This model is based on [Llama-3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and enhanced with Medusa heads, with both the base model (except lm_head) and Medusa heads already quantized in FP8.

### Build TensorRT engine(s)
Get the weights by downloading base model [`vicuna-7b-v1.3`](https://huggingface.co/lmsys/vicuna-7b-v1.3) and Medusa Heads [`medusa-vicuna-7b-v1.3`](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3) from HF.
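
A minimal sketch of one way to fetch both repositories with `huggingface_hub`; the local directory names are assumptions:

```python
from huggingface_hub import snapshot_download

# Download the base model and its Medusa heads; local_dir names are illustrative.
snapshot_download("lmsys/vicuna-7b-v1.3", local_dir="vicuna-7b-v1.3")
snapshot_download("FasterDecoding/medusa-vicuna-7b-v1.3", local_dir="medusa-vicuna-7b-v1.3")
```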