31 changes: 30 additions & 1 deletion docs/mm/nemo_2/gen_nemo2_ckpt.md

To run the code examples, you will need a NeMo 2.0 checkpoint. Follow the steps below to generate a NeMo 2.0 checkpoint, which you can then use to test the export and deployment workflows for NeMo 2.0 models.

## Setup

1. Pull and run the [NeMo Framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

2. Log in to Hugging Face with your access token:

```shell
huggingface-cli login
```
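
If you prefer to authenticate from Python instead of the CLI, the ``huggingface_hub`` package provides an equivalent ``login`` helper; this is a minimal sketch, and the token value is a placeholder you should replace with your own access token:

```python
from huggingface_hub import login

# Log in to Hugging Face; replace the placeholder with your personal access token
login(token="hf_xxxxxxxxxxxxxxxxx")
```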


## Generate Qwen VL Checkpoint (for In-Framework Deployment)

This checkpoint is used for in-framework deployment examples.

3. Run the following Python code to generate the NeMo 2.0 checkpoint:

```python
from nemo.collections.llm import import_ckpt
from nemo.collections import vlm
from pathlib import Path

if __name__ == '__main__':
    # Specify the Hugging Face model ID
    hf_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

    # Import the model and convert to NeMo 2.0 format
    import_ckpt(
        model=vlm.Qwen2VLModel(vlm.Qwen25VLConfig3B(), model_version='qwen25-vl'),
        source=f"hf://{hf_model_id}",  # Hugging Face model source
        output_path=Path('/opt/checkpoints/qwen25_vl_3b'),
    )
```

## Generate Llama 3.2-Vision Checkpoint (for TensorRT-LLM Deployment)

This checkpoint is used for optimized TensorRT-LLM deployment examples.

3. Run the following Python code to generate the NeMo 2.0 checkpoint:

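A minimal sketch of this conversion step is shown below, assuming the ``vlm.MLlamaModel`` and ``vlm.MLlamaConfig11BInstruct`` classes, the ``meta-llama/Llama-3.2-11B-Vision-Instruct`` Hugging Face ID, and a placeholder output path; adjust these to the Llama 3.2-Vision variant and checkpoint location you actually use:

```python
from pathlib import Path

from nemo.collections import vlm
from nemo.collections.llm import import_ckpt

if __name__ == '__main__':
    # Hugging Face model ID for the vision-instruct variant (assumed here)
    hf_model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

    # Import the Hugging Face weights and convert them to NeMo 2.0 format
    import_ckpt(
        model=vlm.MLlamaModel(vlm.MLlamaConfig11BInstruct()),  # assumed model/config classes
        source=f"hf://{hf_model_id}",
        output_path=Path('/opt/checkpoints/mllama_11b'),  # placeholder output path
    )
```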
136 changes: 133 additions & 3 deletions docs/mm/nemo_2/in-framework.md
# Deploy NeMo 2.0 Multimodal Models with Triton Inference Server

This section explains how to deploy [NeMo 2.0](https://github.com/NVIDIA-NeMo/NeMo) multimodal models with the NVIDIA Triton Inference Server.

## Quick Example

1. Follow the steps on the [Generate A NeMo 2.0 Checkpoint page](gen_nemo2_ckpt.md) to generate a NeMo 2.0 multimodal checkpoint.

2. In a terminal, go to the folder where the ``qwen25_vl_3b`` checkpoint is located. Pull and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

```shell
docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
-v ${PWD}/:/opt/checkpoints/ \
-w /opt/Export-Deploy \
--name nemo-fw \
nvcr.io/nvidia/nemo:vr
```

3. Run the following deployment script to verify that everything is working correctly. The script serves the NeMo 2.0 multimodal model directly on the Triton server:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen25_vl_3b --triton_model_name qwen
```

4. If the test yields a shared memory-related error, increase the shared memory size with ``--shm-size`` (for example, in 50% increments).

5. In a separate terminal, access the running container as follows:

```shell
docker exec -it nemo-fw bash
```

6. To send a query to the Triton server, run the following script with an image:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query_inframework.py \
--model_name qwen \
--prompt "Describe this image" \
--image /path/to/image.jpg \
--max_output_len 100
```

## Use a Script to Deploy NeMo 2.0 Multimodal Models on a Triton Inference Server

You can deploy a multimodal model from a NeMo checkpoint on Triton using the provided script.

### Deploy a NeMo Multimodal Model

Executing the script will directly deploy the NeMo 2.0 multimodal model and start the service on Triton.

1. Start the container using the steps described in the **Quick Example** section.

2. To begin serving the downloaded model, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen25_vl_3b --triton_model_name qwen
```

The following parameters are defined in the ``deploy_inframework_triton.py`` script:

- ``-nc``, ``--nemo_checkpoint``: Path to the NeMo 2.0 checkpoint file to deploy. (Required)
- ``-tmn``, ``--triton_model_name``: Name to register the model under in Triton. (Required)
- ``-tmv``, ``--triton_model_version``: Version number for the model in Triton. Default: 1
- ``-sp``, ``--server_port``: Port for the REST server to listen for requests. Default: 8080
- ``-sa``, ``--server_address``: HTTP address for the REST server. Default: 0.0.0.0
- ``-trp``, ``--triton_port``: Port for the Triton server to listen for requests. Default: 8000
- ``-tha``, ``--triton_http_address``: HTTP address for the Triton server. Default: 0.0.0.0
- ``-tps``, ``--tensor_parallel_size``: Tensor parallelism size. Default: 1
- ``-pps``, ``--pipeline_parallel_size``: Pipeline parallelism size. Default: 1
- ``-mbs``, ``--max_batch_size``: Max batch size of the model. Default: 4
- ``-dm``, ``--debug_mode``: Enable debug mode. (Flag; set to enable)
- ``-pd``, ``--params_dtype``: Data type for model parameters. Choices: float16, bfloat16, float32. Default: bfloat16
- ``-ibts``, ``--inference_batch_times_seqlen_threshold``: Inference batch times sequence length threshold. Default: 1000

*Note: Some parameters may be ignored or have no effect depending on the model and deployment environment. Refer to the script's help message for the most up-to-date list.*

3. To deploy a different model, change the ``--nemo_checkpoint`` argument when running the script.


## How To Send a Query

You can send queries to the Triton Inference Server using either the provided script or the available APIs.

### Send a Query using the Script
This script allows you to interact with the multimodal model via HTTP requests, sending prompts and images and receiving generated responses directly from the Triton server.

The example below demonstrates how to use the query script to send a prompt and image to your deployed model. You can customize the request with various parameters to control generation behavior, such as output length, sampling strategy, and more. For a full list of supported parameters, see below.


```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query_inframework.py \
--model_name qwen \
--processor_name Qwen/Qwen2.5-VL-3B-Instruct \
--prompt "What is in this image?" \
--image /path/to/image.jpg \
--max_output_len 100
```

**All Parameters:**
- `-u`, `--url`: URL for the Triton server (default: 0.0.0.0)
- `-mn`, `--model_name`: Name of the Triton model (required)
- `-pn`, `--processor_name`: Processor name for qwen-vl models (default: Qwen/Qwen2.5-VL-7B-Instruct)
- `-p`, `--prompt`: Prompt text (mutually exclusive with --prompt_file; required if --prompt_file not given)
- `-pf`, `--prompt_file`: File to read the prompt from (mutually exclusive with --prompt; required if --prompt not given)
- `-i`, `--image`: Path or URL to input image file (required)
- `-mol`, `--max_output_len`: Max output token length (default: 50)
- `-mbs`, `--max_batch_size`: Max batch size for inference (default: 4)
- `-tk`, `--top_k`: Top-k sampling (default: 1)
- `-tpp`, `--top_p`: Top-p sampling (default: 0.0)
- `-t`, `--temperature`: Sampling temperature (default: 1.0)
- `-rs`, `--random_seed`: Random seed for generation (optional)
- `-it`, `--init_timeout`: Init timeout for the Triton server in seconds (default: 60.0)


### Send a Query using the NeMo APIs

See the example below if you would like to use the NeMo APIs to send a query.

```python
from nemo_deploy.multimodal import NemoQueryMultimodalPytorch
from PIL import Image

nq = NemoQueryMultimodalPytorch(url="localhost:8000", model_name="qwen")
output = nq.query_multimodal(
prompts=["What is in this image?"],
images=[Image.open("/path/to/image.jpg")],
max_length=100,
top_k=1,
top_p=0.0,
temperature=1.0,
)
print(output)
```
45 changes: 17 additions & 28 deletions docs/mm/nemo_2/optimized/tensorrt-llm.md
This section shows how to use scripts and APIs to export a NeMo 2.0 multimodal model to TensorRT-LLM.

The following table shows the supported models.

| Model Name | NeMo Precision | TensorRT Precision |
| :--------------- | -------------- |--------------------|
| Llama 3.2-Vision | bfloat16 | bfloat16 |


### Access the Models with a Hugging Face Token
If you want to run inference using the Llama 3 model, you'll need to generate a Hugging Face token that has access to the model.

### Export and Deploy a NeMo Multimodal Checkpoint to TensorRT-LLM

This section provides an example of how to quickly and easily deploy a NeMo checkpoint to TensorRT. Llama 3.2-Vision will be used as an example model. Please consult the table above for a complete list of supported models.


1. Follow the steps on the [Generate A NeMo 2.0 Checkpoint page](../gen_nemo2_ckpt.md) to generate a NeMo 2.0 Llama Vision Instruct checkpoint.
```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query.py -mn mllama -mt=mllama -int="What is in this image?" -im=/path/to/image.jpg
```



### Use a Script to Run Inference on a Triton Server
After executing the script, it will export the model to TensorRT and then initiate the service on Triton.
2. To begin serving the model, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_mllama.nemo --model_type mllama --triton_model_name mllama
```

The following parameters are defined in the ``deploy_triton.py`` script:

3. To export and deploy a different model, change the *model_type* and *modality* arguments in the *scripts/deploy/multimodal/deploy_triton.py* script. See the table below to learn which *model_type* and *modality* to use for a multimodal model.

| Model Name | model_type | modality |
| :---------------- | ------------ |------------|
| Llama 3.2-Vision | mllama | vision |


4. Stop the running container and then run the following command to specify an empty directory:
```shell

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr

python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_mllama.nemo --model_type mllama --triton_model_name mllama --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --modality vision
```

The checkpoint will be exported to the specified folder after executing the script mentioned above.

5. To load the exported model directly, run the following script within the container:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --triton_model_name mllama --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --model_type mllama --modality vision
```

#### Send a Query
The following example shows how to execute the query script within the currently running container.
1. To use a query script, run the following command. For VILA/LITA/VITA models, the input_text should add ``<image>\n`` before the actual text, such as ``<image>\n What is in this image?``:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query.py --url "http://localhost:8000" --model_name mllama --model_type mllama --input_text "What is in this image?" --input_media /path/to/image.jpg
```

2. Change the url and the ``model_name`` based on your server and the model name of your service. The code in the script can be used as a basis for your client code as well. ``input_media`` is the path to the image or audio file you want to use as input.
Up until now, we have used scripts for exporting and deploying multimodal models. Alternatively, you can use the export and deploy module APIs, as shown below.

#### Export a Multimodal Model to TensorRT

You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes the ``nemo_mllama.nemo`` checkpoint has already been mounted to the ``/opt/checkpoints/`` path. Additionally, the ``/opt/data/image.jpg`` file is assumed to exist.

1. Run the following command:

```python
from nemo_export.tensorrt_mm_exporter import TensorRTMMExporter

exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_neva.nemo", model_type="neva", llm_model_type="llama", tensor_parallel_size=1)
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_mllama.nemo", model_type="mllama" tensor_parallel_size=1)
output = exporter.forward("What is in this image?", "/opt/data/image.jpg", max_output_token=30, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)
```

#### Deploy a Multimodal Model to TensorRT

You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the ``nemo_mllama.nemo`` checkpoint has already been mounted to the ``/opt/checkpoints/`` path.

1. Run the following command:

```python
from nemo_export.tensorrt_mm_exporter import TensorRTMMExporter
from nemo_deploy import DeployPyTriton

exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_mllama.nemo", model_type="mllama", tensor_parallel_size=1)

nm = DeployPyTriton(model=exporter, triton_model_name="mllama", port=8000)
nm.deploy()
nm.serve()
```
The NeMo Framework provides NemoQueryMultimodal APIs to send a query to the Triton Inference Server.
```python
from nemo_deploy.multimodal import NemoQueryMultimodal

nq = NemoQueryMultimodal(url="localhost:8000", model_name="mllama", model_type="mllama")
output = nq.query(input_text="What is in this image?", input_media="/opt/data/image.jpg", max_output_len=30, top_k=1, top_p=0.0, temperature=1.0)
print(output)
```