31 changes: 30 additions & 1 deletion docs/mm/nemo_2/gen_nemo2_ckpt.md

To run the code examples, you will need a NeMo 2.0 checkpoint. Follow the steps below to generate a NeMo 2.0 checkpoint, which you can then use to test the export and deployment workflows for NeMo 2.0 models.

## Setup

1. Pull and run the [NeMo Framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

2. Log in to Hugging Face with your access token:

```shell
huggingface-cli login
```
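
If you prefer to authenticate from Python instead of the CLI, the ``huggingface_hub`` package provides an equivalent ``login`` helper; this is a minimal sketch, and the token value is a placeholder you should replace with your own access token:

```python
from huggingface_hub import login

# Log in to Hugging Face; replace the placeholder with your personal access token
login(token="hf_xxxxxxxxxxxxxxxxx")
```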


## Generate Qwen VL Checkpoint (for In-Framework Deployment)

This checkpoint is used for in-framework deployment examples.

3. Run the following Python code to generate the NeMo 2.0 checkpoint:

```python
from nemo.collections.llm import import_ckpt
from nemo.collections import vlm
from pathlib import Path

if __name__ == '__main__':
    # Specify the Hugging Face model ID
    hf_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

    # Import the model and convert to NeMo 2.0 format
    import_ckpt(
        model=vlm.Qwen2VLModel(vlm.Qwen25VLConfig3B(), model_version='qwen25-vl'),
        source=f"hf://{hf_model_id}",  # Hugging Face model source
        output_path=Path('/opt/checkpoints/qwen25_vl_3b'),
    )
```

## Generate Llama 3.2-Vision Checkpoint (for TensorRT-LLM Deployment)

This checkpoint is used for optimized TensorRT-LLM deployment examples.

3. Run the following Python code to generate the NeMo 2.0 checkpoint:

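A minimal sketch of this conversion step is shown below, assuming the ``vlm.MLlamaModel`` and ``vlm.MLlamaConfig11BInstruct`` classes, the ``meta-llama/Llama-3.2-11B-Vision-Instruct`` Hugging Face ID, and a placeholder output path; adjust these to the Llama 3.2-Vision variant and checkpoint location you actually use:

```python
from pathlib import Path

from nemo.collections import vlm
from nemo.collections.llm import import_ckpt

if __name__ == '__main__':
    # Hugging Face model ID for the vision-instruct variant (assumed here)
    hf_model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

    # Import the Hugging Face weights and convert them to NeMo 2.0 format
    import_ckpt(
        model=vlm.MLlamaModel(vlm.MLlamaConfig11BInstruct()),  # assumed model/config classes
        source=f"hf://{hf_model_id}",
        output_path=Path('/opt/checkpoints/mllama_11b'),  # placeholder output path
    )
```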
136 changes: 133 additions & 3 deletions docs/mm/nemo_2/in-framework.md
# Deploy NeMo 2.0 Multimodal Models with Triton Inference Server

This section explains how to deploy [NeMo 2.0](https://github.com/NVIDIA-NeMo/NeMo) multimodal models with the NVIDIA Triton Inference Server.

## Quick Example

1. Follow the steps on the [Generate A NeMo 2.0 Checkpoint page](gen_nemo2_ckpt.md) to generate a NeMo 2.0 multimodal checkpoint.

2. In a terminal, go to the folder where the ``qwen25_vl_3b`` checkpoint is located. Pull and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

```shell
docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
-v ${PWD}/:/opt/checkpoints/ \
-w /opt/Export-Deploy \
--name nemo-fw \
nvcr.io/nvidia/nemo:vr
```

3. Run the following deployment script to verify that everything is working correctly. The script serves the NeMo 2.0 multimodal model directly on the Triton server:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen25_vl_3b --triton_model_name qwen
```

4. If the test yields a shared memory-related error, increase the shared memory size with ``--shm-size`` (for example, in 50% increments).

5. In a separate terminal, access the running container as follows:

```shell
docker exec -it nemo-fw bash
```

6. To send a query to the Triton server, run the following script with an image:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query_inframework.py \
--model_name qwen \
--prompt "Describe this image" \
--image /path/to/image.jpg \
--max_output_len 100
```

## Use a Script to Deploy NeMo 2.0 Multimodal Models on a Triton Inference Server

You can deploy a multimodal model from a NeMo checkpoint on Triton using the provided script.

### Deploy a NeMo Multimodal Model

Executing the script will directly deploy the NeMo 2.0 multimodal model and start the service on Triton.

1. Start the container using the steps described in the **Quick Example** section.

2. To begin serving the downloaded model, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen25_vl_3b --triton_model_name qwen
```

The following parameters are defined in the ``deploy_inframework_triton.py`` script:

- ``-nc``, ``--nemo_checkpoint``: Path to the NeMo 2.0 checkpoint file to deploy. (Required)
- ``-tmn``, ``--triton_model_name``: Name to register the model under in Triton. (Required)
- ``-tmv``, ``--triton_model_version``: Version number for the model in Triton. Default: 1
- ``-sp``, ``--server_port``: Port for the REST server to listen for requests. Default: 8080
- ``-sa``, ``--server_address``: HTTP address for the REST server. Default: 0.0.0.0
- ``-trp``, ``--triton_port``: Port for the Triton server to listen for requests. Default: 8000
- ``-tha``, ``--triton_http_address``: HTTP address for the Triton server. Default: 0.0.0.0
- ``-tps``, ``--tensor_parallel_size``: Tensor parallelism size. Default: 1
- ``-pps``, ``--pipeline_parallel_size``: Pipeline parallelism size. Default: 1
- ``-mbs``, ``--max_batch_size``: Max batch size of the model. Default: 4
- ``-dm``, ``--debug_mode``: Enable debug mode. (Flag; set to enable)
- ``-pd``, ``--params_dtype``: Data type for model parameters. Choices: float16, bfloat16, float32. Default: bfloat16
- ``-ibts``, ``--inference_batch_times_seqlen_threshold``: Inference batch times sequence length threshold. Default: 1000

*Note: Some parameters may be ignored or have no effect depending on the model and deployment environment. Refer to the script's help message for the most up-to-date list.*

3. To deploy a different model, change the ``--nemo_checkpoint`` argument when running the script.


## How To Send a Query

You can send queries to the Triton Inference Server using either the provided script or the available APIs.

### Send a Query using the Script
This script allows you to interact with the multimodal model via HTTP requests, sending prompts and images and receiving generated responses directly from the Triton server.

The example below demonstrates how to use the query script to send a prompt and image to your deployed model. You can customize the request with various parameters to control generation behavior, such as output length, sampling strategy, and more. For a full list of supported parameters, see below.


```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query_inframework.py \
--model_name qwen \
--processor_name Qwen/Qwen2.5-VL-3B-Instruct \
--prompt "What is in this image?" \
--image /path/to/image.jpg \
--max_output_len 100
```

**All Parameters:**
- `-u`, `--url`: URL for the Triton server (default: 0.0.0.0)
- `-mn`, `--model_name`: Name of the Triton model (required)
- `-pn`, `--processor_name`: Processor name for qwen-vl models (default: Qwen/Qwen2.5-VL-7B-Instruct)
- `-p`, `--prompt`: Prompt text (mutually exclusive with --prompt_file; required if --prompt_file not given)
- `-pf`, `--prompt_file`: File to read the prompt from (mutually exclusive with --prompt; required if --prompt not given)
- `-i`, `--image`: Path or URL to input image file (required)
- `-mol`, `--max_output_len`: Max output token length (default: 50)
- `-mbs`, `--max_batch_size`: Max batch size for inference (default: 4)
- `-tk`, `--top_k`: Top-k sampling (default: 1)
- `-tpp`, `--top_p`: Top-p sampling (default: 0.0)
- `-t`, `--temperature`: Sampling temperature (default: 1.0)
- `-rs`, `--random_seed`: Random seed for generation (optional)
- `-it`, `--init_timeout`: Init timeout for the Triton server in seconds (default: 60.0)


### Send a Query using the NeMo APIs

See the example below if you would like to use the NeMo APIs to send a query.

```python
from nemo_deploy.multimodal import NemoQueryMultimodalPytorch
from PIL import Image

nq = NemoQueryMultimodalPytorch(url="localhost:8000", model_name="qwen")
output = nq.query_multimodal(
prompts=["What is in this image?"],
images=[Image.open("/path/to/image.jpg")],
max_length=100,
top_k=1,
top_p=0.0,
temperature=1.0,
)
print(output)
```
45 changes: 17 additions & 28 deletions docs/mm/nemo_2/optimized/tensorrt-llm.md
This section shows how to use scripts and APIs to export a NeMo 2.0 multimodal model to TensorRT-LLM.

The following table shows the supported models.

| Model Name | NeMo Precision | TensorRT Precision |
| :--------------- | -------------- |--------------------|
| Llama 3.2-Vision | bfloat16 | bfloat16 |


### Access the Models with a Hugging Face Token
If you want to run inference using the Llama 3 model, you'll need to generate a Hugging Face token that has access to the model.

### Export and Deploy a NeMo Multimodal Checkpoint to TensorRT-LLM

This section provides an example of how to quickly and easily deploy a NeMo checkpoint to TensorRT. Llama 3.2-Vision will be used as an example model. Please consult the table above for a complete list of supported models.


1. Follow the steps on the [Generate A NeMo 2.0 Checkpoint page](../gen_nemo2_ckpt.md) to generate a NeMo 2.0 Llama Vision Instruct checkpoint.
```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query.py -mn mllama -mt=mllama -int="What is in this image?" -im=/path/to/image.jpg
```



### Use a Script to Run Inference on a Triton Server
After executing the script, it will export the model to TensorRT and then initiate the service on Triton.
2. To begin serving the model, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_mllama.nemo --model_type mllama --triton_model_name mllama
```

The following parameters are defined in the ``deploy_triton.py`` script:

3. To export and deploy a different model, change the *model_type* and *modality* arguments in the *scripts/deploy/multimodal/deploy_triton.py* script. See the table below to learn which *model_type* and *modality* to use for a multimodal model.

| Model Name | model_type | modality |
| :---------------- | ------------ |------------|
| Llama 3.2-Vision | mllama | vision |


4. Stop the running container and then run the following command to specify an empty directory:
```shell

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr

python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_mllama.nemo --model_type mllama --triton_model_name mllama --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --modality vision
```

The checkpoint will be exported to the specified folder after executing the script mentioned above.

5. To load the exported model directly, run the following script within the container:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --triton_model_name mllama --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --model_type mllama --modality vision
```

#### Send a Query
The following example shows how to execute the query script within the currently running container.
1. To use a query script, run the following command. For VILA/LITA/VITA models, the input_text should add ``<image>\n`` before the actual text, such as ``<image>\n What is in this image?``:

```shell
python /opt/Export-Deploy/scripts/deploy/multimodal/query.py --url "http://localhost:8000" --model_name mllama --model_type mllama --input_text "What is in this image?" --input_media /path/to/image.jpg
```

2. Change the url and the ``model_name`` based on your server and the model name of your service. The code in the script can be used as a basis for your client code as well. ``input_media`` is the path to the image or audio file you want to use as input.
Up until now, we have used scripts for exporting and deploying multimodal models. Alternatively, you can use the export and deploy module APIs, as shown below.

#### Export a Multimodal Model to TensorRT

You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes the ``nemo_mllama.nemo`` checkpoint has already been mounted to the ``/opt/checkpoints/`` path. Additionally, the ``/opt/data/image.jpg`` file is assumed to exist.

1. Run the following command:

```python
from nemo_export.tensorrt_mm_exporter import TensorRTMMExporter

exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_neva.nemo", model_type="neva", llm_model_type="llama", tensor_parallel_size=1)
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_mllama.nemo", model_type="mllama" tensor_parallel_size=1)
output = exporter.forward("What is in this image?", "/opt/data/image.jpg", max_output_token=30, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)
```

#### Deploy a Multimodal Model to TensorRT

You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the ``nemo_mllama.nemo`` checkpoint has already been mounted to the ``/opt/checkpoints/`` path.

1. Run the following command:

```python
from nemo_export.tensorrt_mm_exporter import TensorRTMMExporter
from nemo_deploy import DeployPyTriton

exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_mllama.nemo", model_type="mllama", tensor_parallel_size=1)

nm = DeployPyTriton(model=exporter, triton_model_name="mllama", port=8000)
nm.deploy()
nm.serve()
```
The NeMo Framework provides NemoQueryMultimodal APIs to send a query to the Triton Inference Server.
```python
from nemo_deploy.multimodal import NemoQueryMultimodal

nq = NemoQueryMultimodal(url="localhost:8000", model_name="mllama", model_type="mllama")
output = nq.query(input_text="What is in this image?", input_media="/opt/data/image.jpg", max_output_len=30, top_k=1, top_p=0.0, temperature=1.0)
print(output)
```