Merged
Changes from 4 commits
26 changes: 0 additions & 26 deletions examples/compress/Dockerfile

This file was deleted.

48 changes: 37 additions & 11 deletions examples/compress/README.md
@@ -1,19 +1,33 @@
# Compress Algorithm Tutorial

This tutorial demonstrates how to compress large language models using the compress algorithm based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
This tutorial demonstrates how to compress large language models using the Compress algorithm based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
The goal of the algorithm is to find the best modifications to the MLP and attention layers of the model, resulting in a heterogeneous model architecture.
The supported modifications are:

In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model by searching for the optimal `ffn_intermediate_size` across MLP layers and `attention op/noop`. This results in a heterogeneous architecture while reducing GPU memory usage from 113 GiB to 96 GiB (15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.
- `ffn_intermediate_size`: different FFN intermediate sizes
- `attention op/noop`: complete removal of attention layers

To use the Puzzle algorithm effectively, we need to specify the target number of parameters and/or the target memory. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the best combination of layer modifications that satisfies the target requirements.

In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.
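
To make the MIP stage described above more concrete, here is a toy sketch of the kind of selection problem it solves: pick exactly one variant per slot so that the total score is maximized while the total memory stays within a budget. The variant names, scores, and memory costs are invented for illustration, and the exhaustive search merely stands in for a real MIP solver. This is not the Compress implementation.

```python
from itertools import product

# Hypothetical candidates per slot: (name, quality score, memory cost in MiB).
# All numbers are made up for illustration only.
SLOT_CANDIDATES = [
    [("ffn_14336", 1.00, 400), ("ffn_11520", 0.97, 330), ("ffn_3072", 0.80, 120)],
    [("attn_op", 1.00, 300), ("attn_noop", 0.90, 0)],
    [("ffn_14336", 1.00, 400), ("ffn_11520", 0.99, 330), ("ffn_3072", 0.85, 120)],
]


def select_variants(slots, memory_budget):
    """Return (names, total score, total memory) of the best feasible combination, or None."""
    best = None
    # Enumerate every way of picking one candidate per slot (fine for a toy problem).
    for combo in product(*slots):
        memory = sum(candidate[2] for candidate in combo)
        if memory > memory_budget:
            continue  # violates the memory constraint
        score = sum(candidate[1] for candidate in combo)
        if best is None or score > best[1]:
            best = ([candidate[0] for candidate in combo], score, memory)
    return best


choice, score, memory = select_variants(SLOT_CANDIDATES, memory_budget=1000)
print(choice, round(score, 2), memory)
```

With a budget of 1000, the unmodified combination (1100) does not fit, so the search settles for two mildly pruned FFN variants rather than dropping attention. A real solver reaches the same kind of answer without enumerating every combination.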

## Environment

- [Dockerfile](./Dockerfile) to use.
- 2x NVIDIA H100 80GB HBM3 (1 card will be good as well).
- Install TensorRT-Model-Optimizer in editable mode with the corresponding dependencies:
```bash
pip install -e .[dev,compress]
```
- For this example we are using 2x NVIDIA H100 80GB HBM3 to show the multi-GPU steps. You can also use a single GPU.

## Compress the Model

1. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.

Let's first shoot for 32% GPU memory reduction setting `target_memory = 78_000` GiB.
**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes that we wish to search over. It is recommended to choose several pruning sizes (e.g., 15%, 20%, 30% of the original). Note that the values must be hardware-friendly (divisible by 256) to avoid issues with tensor operations in subsequent steps.

Let's first aim for a 32% GPU memory reduction by setting `target_memory = 78_000` MiB. This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirements.
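
As a rough aid for the `intermediate_size_list` note above, the sketch below rounds fractions of the original FFN size to multiples of 256. The helper name and the example fractions are assumptions made for illustration (the fractions are treated here as the share of the original size that is kept), and it is not part of the Compress tooling.

```python
def candidate_ffn_sizes(original_size, keep_fractions, multiple=256):
    """Round each fraction of the original FFN size to the nearest positive multiple of `multiple`."""
    sizes = set()
    for fraction in keep_fractions:
        rounded = round(original_size * fraction / multiple) * multiple
        # Clamp so that a candidate is never zero and never larger than the original.
        sizes.add(max(multiple, min(original_size, rounded)))
    return sorted(sizes)


# 14336 is the FFN intermediate size of Llama-3.1-8B; the fractions are illustrative.
print(candidate_ffn_sizes(14336, [0.2, 0.5, 0.8]))
```

The resulting values could then be copied into `intermediate_size_list` in the YAML configuration.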

2. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

@@ -23,7 +37,7 @@ In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://hugg
python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

3. Run the compression script.
3. Run the compression script.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
@@ -42,7 +56,7 @@ In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://hugg
[2025-11-02 12:52:34] Compress Progress 8/8: compression pipeline completed (multi-gpu)
```
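
Because the full output is also saved to `log.txt`, the progress lines can be extracted again later. The following is a minimal sketch that assumes every progress line follows the format of the sample line shown above; the function name is hypothetical.

```python
import re

# Matches lines such as:
# [2025-11-02 12:52:34] Compress Progress 8/8: compression pipeline completed (multi-gpu)
PROGRESS_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+Compress Progress (?P<step>\d+)/(?P<total>\d+):\s*(?P<message>.*)"
)


def parse_progress(lines):
    """Yield (step, total, message) for every 'Compress Progress' line."""
    for line in lines:
        match = PROGRESS_RE.search(line.strip())
        if match:
            yield int(match["step"]), int(match["total"]), match["message"]


sample = ["[2025-11-02 12:52:34] Compress Progress 8/8: compression pipeline completed (multi-gpu)"]
print(list(parse_progress(sample)))
```

To apply it to a real run, pass an open file handle for `log.txt` instead of `sample`.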

This will generate the following network architecture (see `log.txt`):
Once the process is complete, the resulting network architecture will be recorded in `log.txt` for your review:

```bash
...
@@ -96,12 +110,12 @@ In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://hugg

A 30% GPU memory reduction leads to a nearly 5% regression in the token_accuracy_top_10 metric (0.898 / 0.942). Let's rerun the MIP search, aiming for a 15% memory reduction.

## Re-run MIP Search with different memory constraints
## Re-run MIP Search with different constraints

If you want to try different memory constraints without re-running the expensive pruning and scoring steps, use the `--mip-only` flag.
If you want to try different constraints without re-running the expensive pruning and scoring steps, use the `--mip-only` flag.
This assumes pruning, replacement library building, NAS scoring, and subblock stats calculation have already been completed.

Set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.
For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Compress Progress"
@@ -151,7 +165,7 @@ validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True)
Average losses = {'lm_loss': 1.2425934937782586, 'token_accuracy_top_1': 0.703862190246582, 'token_accuracy_top_5': 0.8954982757568359, 'token_accuracy_top_10': 0.9336576461791992
```
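
The regression figures quoted in this tutorial can be checked with a few lines of arithmetic. This sketch assumes that 0.942 is the token_accuracy_top_10 of the uncompressed model, as the "0.898 / 0.942" notation above suggests; the helper name is hypothetical.

```python
def relative_regression(baseline, compressed):
    """Fractional drop of `compressed` relative to `baseline`."""
    return 1.0 - compressed / baseline


BASELINE_TOP10 = 0.942  # assumed to be the score of the original model

# First run (target_memory = 78_000): roughly a 5% regression.
print(f"{relative_regression(BASELINE_TOP10, 0.898):.1%}")
# Second run (target_memory = 96_000): under 1% regression.
print(f"{relative_regression(BASELINE_TOP10, 0.9336576461791992):.1%}")
```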

On the other hand, if you set `target_memory: 28_000`, you would observe that for some layers the intermediate FFN size starts to reduce (see `log.txt`):
On the other hand, if you set `target_memory: 28_000`, you'll observe that the intermediate FFN sizes are significantly reduced in certain layers (see `log.txt` for details):

```bash
block_5: attention no_op ffn intermediate_11520
@@ -166,6 +180,18 @@ block_13: attention no_op ffn intermediate_11520
block_14: attention no_op ffn intermediate_3072
```
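
To summarize such an architecture without reading `log.txt` by hand, the block lines can be parsed into a small table. This is a sketch that assumes every block line follows the `block_N: attention <variant> ffn <variant>` pattern seen above; other line formats in the log are simply skipped.

```python
import re

BLOCK_RE = re.compile(
    r"block_(?P<index>\d+):\s+attention\s+(?P<attention>\S+)\s+ffn\s+(?P<ffn>\S+)"
)


def parse_architecture(lines):
    """Map block index -> (attention variant, ffn variant) for matching log lines."""
    blocks = {}
    for line in lines:
        match = BLOCK_RE.search(line)
        if match:
            blocks[int(match["index"])] = (match["attention"], match["ffn"])
    return blocks


sample = [
    "block_5: attention no_op ffn intermediate_11520",
    "block_13: attention no_op ffn intermediate_11520",
    "block_14: attention no_op ffn intermediate_3072",
]
arch = parse_architecture(sample)
# Blocks whose attention layer was removed entirely.
removed_attention = [index for index, (attention, _) in arch.items() if attention == "no_op"]
print(arch[14], removed_attention)
```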

## Evaluation

Once the model is ready, you can evaluate it using the [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on the [Massive Multitask Language Understanding](https://huggingface.co/datasets/cais/mmlu) benchmark.

```bash
lm_eval --model hf \
--model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
```

## Advanced usage

Modify the `path/to/Llama-3_1-8B yaml` file for advanced compression scenarios.
2 changes: 2 additions & 0 deletions examples/pruning/README.md
@@ -23,6 +23,8 @@ This section focuses on applying Model Optimizer's state-of-the-art complementar

</div>

For more advanced pruning strategies, such as the [Puzzle methodology](https://arxiv.org/pdf/2411.19146), please see [Puzzle pruning example](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/feature/compress/examples/compress).

## Pre-Requisites

For Minitron pruning for Megatron-LM / NeMo models, use the NeMo container (e.g., `nvcr.io/nvidia/nemo:25.07`) which has all the dependencies installed.
4 changes: 4 additions & 0 deletions setup.py
@@ -104,6 +104,10 @@
"fire",
"hydra-core==1.3.2",
"omegaconf==2.3.0",
"wandb~=0.17.5",
"frozendict>=2.4.4",
"lru-dict",
"mip",

@kevalmorabia97 kevalmorabia97 Nov 12, 2025

Collaborator

mip installation fails in the pytorch docker container: https://github.com/NVIDIA/TensorRT-Model-Optimizer/actions/runs/19281849124/job/55134742589?pr=539#step:7:591

Does it need any additional linux dependencies?

Contributor

Collaborator

I checked. We need to `apt install libffi-dev` before we `pip install mip`.

Collaborator

Let's remove `frozendict` and `mip` from this list; we will deal with them when we move the other components.

],
}
