Merged
116 commits
e82164f
Add anymodel directories to feature/puzzletron
danielkorzekwa Mar 4, 2026
2099df3
Make any_model conversion working.
danielkorzekwa Mar 5, 2026
eb5cf8a
Update child_init.py with anymodel version
danielkorzekwa Mar 5, 2026
c9de41c
fix attention pruning
danielkorzekwa Mar 5, 2026
3c1bc1f
Add trust_remote_code to load_model_config (default to false)
danielkorzekwa Mar 5, 2026
8357136
Make activation scoring working
danielkorzekwa Mar 5, 2026
6cc2194
Comment all tested models aside of llama_3_1_8b_instruct
danielkorzekwa Mar 5, 2026
ee4e1e3
Delete not needed decilm test
danielkorzekwa Mar 5, 2026
449b523
Fix broken tests
danielkorzekwa Mar 5, 2026
fb27bba
Update puzzletron_nas_pluging to any_model version
danielkorzekwa Mar 5, 2026
b350f82
Correct test resources used by tests.
danielkorzekwa Mar 5, 2026
fafe5a3
Disable puzzletron tests (will be enabled after all any_model logic i…
danielkorzekwa Mar 5, 2026
e988248
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
c717852
Comment out not implemented models.
danielkorzekwa Mar 6, 2026
030f126
format python docs
danielkorzekwa Mar 6, 2026
8dcdfbf
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
70df0df
Use trust_remote_code in force_cache_dynamic_modules()
danielkorzekwa Mar 6, 2026
bb56662
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
ecd953e
Fix anymodel pruning
danielkorzekwa Mar 6, 2026
ee8f538
Fix buid docs issue.
danielkorzekwa Mar 6, 2026
c9b76a1
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
6e3af61
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 6, 2026
0ad6d92
Merging build_library_and_stats
danielkorzekwa Mar 6, 2026
995eb1a
Merging anymodel: calc_one_block_scores
danielkorzekwa Mar 6, 2026
34081c9
Mering any_model: calc_one_block_scores
danielkorzekwa Mar 6, 2026
ed5c00f
merge any_model: mip_and_realize_models
danielkorzekwa Mar 6, 2026
993b5ec
Add all anymodel models but gptoss
danielkorzekwa Mar 6, 2026
6e9f03b
Make nemotron-nano-12b-v2 to work (set trust_remote_code=true)
danielkorzekwa Mar 9, 2026
e8b7a7d
merge anymodel for nemotron-3-nano-30b-a3b-base-bf16
danielkorzekwa Mar 9, 2026
47414d5
Clarify readme and avoid reusing the same reference in llama_converter.
danielkorzekwa Mar 9, 2026
a8305d8
Fix tied-embedding handling before writing the safetensors index.
danielkorzekwa Mar 9, 2026
68421a5
Fix NaN ranking currently selects NaNs as “best” experts by default.
danielkorzekwa Mar 9, 2026
d6b8028
Code clean up.
danielkorzekwa Mar 9, 2026
ecd2341
Code clean up.
danielkorzekwa Mar 10, 2026
f9d845d
code clean up
danielkorzekwa Mar 10, 2026
d171b01
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 10, 2026
722da90
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
934ab2f
code clean up
danielkorzekwa Mar 10, 2026
0f14ec3
Merge branch 'dkorzekwa/anymodel_pruning' into dkorzekwa/anymodel_bui…
danielkorzekwa Mar 10, 2026
dcb9e02
remove not needed comment
danielkorzekwa Mar 10, 2026
0c9ea5d
Merge branch 'dkorzekwa/anymodel_build_library_and_stats' into dkorze…
danielkorzekwa Mar 10, 2026
5b310e2
Merge branch 'dkorzekwa/any_model_calc_one_block_scores' into dkorzek…
danielkorzekwa Mar 10, 2026
4f82b1c
Merge branch 'dkorzekwa/mip_and_realize_models' into dkorzekwa/any_mo…
danielkorzekwa Mar 10, 2026
176a435
Fix a broken test_puzzletron test on 2 gpus.
danielkorzekwa Mar 10, 2026
02e2c9b
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
92c4419
Merge branch 'dkorzekwa/anymodel_pruning' into dkorzekwa/anymodel_bui…
danielkorzekwa Mar 10, 2026
aa1eb3e
Merge branch 'dkorzekwa/anymodel_build_library_and_stats' into dkorze…
danielkorzekwa Mar 10, 2026
2b84a96
Merge branch 'dkorzekwa/any_model_calc_one_block_scores' into dkorzek…
danielkorzekwa Mar 10, 2026
fb838c0
Merge branch 'dkorzekwa/mip_and_realize_models' into dkorzekwa/any_mo…
danielkorzekwa Mar 10, 2026
13378ff
Add gpt-oss model
danielkorzekwa Mar 11, 2026
47ca0e3
Add comments about a broken test
danielkorzekwa Mar 11, 2026
96112f7
Fix a broken gptoss test
danielkorzekwa Mar 12, 2026
cb6b182
Add mamba to puzzletron dependencies.
danielkorzekwa Mar 12, 2026
670bb34
Update mamba-ssm and casual-conv1d dependences (remove pinpoint versi…
danielkorzekwa Mar 13, 2026
0e1b591
Install mamba-ssm and causal-conv1d in testenv:cuda13-gpu-puzzletron
danielkorzekwa Mar 13, 2026
ca845ec
Fix installing dependencies in testenv:cuda13-gpu-puzzletron
danielkorzekwa Mar 13, 2026
be825bc
Fix anymodel for qwen3 8B in 2 gpus
danielkorzekwa Mar 13, 2026
7fd1afa
Fix pipeline parallelism issue for wen3-vl-30b-a3b-instruct-qwen3_vl-…
danielkorzekwa Mar 13, 2026
7d7b609
Fix multi-gpu issue for nemotron-nano-12b-v2
danielkorzekwa Mar 13, 2026
249af9d
Fix no_op in any_model
danielkorzekwa Mar 13, 2026
b80583c
Merge branch 'feature/puzzletron' into dkorzekwa/any_model_other_models
danielkorzekwa Mar 13, 2026
88b1b13
Merge any_model tutorial
danielkorzekwa Mar 13, 2026
c0da9c0
Merge mbridge distillation for any_model
danielkorzekwa Mar 13, 2026
1dd742e
Fix nemotron_h_model_descriptor.
danielkorzekwa Mar 14, 2026
4a6ebbe
Fix tox -e build-docs
danielkorzekwa Mar 14, 2026
585f0ed
pin mamba/casual-conv1d versions to fix failing assertion for test_pu…
danielkorzekwa Mar 14, 2026
7fb5d9a
Fix for installing mamba-ssm
danielkorzekwa Mar 14, 2026
75d3d69
Fix broken test for nemotron-3-nano-30b-a3b-base-bf16
danielkorzekwa Mar 14, 2026
0e5722d
code clean up
danielkorzekwa Mar 14, 2026
2dd9735
Make test_puzzletron test deterministic
danielkorzekwa Mar 15, 2026
3561de5
Comment out all models but nemotron-3-nano-30b-a3b-base-bf16 to check…
danielkorzekwa Mar 15, 2026
27866de
Implement Qwen3VLRemoveExpertsIndependentHook
danielkorzekwa Mar 15, 2026
f5fbbcf
MR branch for the remaining difference between dkorzekwa/any_model an…
danielkorzekwa Mar 16, 2026
a012fe6
Remove not needed nvidia licence header
danielkorzekwa Mar 16, 2026
52922a4
# Initialize weights to ensure all parameters are properly initialized
danielkorzekwa Mar 16, 2026
c234fb4
Fix non-deterministic test_puzzletron test
danielkorzekwa Mar 16, 2026
53dcd10
Fix for unsetting CUDA_VISIBLE_DEVICES
danielkorzekwa Mar 16, 2026
69d9648
increase numeric tolerance for test_puzzletron.py
danielkorzekwa Mar 17, 2026
4a692dc
Disable lm_loss assertion for nemotron-3-nano-30b-a3b-base-bf16 (not …
danielkorzekwa Mar 17, 2026
e795f0c
Removing incorrect licence file. gpt_oss_pruned_to_mxfp4.py was not a…
danielkorzekwa Mar 17, 2026
631306c
Fix hardcoded trust_remote_code
danielkorzekwa Mar 17, 2026
dc77be2
Merge branch 'dkorzekwa/any_model_other_models' into dkorzekwa/anymod…
danielkorzekwa Mar 17, 2026
b76e0ef
Merge branch 'dkorzekwa/anymodel_gptoss' into dkorzekwa/anymodel_tuto…
danielkorzekwa Mar 17, 2026
109b185
Merge branch 'dkorzekwa/anymodel_tutorial' into dkorzekwa/anymodel_mb…
danielkorzekwa Mar 17, 2026
b0972e4
Merge branch 'dkorzekwa/anymodel_mbridgedist' into dkorzekwa/remainin…
danielkorzekwa Mar 17, 2026
5cadc65
Merge branch 'feature/puzzletron' into dkorzekwa/anymodel_gptoss
danielkorzekwa Mar 17, 2026
151081c
Delete not needed yaml files for test_puzzletron.
danielkorzekwa Mar 17, 2026
36daa6d
Delete not needed mypy exclusion for removed hf_configs files.
danielkorzekwa Mar 17, 2026
960b8ce
Merge branch 'dkorzekwa/anymodel_gptoss' into dkorzekwa/anymodel_tuto…
danielkorzekwa Mar 17, 2026
854d96b
Merge branch 'dkorzekwa/anymodel_tutorial' into dkorzekwa/anymodel_mb…
danielkorzekwa Mar 17, 2026
cf06997
Merge branch 'dkorzekwa/anymodel_mbridgedist' into dkorzekwa/remainin…
danielkorzekwa Mar 17, 2026
b47f846
Merge branch 'feature/puzzletron' into dkorzekwa/anymodel_tutorial
danielkorzekwa Mar 17, 2026
13f5edc
Merge branch 'dkorzekwa/anymodel_tutorial' into dkorzekwa/anymodel_mb…
danielkorzekwa Mar 17, 2026
b4c71cc
Merge branch 'dkorzekwa/anymodel_mbridgedist' into dkorzekwa/remainin…
danielkorzekwa Mar 17, 2026
67444f4
Delete not used decilm dummy blocks and create_dummy_model()
danielkorzekwa Mar 17, 2026
944f6f9
Delete not used decilm converters
danielkorzekwa Mar 17, 2026
7ee045a
Delete not used decilm code
danielkorzekwa Mar 17, 2026
cd1bf88
removing decilm not used code.
danielkorzekwa Mar 18, 2026
e2fa0b3
Remove dead decilm code
danielkorzekwa Mar 18, 2026
fb48618
Delete megatron_lm_tokenizer
danielkorzekwa Mar 18, 2026
5297a1c
Delete nemo export/import for decilm version of puzzletron
danielkorzekwa Mar 18, 2026
cbba0b0
Delete dead code.
danielkorzekwa Mar 18, 2026
e0fb3c1
Delete DeciLMForCausalLM
danielkorzekwa Mar 19, 2026
dbaab53
Remove unused save_checkpoint_as_symlinks()
danielkorzekwa Mar 19, 2026
9c943fd
code clean up
danielkorzekwa Mar 19, 2026
098d7c1
remove megatron_tokenizer
danielkorzekwa Mar 19, 2026
5d0efa1
Delete copy_deci_lm_hf_code
danielkorzekwa Mar 19, 2026
ead68bb
Delete DeciLMPreTrainModel and DeciLMModel
danielkorzekwa Mar 19, 2026
2d91afc
Delete not used code from replacement_library.py
danielkorzekwa Mar 19, 2026
492cbaf
Delete not used decilm code
danielkorzekwa Mar 19, 2026
1834c76
Delete not used decilm code
danielkorzekwa Mar 19, 2026
f096d11
remove dead replacement_library code
danielkorzekwa Mar 19, 2026
dc52a81
Delete not used transformers code
danielkorzekwa Mar 19, 2026
b9178a3
Delete unused decilm code
danielkorzekwa Mar 19, 2026
9c496bb
Import clean up.
danielkorzekwa Mar 19, 2026
d6ccd8f
Merge branch 'feature/puzzletron' into dkorzekwa/decilm_hf_code_clean…
danielkorzekwa Mar 23, 2026
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -112,6 +112,8 @@ repos:
examples/speculative_decoding/main.py|
examples/speculative_decoding/medusa_utils.py|
examples/speculative_decoding/server_generate.py|
examples/puzzletron/evaluation/lm_eval_anymodel.py|
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
experimental/dms/models/qwen3/configuration_qwen3_dms.py|
experimental/dms/models/qwen3/modeling_qwen3_dms.py|
)$
14 changes: 14 additions & 0 deletions examples/puzzletron/GPTOSS.md
@@ -0,0 +1,14 @@

## GptOss

With this release, the Puzzle algorithm supports only expert removal for `Gpt-Oss`.

This model ships as a quantized checkpoint: the MoE expert matrices are stored in the _MXFP4_ format.
During the pruning steps, Puzzle uses a decompressed (BF16) copy of the model to compute statistics and scores.
To that end, the conversion to the Puzzle format decompresses the model and stores it as BF16.
Once pruning is finished, i.e. the experts to remove have been identified, you may want to convert the checkpoint back to the _MXFP4_ format.
An additional script does this: it takes the original and the pruned checkpoints and writes the pruned checkpoint in the _MXFP4_ format.

```bash
python -m modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_pruned_to_mxfp4 \
  --student-path /workspaces/any_model_gpt_oss/mip/puzzle_solutions/stats_num_params_18014757184/solutions--checkpoints/solution_0/ \
  --original-path /workspaces/source_model_checkpoints/openai_gpt-oss-20b/ \
  --output-path /workspaces/any_model_gpt_oss/mip/puzzle_solutions/stats_num_params_18014757184/solutions--checkpoints/mxfp4-ckpt/ \
  --num-layers 24
```
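
To make the "decompress to BF16" step more concrete, here is a minimal sketch of how one MXFP4 block can be dequantized. It is not the code in `gpt_oss_pruned_to_mxfp4.py`. It assumes the OCP Microscaling layout (32 E2M1 values sharing one E8M0 power-of-two scale); the function names and the nibble order within a byte are assumptions made for illustration, and special scale encodings such as NaN are ignored.

```python
# Illustrative sketch only; names and nibble order are assumptions.
# E2M1 magnitudes, indexed by the low 3 bits of a 4-bit code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # number of elements that share one E8M0 scale


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_block(packed: bytes, scale_e8m0: int) -> list:
    """Dequantize one block of 32 values stored in 16 packed bytes.

    Assumption: the low nibble of each byte holds the earlier element.
    """
    if len(packed) != BLOCK_SIZE // 2:
        raise ValueError("expected 16 packed bytes per MXFP4 block")
    scale = 2.0 ** (scale_e8m0 - 127)  # E8M0 is a biased power-of-two exponent
    values = []
    for byte in packed:
        values.append(decode_fp4(byte & 0x0F) * scale)
        values.append(decode_fp4(byte >> 4) * scale)
    return values
```

Going the other way (BF16 back to MXFP4) additionally requires choosing a scale per block and rounding each value to the nearest representable code, which is why a dedicated script is provided for that step.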
77 changes: 53 additions & 24 deletions examples/puzzletron/README.md
@@ -9,17 +9,22 @@ The supported modifications are:

To use the Puzzle algorithm effectively, we need to specify the target number of parameters and/or the target memory. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the combination of layer modifications that best satisfies the target requirements.
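
The selection problem solved in the final stage can be sketched on a toy scale. The snippet below is a brute-force stand-in, not the MIP solver used by Puzzletron: it picks one candidate per block so that the summed quality loss is minimal while the summed memory stays within the target. The candidate names, memory figures, loss values, and the assumption that per-block losses add up are all hypothetical.

```python
from itertools import product

# Hypothetical candidates per block: (name, memory_mib, quality_loss).
CANDIDATES = [
    [("intermediate_14336", 400, 0.00),
     ("intermediate_11520", 330, 0.02),
     ("intermediate_3072", 100, 0.15)],
    [("intermediate_14336", 400, 0.00),
     ("intermediate_11520", 330, 0.05),
     ("intermediate_3072", 100, 0.10)],
]


def best_architecture(candidates, target_memory_mib):
    """Exhaustively search for the lowest-loss choice that fits the budget.

    Returns (names, total_loss, total_memory) or None if nothing fits.
    """
    best = None
    for choice in product(*candidates):
        memory = sum(c[1] for c in choice)
        loss = sum(c[2] for c in choice)
        if memory > target_memory_mib:
            continue  # violates the memory constraint
        if best is None or loss < best[1]:
            best = ([c[0] for c in choice], loss, memory)
    return best
```

Exhaustive search grows exponentially with the number of blocks, which is the reason a MIP formulation is used for real models.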

In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model reducing GPU memory usage from 113 GiB to 96 GiB (15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.
In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (15% reduction) with less than 1% regression in the token_accuracy_top_10 metric. Other supported models can be compressed in a similar way. For GptOss, there is one [additional step to be performed](GPTOSS.md).

> **Note:** Other models are also supported. See the [configs](./configs/) directory for additional model configurations (e.g., Llama-3.2-3B-Instruct on 1x H100, Qwen2.5-7B-Instruct on 1x H100, Qwen3-8B on 1x H100, Nemotron-Nano-12B-v2 on 1x H100, Mistral-Small-24B-Instruct-2501 on 4x H100). For information on adding support for new models, see the [AnyModel Guide](../../modelopt/torch/puzzletron/anymodel/README.md).

## Environment

- Install Model-Optimizer in editable mode with the corresponding dependencies:
- Install Model-Optimizer in editable mode with the corresponding dependencies (run from the repo root):

```bash
pip install -e .[hf,puzzletron]
pip install -r requirements.txt
pip install -r examples/puzzletron/requirements.txt
```

> **Note:** NeMo containers may ship `nvidia-lm-eval` which may conflict with `lm-eval` that is used for evaluation.
> If so, run `pip uninstall nvidia-lm-eval -y` before installing requirements.

- For this example we are using 2x NVIDIA H100 80GB HBM3 to show multi-GPU steps. You can also use a single GPU.

- To make use of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), you need to accept the terms and conditions for the corresponding model and the dataset in the Huggingface Hub. Log in to the Huggingface Hub and enter your HF token.
@@ -133,7 +138,7 @@ This assumes pruning, replacement library building, NAS scoring, and subblock st
For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.

```bash
torchrun --nproc_per_node 2 examples/puzzletron/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
torchrun --nproc_per_node 2 examples/puzzletron/main.py --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
```

This will generate the following network architecture (see `log.txt`):
@@ -195,18 +200,54 @@ block_13: attention no_op ffn intermediate_11520
block_14: attention no_op ffn intermediate_3072
```

### MIP Sweep Mode

The **MIP sweep mode** lets you explore multiple memory compression rates in a single run and compare the accuracy-memory trade-offs.

#### Quick Start

1. Enable sweep in your config YAML (e.g., `llama-3_1-8B_pruneffn_memory.yaml`):

```yaml
mip:
  sweep:
    enabled: true
    memory_compression_rates: [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    output_csv: ${puzzle_dir}/mip_sweep_results.csv
```

2. Run the sweep:

```bash
torchrun --nproc_per_node 2 examples/puzzletron/main.py --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
```

3. View results: The CSV file contains compression rates, memory usage, and accuracy metrics for each configuration.

#### Example Results

<img src="mip_sweep_example.png" alt="MIP Sweep Results" width="600">

The plot shows how token accuracy changes with different compression rates. Higher compression (0.5 = 50% of original memory) reduces accuracy, while lower compression maintains accuracy closer to the teacher model.
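
As a rough illustration of what the compression rates mean, the sketch below converts each rate into a memory target, reading a rate of 0.5 as 50% of the original memory. The helper name and the teacher memory figure (113_000, following the README's 113 GiB example and its `96_000` convention for 96 GiB) are assumptions; how the sweep derives its targets internally is not shown in this diff.

```python
def sweep_targets(teacher_memory_mib, compression_rates):
    """Map each compression rate to a memory target in MiB (hypothetical helper)."""
    return {rate: round(teacher_memory_mib * rate) for rate in compression_rates}


# Assumed teacher footprint; replace with the value measured for your model.
targets = sweep_targets(113_000, [0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
for rate, target in targets.items():
    print(f"rate={rate:.1f} -> target_memory={target} MiB")
```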

## Evaluation

Once the model is ready, you can evaluate it using [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on [Massive Multitask Language Understanding](https://huggingface.co/datasets/cais/mmlu) benchmark.
Evaluate AnyModel checkpoints using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) directly.

```bash
lm_eval --model hf \
--model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
python examples/puzzletron/evaluation/lm_eval_anymodel.py \
--model hf \
--model_args pretrained=path/to/checkpoint,dtype=bfloat16,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
```

For a quick smoke test, add `--limit 10`.

> **Alternative:** For server-based evaluation via an OpenAI-compatible endpoint,
> see [evaluation/nemo_evaluator_instructions.md](./evaluation/nemo_evaluator_instructions.md).

## Inference Performance Benchmarking

Now let's evaluate how much speedup we get with the compressed model in terms of throughput and latency.
@@ -234,21 +275,9 @@ vllm bench throughput --model path/to/model --input-len 2000 --output-len 100 --

## Knowledge Distillation

To recover degradation in the quality of the compressed model, we can use knowledge distillation. This allows transferring the capabilities of the original model to the pruned one. For this, we will use [NeMo framework](https://github.com/NVIDIA-NeMo/NeMo) with the [nemo:25.07](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.07) container.

First, convert the HF model to NeMo format:
To recover degradation in the quality of the compressed model, we can use knowledge distillation. This allows transferring the capabilities of the original model to the pruned one.

```bash
python -m nemo_export/convert_hf_to_nemo --input-ckpt-path path/to/HF-model --output-ckpt-path path/to/save/model-nemo
```

Now you can utilize all the training features available in NeMo, including distillation. Please refer to the [NeMo distillation documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html).

[Optional] Once distillation is complete, you can convert the distilled model back to the HuggingFace format.

```bash
python -m nemo_export/convert_nemo_to_hf --input-ckpt-path path/to/nemo-model --output-ckpt-path path/to/save/model-HF
```
See [mbridge_distillation/README.md](./mbridge_distillation/README.md) for instructions on using Megatron-Bridge for knowledge distillation.

## Advanced Usage

@@ -0,0 +1,110 @@
defaults:
- pruning: ffn_pruning
- scoring: ../validate_solutions_defaults
- realize_model: ../validate_solutions_defaults
- bypass:
- override hydra/hydra_logging: disabled
- _self_

puzzle_dir: ???
descriptor: gpt_oss
teacher_dir: ${puzzle_dir}/ckpts/teacher/
replacement_library_path: ${puzzle_dir}/replacement_library.json
dataset_path: ??? # path to Nemotron-Post-Training-Dataset-v2

skip_realize_model: false

build_replacement_library:
add_ffn_no_ops: true
add_attention_no_ops: true

calc_subblock_stats:
batch_sizes: [64, 96, 128]
prefill_seq_len: 4096
generation_seq_len: 4096
num_active_tokens_override: # Optional override for sequence lengths
prefill_queue_size: 0
allocate_prefill_query: false
benchmark_iterations: # Set to a number (e.g., 1000) to enable runtime benchmarking
merge_with_existing_stats: false
subblock_stats_filename: "subblock_stats.json"
moe_stats_filename: "moe_stats.json"
runtime_stats:
backend: trt_torch

scoring:
descriptor: ${descriptor}
solutions_to_validate:
skip_existing_solutions: true

replacement_library_path: ${replacement_library_path}
solutions_path: ${to_path:${puzzle_dir}/single_sequence_replacement_solutions.json}
teacher_dir: ${to_path:${teacher_dir}}
output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation

eval_samples: 128
micro_batch_size: 1
seed: 42
shuffle_seed: 444
dataset_path: ${dataset_path}

mip:
single_block_replacement_validation_dir: ${to_path:${scoring.output_dir}}
subblock_stats_path: ${to_path:${puzzle_dir}/${calc_subblock_stats.subblock_stats_filename}}
output_path: ${to_path:${puzzle_dir}/mip/puzzle_solutions}
gathered_metrics_path:
puzzle_profile:

# puzzle_profile:
objective: metrics.cosine_embedding_loss_hidden_states
bigger_is_better: false

subblock_stats_args:
- batch_size: 96
weights_dtype: torch.bfloat16
activations_dtype: torch.bfloat16
kv_cache_dtype: torch.bfloat16

report_additional_costs:
- stats.memory_mib
- stats.num_params
- stats.num_kv_heads
- stats.has_attention
- stats.has_ffn
- stats.kv_cache_memory_mib
- stats.attention_memory_mib
- stats.ffn_memory_mib
- stats.ffn_num_params
- stats.attention_num_params

human_constraints:
target_memory: 45_000
num_params: 3_000_000_000

mip_constraints:
metric_overrides:
max_seconds_per_solution: 60

realize_model:
descriptor: ${descriptor}
teacher_dir: ${to_path:${teacher_dir}}
tokenizer_name: ${to_path:${teacher_dir}}
replacement_library_path: ${replacement_library_path}
save_models: true
solutions_path: # Filled dynamically

# Validate params
skip_validation: false # To enable validation of the model solution set `skip_validation` as False
eval_samples: 128
micro_batch_size: 1
seed: 42
shuffle_seed: 444
dataset_path: ${dataset_path}

nccl_timeout_minutes: ${timedelta_minutes:10}

# This section redirects Hydra outputs
hydra:
run:
dir: ${puzzle_dir}/hydra_logs/${now:%Y-%m-%d}/${now:%H-%M-%S}
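
For orientation on the `calc_subblock_stats` and `subblock_stats_args` settings in the config above (batch size, sequence lengths, `kv_cache_dtype`), the following sketch estimates the KV-cache memory of a single attention block using the standard formula of two tensors (keys and values) per token. The function name, the choice of prefill plus generation length as the sequence length, and the head counts are assumptions; Puzzletron's own accounting for `stats.kv_cache_memory_mib` may differ.

```python
def kv_cache_memory_mib(batch_size, seq_len, num_kv_heads, head_dim,
                        bytes_per_element=2):
    """Estimate the KV-cache size of one attention block in MiB.

    bytes_per_element=2 corresponds to bfloat16.
    """
    # Factor 2 accounts for storing both keys and values.
    total_bytes = (2 * batch_size * seq_len * num_kv_heads * head_dim
                   * bytes_per_element)
    return total_bytes / 2**20


# Hypothetical block: 8 KV heads of dimension 128, batch 96,
# 4096 prefill tokens plus 4096 generated tokens.
print(kv_cache_memory_mib(96, 4096 + 4096, 8, 128))  # 3072.0
```

Replacing attention with a `no_op` in a block removes this cost entirely, which is one reason attention removal matters under a memory constraint.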

@@ -0,0 +1,17 @@
defaults:
- gptoss-20b
- _self_

# Input Hugging Face model to compress
input_hf_model_path: /workspace/hf_models/openai/gpt-oss-20b

# Dataset path for pruning and NAS scoring
dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

# Working directory for compression outputs
puzzle_dir: /workspace/puzzle_dir

# MIP memory constraint (in MiB)
mip:
  human_constraints:
    target_memory: 16_000 # 16_000 MiB (about 15.6 GiB)
@@ -0,0 +1,21 @@
defaults:
- pruning_defaults

eval_samples: 2500 #10
activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/expert_removal/${pruning.experiment_id}

pruning_mixin:
_target_: modelopt.torch.puzzletron.pruning.expert_removal_pruning_mixin.ExpertRemovalPruningMixIn
layer_descriptor:
_target_: modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_model_descriptor.GptOssExpertRemovalLayerDescriptor
target_name: "mlp.router"

hook_class: ${get_object:modelopt.torch.nas.plugins.megatron_hooks.base_hooks.RankedChoiceVotingHook}
activation_hooks_kwargs: # Additional kwargs to pass to the hook init

num_experts_to_keep_list: [24, 16, 8] # gpt-oss-20b has 32 experts per layer (gpt-oss-120b has 128)
mlp_init_mode: "ExpertRemoval"
mlp_init_config_yaml:
expert_scores_key: "expert_ranks"
layer_prefix_template: "model.layers.{layer_idx}.mlp.router"
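
The effect of `num_experts_to_keep_list` can be illustrated with a small selection routine. This is a plain top-k sketch, not the `RankedChoiceVotingHook` referenced in the config: it assumes one score per expert where higher is better, and it ranks NaN scores last so that they are never chosen as "best" (the failure mode mentioned in the commit list). The function name and scores are hypothetical.

```python
import math


def experts_to_keep(scores, num_to_keep):
    """Return the indices of the top-scoring experts, in ascending index order.

    NaN scores sort after every finite score, so they are dropped first.
    """
    ranked = sorted(
        range(len(scores)),
        key=lambda i: (
            math.isnan(scores[i]),  # False (finite) sorts before True (NaN)
            0.0 if math.isnan(scores[i]) else -scores[i],  # high score first
        ),
    )
    return sorted(ranked[:num_to_keep])


# Hypothetical scores for a 4-expert layer; expert 1 produced a NaN.
print(experts_to_keep([0.1, float("nan"), 0.7, 0.4], 2))  # [2, 3]
```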

@@ -0,0 +1,34 @@
defaults:
- /validate_model_defaults

model_name_or_path: ${teacher_dir}
experiment_id: ${pruning.eval_samples}samples_diverse_mini
activations_log_dir: ???
activation_hooks_kwargs: ???

descriptor: ${descriptor}

# Data:
eval_samples: 10_000
micro_batch_size: 1
dataset_path: ${dataset_path}
val_dataset_name: train

# Prune ckpts
pruned_ckpts_output_dir: ${puzzle_dir}/pruning/${pruning.experiment_id}

## FFN pruning
ffn_list:
mlp_init_mode: "Truncate" # PruneByActivationsLog

## KV-heads pruning
n_heads_in_group_list:
gqa_init_mode: "AverageKV"

## Hidden dimension pruning
hidden_size_list:
hidden_size_init_mode: "PruneByChannelRanking"
linear_init_mode: "FromTeacher"

mlp_init_config_yaml:
activations_log_dir: ${pruning.activations_log_dir}
@@ -0,0 +1,18 @@
model_dtype: torch.bfloat16 # dtype to cast the model for validate_model
autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
block_size: 8192
bos_rate: 0.5
data_column: messages
val_dataset_name: valid
shuffle_seed: 81436
seed: 42
fim_rate: 0
fim_spm_rate: 0
source_datasets_to_discard:
varlen: false
write_results: false
calc_losses_on_cpu: false
activations_log_dir:
model_name_or_path:
load_dataset_fn: ${get_object:modelopt.torch.puzzletron.utils.data.dataloaders.load_from_disk_fn}

@@ -0,0 +1,11 @@
defaults:
- /validate_model_defaults
- _self_

solutions_to_validate:
skip_validation: false
save_models: false
bigger_is_better: false
sort_solutions_by:
calculate_full_score_ablations: false

@@ -7,6 +7,7 @@ defaults:
- _self_

puzzle_dir: ???
descriptor: llama
teacher_dir: ${puzzle_dir}/ckpts/teacher/
replacement_library_path: ${puzzle_dir}/replacement_library.json
dataset_path: ??? # path to Nemotron-Post-Training-Dataset-v2
@@ -32,6 +33,7 @@ calc_subblock_stats:
backend: trt_torch

scoring:
descriptor: ${descriptor}
solutions_to_validate:
skip_existing_solutions: true

@@ -84,6 +86,7 @@ mip:
max_seconds_per_solution: 60

realize_model:
descriptor: ${descriptor}
teacher_dir: ${to_path:${teacher_dir}}
tokenizer_name: ${to_path:${teacher_dir}}
replacement_library_path: ${replacement_library_path}
@@ -15,6 +15,11 @@ puzzle_dir: /workspace/puzzle_dir
mip:
  human_constraints:
    target_memory: 78_000 # 78 GiB
  # Memory sweep configuration (optional)
  sweep:
    enabled: false
    memory_compression_rates: [0.5, 0.6, 0.7, 0.8, 0.9]
    output_csv: ${puzzle_dir}/mip_sweep_results.csv

# FFN intermediate sizes to search over (heterogeneous architecture)
pruning: