Merged
Changes from 79 commits
82 commits
e82164f
Add anymodel directories to feature/puzzletron
danielkorzekwa Mar 4, 2026
2099df3
Make any_model conversion working.
danielkorzekwa Mar 5, 2026
eb5cf8a
Update child_init.py with anymodel version
danielkorzekwa Mar 5, 2026
c9de41c
fix attention pruning
danielkorzekwa Mar 5, 2026
3c1bc1f
Add trust_remote_code to load_model_config (default to false)
danielkorzekwa Mar 5, 2026
8357136
Make activation scoring working
danielkorzekwa Mar 5, 2026
6cc2194
Comment all tested models aside of llama_3_1_8b_instruct
danielkorzekwa Mar 5, 2026
ee4e1e3
Delete not needed decilm test
danielkorzekwa Mar 5, 2026
449b523
Fix broken tests
danielkorzekwa Mar 5, 2026
fb27bba
Update puzzletron_nas_pluging to any_model version
danielkorzekwa Mar 5, 2026
b350f82
Correct test resources used by tests.
danielkorzekwa Mar 5, 2026
fafe5a3
Disable puzzletron tests (will be enabled after all any_model logic i…
danielkorzekwa Mar 5, 2026
e988248
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
c717852
Comment out not implemented models.
danielkorzekwa Mar 6, 2026
030f126
format python docs
danielkorzekwa Mar 6, 2026
8dcdfbf
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
70df0df
Use trust_remote_code in force_cache_dynamic_modules()
danielkorzekwa Mar 6, 2026
bb56662
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
ecd953e
Fix anymodel pruning
danielkorzekwa Mar 6, 2026
ee8f538
Fix buid docs issue.
danielkorzekwa Mar 6, 2026
c9b76a1
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
6e3af61
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 6, 2026
0ad6d92
Merging build_library_and_stats
danielkorzekwa Mar 6, 2026
995eb1a
Merging anymodel: calc_one_block_scores
danielkorzekwa Mar 6, 2026
34081c9
Mering any_model: calc_one_block_scores
danielkorzekwa Mar 6, 2026
ed5c00f
merge any_model: mip_and_realize_models
danielkorzekwa Mar 6, 2026
993b5ec
Add all anymodel models but gptoss
danielkorzekwa Mar 6, 2026
6e9f03b
Make nemotron-nano-12b-v2 to work (set trust_remote_code=true)
danielkorzekwa Mar 9, 2026
e8b7a7d
merge anymodel for nemotron-3-nano-30b-a3b-base-bf16
danielkorzekwa Mar 9, 2026
47414d5
Clarify readme and avoid reusing the same reference in llama_converter.
danielkorzekwa Mar 9, 2026
a8305d8
Fix tied-embedding handling before writing the safetensors index.
danielkorzekwa Mar 9, 2026
68421a5
Fix NaN ranking currently selects NaNs as “best” experts by default.
danielkorzekwa Mar 9, 2026
d6b8028
Code clean up.
danielkorzekwa Mar 9, 2026
ecd2341
Code clean up.
danielkorzekwa Mar 10, 2026
f9d845d
code clean up
danielkorzekwa Mar 10, 2026
d171b01
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 10, 2026
722da90
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
934ab2f
code clean up
danielkorzekwa Mar 10, 2026
0f14ec3
Merge branch 'dkorzekwa/anymodel_pruning' into dkorzekwa/anymodel_bui…
danielkorzekwa Mar 10, 2026
dcb9e02
remove not needed comment
danielkorzekwa Mar 10, 2026
0c9ea5d
Merge branch 'dkorzekwa/anymodel_build_library_and_stats' into dkorze…
danielkorzekwa Mar 10, 2026
5b310e2
Merge branch 'dkorzekwa/any_model_calc_one_block_scores' into dkorzek…
danielkorzekwa Mar 10, 2026
4f82b1c
Merge branch 'dkorzekwa/mip_and_realize_models' into dkorzekwa/any_mo…
danielkorzekwa Mar 10, 2026
176a435
Fix a broken test_puzzletron test on 2 gpus.
danielkorzekwa Mar 10, 2026
02e2c9b
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
92c4419
Merge branch 'dkorzekwa/anymodel_pruning' into dkorzekwa/anymodel_bui…
danielkorzekwa Mar 10, 2026
aa1eb3e
Merge branch 'dkorzekwa/anymodel_build_library_and_stats' into dkorze…
danielkorzekwa Mar 10, 2026
2b84a96
Merge branch 'dkorzekwa/any_model_calc_one_block_scores' into dkorzek…
danielkorzekwa Mar 10, 2026
fb838c0
Merge branch 'dkorzekwa/mip_and_realize_models' into dkorzekwa/any_mo…
danielkorzekwa Mar 10, 2026
13378ff
Add gpt-oss model
danielkorzekwa Mar 11, 2026
47ca0e3
Add comments about a broken test
danielkorzekwa Mar 11, 2026
96112f7
Fix a broken gptoss test
danielkorzekwa Mar 12, 2026
cb6b182
Add mamba to puzzletron dependencies.
danielkorzekwa Mar 12, 2026
670bb34
Update mamba-ssm and casual-conv1d dependences (remove pinpoint versi…
danielkorzekwa Mar 13, 2026
0e1b591
Install mamba-ssm and causal-conv1d in testenv:cuda13-gpu-puzzletron
danielkorzekwa Mar 13, 2026
ca845ec
Fix installing dependencies in testenv:cuda13-gpu-puzzletron
danielkorzekwa Mar 13, 2026
be825bc
Fix anymodel for qwen3 8B in 2 gpus
danielkorzekwa Mar 13, 2026
7fd1afa
Fix pipeline parallelism issue for wen3-vl-30b-a3b-instruct-qwen3_vl-…
danielkorzekwa Mar 13, 2026
7d7b609
Fix multi-gpu issue for nemotron-nano-12b-v2
danielkorzekwa Mar 13, 2026
249af9d
Fix no_op in any_model
danielkorzekwa Mar 13, 2026
b80583c
Merge branch 'feature/puzzletron' into dkorzekwa/any_model_other_models
danielkorzekwa Mar 13, 2026
1dd742e
Fix nemotron_h_model_descriptor.
danielkorzekwa Mar 14, 2026
4a6ebbe
Fix tox -e build-docs
danielkorzekwa Mar 14, 2026
585f0ed
pin mamba/casual-conv1d versions to fix failing assertion for test_pu…
danielkorzekwa Mar 14, 2026
7fb5d9a
Fix for installing mamba-ssm
danielkorzekwa Mar 14, 2026
75d3d69
Fix broken test for nemotron-3-nano-30b-a3b-base-bf16
danielkorzekwa Mar 14, 2026
0e5722d
code clean up
danielkorzekwa Mar 14, 2026
2dd9735
Make test_puzzletron test deterministic
danielkorzekwa Mar 15, 2026
3561de5
Comment out all models but nemotron-3-nano-30b-a3b-base-bf16 to check…
danielkorzekwa Mar 15, 2026
27866de
Implement Qwen3VLRemoveExpertsIndependentHook
danielkorzekwa Mar 15, 2026
a012fe6
Remove not needed nvidia licence header
danielkorzekwa Mar 16, 2026
52922a4
# Initialize weights to ensure all parameters are properly initialized
danielkorzekwa Mar 16, 2026
c234fb4
Fix non-deterministic test_puzzletron test
danielkorzekwa Mar 16, 2026
53dcd10
Fix for unsetting CUDA_VISIBLE_DEVICES
danielkorzekwa Mar 16, 2026
69d9648
increase numeric tolerance for test_puzzletron.py
danielkorzekwa Mar 17, 2026
4a692dc
Disable lm_loss assertion for nemotron-3-nano-30b-a3b-base-bf16 (not …
danielkorzekwa Mar 17, 2026
e795f0c
Removing incorrect licence file. gpt_oss_pruned_to_mxfp4.py was not a…
danielkorzekwa Mar 17, 2026
631306c
Fix hardcoded trust_remote_code
danielkorzekwa Mar 17, 2026
dc77be2
Merge branch 'dkorzekwa/any_model_other_models' into dkorzekwa/anymod…
danielkorzekwa Mar 17, 2026
5cadc65
Merge branch 'feature/puzzletron' into dkorzekwa/anymodel_gptoss
danielkorzekwa Mar 17, 2026
151081c
Delete not needed yaml files for test_puzzletron.
danielkorzekwa Mar 17, 2026
36daa6d
Delete not needed mypy exclusion for removed hf_configs files.
danielkorzekwa Mar 17, 2026
4 changes: 3 additions & 1 deletion .pre-commit-config.yaml
@@ -44,6 +44,8 @@ repos:
     rev: v1.17.1
     hooks:
       - id: mypy
+        # Exclude HF config directories to avoid duplicate module errors (e.g., configuration_nemotron_h.py exists in multiple model configs)
+        exclude: "tests/gpu/torch/puzzletron/resources/hf_configs/"

   - repo: https://github.com/pre-commit/mirrors-clang-format
     rev: v21.1.0
@@ -95,6 +97,7 @@ repos:
             modelopt/torch/speculative/eagle/utils.py|
             modelopt/torch/speculative/plugins/transformers.py|
             modelopt/torch/utils/plugins/megatron_mmlu.py|
+            modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
             examples/chained_optimizations/bert_prune_distill_quantize.py|
             examples/deepseek/quantize_to_nvfp4.py|
             examples/deepseek/ptq.py|
@@ -113,7 +116,6 @@ repos:
             examples/speculative_decoding/server_generate.py|
             experimental/dms/models/qwen3/configuration_qwen3_dms.py|
             experimental/dms/models/qwen3/modeling_qwen3_dms.py|
-            modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
           )$

       # Default hook for Apache 2.0 in c/c++/cuda files
68 changes: 23 additions & 45 deletions modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py
@@ -1142,61 +1142,39 @@ def __call__(


 class Qwen3VLRemoveExpertsIndependentHook(RemoveExpertsIndependentHook):
-    """Expert removal importance hook for Qwen3-VL models.
-
-    TODO: Implement get_router_logits_and_routed_experts based on Qwen3-VL MoE forward pass.
-    """
+    """Expert removal importance hook for Qwen3-VL models."""

     def get_router_logits_and_routed_experts(
         self, hidden_states: torch.Tensor, router_logits: torch.Tensor | None = None
     ) -> tuple[torch.Tensor, torch.Tensor]:
         """Extract router logits and expert outputs for Qwen3-VL MoE.

-        Note: This is a placeholder implementation. Implement based on Qwen3VLMoeSparseMoe forward.
+        Based on Qwen3VLMoeSparseMoe forward pass.
         """
-        batch_size = (
-            hidden_states.shape[0] * hidden_states.shape[1]
-            if hidden_states.ndim > 2
-            else hidden_states.shape[0]
-        )
-        router_logits_out = torch.zeros(
-            batch_size, self.num_local_experts, device=hidden_states.device
-        )
-        routed_experts = hidden_states.view(-1, hidden_states.shape[-1])
-        return router_logits_out, routed_experts
+        orig_shape = hidden_states.shape

+        # Flatten to (num_tokens, hidden_size) for processing
+        hidden_states_flat = hidden_states.reshape(-1, self.moe.hidden_size)

-class GptOssRemoveExpertsIndependentHook(RemoveExpertsIndependentHook):
-    """Expert removal importance hook for GPT-OSS models.
+        if router_logits is None:
+            router_logits = self.moe.gate(hidden_states_flat)

+        routing_weights = torch.nn.functional.softmax(router_logits, dim=-1, dtype=torch.float)
+        routing_weights, router_indices = torch.topk(routing_weights, self.moe.top_k, dim=-1)
+        routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
+        routing_weights = routing_weights.to(hidden_states_flat.dtype)
+        router_weights = torch.zeros_like(router_logits).scatter_(
+            1, router_indices, routing_weights
+        )
+
-    TODO: Implement get_router_logits_and_routed_experts based on GPT-OSS MoE forward pass.
-    This is a placeholder implementation that allows the framework to run.
-    """
+        # Reshape hidden_states for moe.experts (expects 3D: batch, seq, hidden)
+        # router_weights and router_indices remain 2D (num_tokens, num_experts)
+        batch_size = orig_shape[0] if hidden_states.ndim == 3 else 1
+        hidden_states_3d = hidden_states_flat.reshape(batch_size, -1, self.moe.hidden_size)

-    def get_router_logits_and_routed_experts(
-        self, hidden_states: torch.Tensor, router_logits: torch.Tensor | None = None
-    ) -> tuple[torch.Tensor, torch.Tensor]:
-        """Extract router logits and expert outputs for GPT-OSS MoE.
+        routed_out = self.moe.experts(hidden_states_3d, router_weights, router_indices)

-        Note: This is a placeholder implementation. For proper expert scoring,
-        implement based on GptOssSparseMoeBlock forward pass.
+        # Return in same shape as input
+        routed_out = routed_out.reshape(*orig_shape)

-        Args:
-            hidden_states: Input tensor of shape (batch, seq_len, hidden_dim)
-            router_logits: Optional pre-computed router logits
-
-        Returns:
-            tuple of (router_logits, routed_experts):
-                - router_logits: Shape (num_tokens, num_local_experts) - zeros as placeholder
-                - routed_experts: Original hidden states (no-op)
-        """
-        batch_size = (
-            hidden_states.shape[0] * hidden_states.shape[1]
-            if hidden_states.ndim > 2
-            else hidden_states.shape[0]
-        )
-        router_logits_out = torch.zeros(
-            batch_size, self.num_local_experts, device=hidden_states.device
-        )
-        routed_experts = hidden_states.view(-1, hidden_states.shape[-1])
-        return router_logits_out, routed_experts
+        return router_logits, routed_out
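
Reviewer note: the routing step in the new Qwen3-VL hook (softmax over all experts, keep the top-k, renormalize, scatter back into a dense per-expert vector) can be sketched for a single token in plain Python. This is only an illustration of that arithmetic, not the hook itself; the function name `topk_routing_weights` and the toy logits are assumptions made up for the example.

```python
import math


def topk_routing_weights(logits, top_k):
    """Return (dense_weights, indices) for one token's router logits."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k most probable experts.
    indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize the kept probabilities so they sum to 1.
    kept = sum(probs[i] for i in indices)

    # Scatter back into a dense vector; unselected experts get weight 0.
    dense = [0.0] * len(logits)
    for i in indices:
        dense[i] = probs[i] / kept
    return dense, indices


# Toy logits for 4 experts, routing each token to 2 of them.
weights, chosen = topk_routing_weights([2.0, 0.5, 1.0, -1.0], top_k=2)
```

After renormalization the selected weights depend only on the logit gap between the chosen experts, which is why the dense vector has exactly `top_k` non-zero entries summing to 1.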
@@ -19,8 +19,11 @@

 from typing import Type

+import torch
+
 from modelopt.torch.nas.plugins.megatron_hooks.base_hooks import ForwardHook as ActivationsHook
 from modelopt.torch.puzzletron.tools.logger import aprint
+from modelopt.torch.puzzletron.utils.dummy_modules import DummyBlock, DummyModule


 def register_activation_hooks(
@@ -51,6 +54,16 @@ def register_activation_hooks(
     module_names_to_hook = pruning_mixin.get_module_names_to_hook(model)
     activation_hooks = dict()
     for block_idx, module_name in module_names_to_hook:
+        try:
+            module = model.get_submodule(module_name)
+        except AttributeError:
+            # Module doesn't exist on this rank's shard (e.g., in distributed setup)
+            continue
+
+        # Skip dummy modules - they don't have real activations to hook
+        if isinstance(module, (DummyModule, DummyBlock)):
+            continue
+
         block_config = None
         if block_idx is not None:
             block_config = model.config.block_configs[block_idx]
@@ -59,13 +72,25 @@ def register_activation_hooks(
             "block_config": block_config,
         }

-        module = model.get_submodule(module_name)
         hook = hook_class(module, curr_activation_hooks_kwargs)
         module.register_forward_hook(hook)
         activation_hooks[module_name] = hook

     if len(activation_hooks) == 0:
-        raise ValueError("couldn't find any hooks")
+        # In distributed mode, it's okay for a rank to have 0 hooks if it doesn't own
+        # the target modules (e.g., with hybrid patterns like "*-" where different
+        # ranks own different layer types). However, we still want to catch real bugs
+        # where no hooks are found at all.
+        is_distributed = torch.distributed.is_available() and torch.distributed.is_initialized()
+        if is_distributed:
+            aprint(
+                "No hooks registered on this rank. This is expected if this rank "
+                "doesn't own any layers matching the hook pattern (e.g., in hybrid "
+                "patterns with distributed model sharding)."
+            )
+        else:
+            raise ValueError("couldn't find any hooks")

-    aprint(f"Found the following hooks: {activation_hooks.keys()}")
+    if len(activation_hooks) > 0:
+        aprint(f"Found the following hooks: {activation_hooks.keys()}")
     return activation_hooks
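
Reviewer note: the zero-hook policy introduced in this hunk (log and continue when running distributed, raise otherwise) can be isolated as a small function. The sketch below is a stand-alone illustration under assumptions: `check_registered_hooks` and its `log` parameter are hypothetical names, and the `is_distributed` flag stands in for the `torch.distributed` check used in the diff.

```python
def check_registered_hooks(activation_hooks, is_distributed, log=print):
    """Mirror the zero-hook policy: report when sharded, fail otherwise."""
    if len(activation_hooks) > 0:
        log(f"Found the following hooks: {list(activation_hooks)}")
        return True
    if is_distributed:
        # Another rank may own every target module, so an empty dict can be legitimate.
        log("No hooks registered on this rank.")
        return False
    # Single-process run: nothing was hooked anywhere, which points to a real bug.
    raise ValueError("couldn't find any hooks")
```

As far as this hunk shows, the distributed branch only logs. Detecting the case where every rank ends up with zero hooks would seem to need a cross-rank reduction, which is not part of the diff.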
8 changes: 6 additions & 2 deletions modelopt/torch/puzzletron/anymodel/converter/converter.py
@@ -135,9 +135,10 @@ def convert_configs_in_dirs(
         cls,
         input_dir: Path,
         output_dir: Path,
+        trust_remote_code: bool = False,
     ):
         """Convert config and add block_configs."""
-        config = load_model_config(input_dir)
+        config = load_model_config(input_dir, trust_remote_code=trust_remote_code)

         block_configs = cls.create_block_configs_from_main_config(config)
         out_config = copy.deepcopy(config)
@@ -179,7 +180,10 @@ def convert(
             output_dir: Path to the output AnyModel checkpoint.
         """
         cls.copy_checkpoint_files(input_dir, output_dir)
-        config = cls.convert_configs_in_dirs(input_dir, output_dir)
+        trust_remote_code = descriptor.requires_trust_remote_code()
+        config = cls.convert_configs_in_dirs(
+            input_dir, output_dir, trust_remote_code=trust_remote_code
+        )
         cls.convert_model_weights(
             input_dir, output_dir, descriptor=descriptor, num_hidden_layers=config.num_hidden_layers
         )
@@ -53,6 +53,18 @@ def block_config_to_layer_overrides(block_config: BlockConfig) -> Dict[str, Any]
         """
         raise NotImplementedError

+    @staticmethod
+    def requires_trust_remote_code() -> bool:
+        """Whether this model descriptor requires trust_remote_code=True for loading.
+
+        Models that use custom code (e.g., via auto_map in config) should override
+        this to return True.
+
+        Returns:
+            True if trust_remote_code=True is required, False otherwise.
+        """
+        return False
+
     @staticmethod
     def mlp_no_op_post_init(decoder_layer: nn.Module):
         """Post-init callback to alter a decoder layer so that FFN/mlp subblock performs as no-op.
Expand Down
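
Reviewer note: the way `requires_trust_remote_code()` is threaded from the descriptor into config loading can be shown with stand-in classes. This is a minimal sketch, not the library code: `BaseDescriptor`, `CustomCodeDescriptor`, the dict-returning `load_model_config` stub, and `convert_config` are all hypothetical names used only to illustrate the opt-in pattern.

```python
class BaseDescriptor:
    """Stand-in for a model descriptor; remote code is off by default."""

    @staticmethod
    def requires_trust_remote_code() -> bool:
        return False


class CustomCodeDescriptor(BaseDescriptor):
    """Hypothetical descriptor for a model that ships custom modeling code."""

    @staticmethod
    def requires_trust_remote_code() -> bool:
        return True


def load_model_config(path, trust_remote_code=False):
    # Stub loader: records the flag instead of reading a real checkpoint.
    return {"path": path, "trust_remote_code": trust_remote_code}


def convert_config(path, descriptor):
    # The converter asks the descriptor rather than hardcoding the flag.
    return load_model_config(path, trust_remote_code=descriptor.requires_trust_remote_code())


default_cfg = convert_config("ckpt/llama", BaseDescriptor)
custom_cfg = convert_config("ckpt/custom", CustomCodeDescriptor)
```

Keeping the default at `False` means only descriptors that explicitly override the method cause remote code to be executed during conversion.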
14 changes: 7 additions & 7 deletions modelopt/torch/puzzletron/anymodel/models/__init__.py
@@ -14,11 +14,11 @@
 # limitations under the License.

 # Import models to trigger factory registration
-# from modelopt.torch.puzzletron.anymodel.models.gpt_oss_20b import *
+from modelopt.torch.puzzletron.anymodel.models.gpt_oss import *
 from modelopt.torch.puzzletron.anymodel.models.llama import *
-# from modelopt.torch.puzzletron.anymodel.models.mistral_small import *
-# from modelopt.torch.puzzletron.anymodel.models.nemotron_h import *
-# from modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2 import *
-# from modelopt.torch.puzzletron.anymodel.models.qwen2 import *
-# from modelopt.torch.puzzletron.anymodel.models.qwen3_8b import *
-# from modelopt.torch.puzzletron.anymodel.models.qwen3_vl_30b_a3b_instruct import *
+from modelopt.torch.puzzletron.anymodel.models.mistral_small import *
+from modelopt.torch.puzzletron.anymodel.models.nemotron_h import *
+from modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2 import *
+from modelopt.torch.puzzletron.anymodel.models.qwen2 import *
+from modelopt.torch.puzzletron.anymodel.models.qwen3_8b import *
+from modelopt.torch.puzzletron.anymodel.models.qwen3_vl_30b_a3b_instruct import *
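
Reviewer note: these imports matter because registration happens as a side effect of importing each model package. A minimal sketch of a decorator-based registry is below; `ConverterRegistry` and its methods are hypothetical stand-ins and are not claimed to match the real `ConverterFactory` implementation.

```python
class ConverterRegistry:
    """Hypothetical registry: maps a model key to a converter class."""

    _registry = {}

    @classmethod
    def register_decorator(cls, name):
        def wrap(converter_cls):
            if name in cls._registry:
                raise ValueError(f"converter {name!r} is already registered")
            # Registration runs when the decorated class body is executed,
            # i.e. when the defining module is first imported.
            cls._registry[name] = converter_cls
            return converter_cls

        return wrap

    @classmethod
    def get(cls, name):
        return cls._registry[name]


@ConverterRegistry.register_decorator("toy_model")
class ToyConverter:
    pass
```

With this pattern, a model whose import line is commented out is never registered, so looking it up by key fails until the import is enabled.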
22 changes: 22 additions & 0 deletions modelopt/torch/puzzletron/anymodel/models/gpt_oss/__init__.py
@@ -0,0 +1,22 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""GPT-OSS model support for AnyModel."""
+
+from .gpt_oss_converter import GptOssConverter
+from .gpt_oss_model_descriptor import GptOssModelDescriptor
@@ -0,0 +1,74 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# mypy: ignore-errors
+
+"""GPT-OSS-20B converter for AnyModel compression."""
+
+from typing import List
+
+from transformers import PretrainedConfig
+
+from modelopt.torch.puzzletron.anymodel.converter import Converter, ConverterFactory
+from modelopt.torch.puzzletron.decilm.deci_lm_hf_code.block_config import (
+    AttentionConfig,
+    BlockConfig,
+    FFNConfig,
+    MoEConfig,
+)
+
+
+@ConverterFactory.register_decorator("gpt_oss")
+class GptOssConverter(Converter):
+    """Converter for GPT-OSS models to AnyModel format.
+
+    GPT-OSS is a pure MoE model with 32/128 experts per layer and 4/16 active experts.
+    All layers use MoE FFN (no standard dense FFN layers).
+    """
+
+    quantized = "mxfp4"
+
+    @staticmethod
+    def create_block_configs_from_main_config(config: PretrainedConfig) -> List[BlockConfig]:
+        """Create block configs for GPT-OSS layers.
+
+        GPT-OSS uses MoE for all FFN layers with:
+        - 32/128 local experts (num_local_experts)
+        - 4/16 active experts per token (experts_per_token)
+        - No dense/standard FFN layers
+        """
+        num_hidden_layers = config.num_hidden_layers
+        num_local_experts = config.num_local_experts
+        experts_per_token = config.experts_per_token
+        intermediate_size = config.intermediate_size
+
+        block_configs = []
+        for layer_idx in range(num_hidden_layers):
+            block_config = BlockConfig(
+                attention=AttentionConfig(
+                    no_op=False, num_key_value_heads=config.num_key_value_heads
+                ),
+                ffn=FFNConfig(
+                    no_op=False,
+                    intermediate_size=None,  # MoE doesn't use this field
+                    moe=MoEConfig(
+                        num_local_experts=num_local_experts,
+                        num_experts_per_tok=experts_per_token,
+                        expert_intermediate_dim=intermediate_size,
+                    ),
+                ),
+            ).to_dict()
+            block_configs.append(block_config)
+
+        return block_configs
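
Reviewer note: the per-layer block-config construction above can be reproduced with plain dataclasses to show the shape of the resulting dictionaries. This is a sketch under assumptions: the `Toy*` dataclasses are stand-ins for the real `BlockConfig` types, `asdict` replaces `to_dict()`, and the config values are toy numbers rather than a real checkpoint.

```python
from dataclasses import asdict, dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class ToyMoE:
    num_local_experts: int
    num_experts_per_tok: int
    expert_intermediate_dim: int


@dataclass
class ToyFFN:
    no_op: bool
    intermediate_size: Optional[int]
    moe: Optional[ToyMoE]


@dataclass
class ToyAttention:
    no_op: bool
    num_key_value_heads: int


@dataclass
class ToyBlock:
    attention: ToyAttention
    ffn: ToyFFN


def make_block_configs(config):
    """Build one identical MoE block config (as a dict) per hidden layer."""
    return [
        asdict(
            ToyBlock(
                attention=ToyAttention(False, config.num_key_value_heads),
                # Dense intermediate_size stays None; the MoE entry carries the width.
                ffn=ToyFFN(
                    no_op=False,
                    intermediate_size=None,
                    moe=ToyMoE(
                        config.num_local_experts,
                        config.experts_per_token,
                        config.intermediate_size,
                    ),
                ),
            )
        )
        for _ in range(config.num_hidden_layers)
    ]


# Toy values chosen for illustration only.
toy_config = SimpleNamespace(
    num_hidden_layers=2,
    num_local_experts=32,
    experts_per_token=4,
    intermediate_size=64,
    num_key_value_heads=8,
)
block_configs = make_block_configs(toy_config)
```

Each layer gets its own dictionary, so later per-layer pruning can edit one block without affecting the others.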