Merged
98 commits
e82164f
Add anymodel directories to feature/puzzletron
danielkorzekwa Mar 4, 2026
2099df3
Make any_model conversion working.
danielkorzekwa Mar 5, 2026
eb5cf8a
Update child_init.py with anymodel version
danielkorzekwa Mar 5, 2026
c9de41c
fix attention pruning
danielkorzekwa Mar 5, 2026
3c1bc1f
Add trust_remote_code to load_model_config (default to false)
danielkorzekwa Mar 5, 2026
8357136
Make activation scoring working
danielkorzekwa Mar 5, 2026
6cc2194
Comment all tested models aside of llama_3_1_8b_instruct
danielkorzekwa Mar 5, 2026
ee4e1e3
Delete not needed decilm test
danielkorzekwa Mar 5, 2026
449b523
Fix broken tests
danielkorzekwa Mar 5, 2026
fb27bba
Update puzzletron_nas_plugin to any_model version
danielkorzekwa Mar 5, 2026
b350f82
Correct test resources used by tests.
danielkorzekwa Mar 5, 2026
fafe5a3
Disable puzzletron tests (will be enabled after all any_model logic i…
danielkorzekwa Mar 5, 2026
e988248
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
c717852
Comment out not implemented models.
danielkorzekwa Mar 6, 2026
030f126
format python docs
danielkorzekwa Mar 6, 2026
8dcdfbf
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
70df0df
Use trust_remote_code in force_cache_dynamic_modules()
danielkorzekwa Mar 6, 2026
bb56662
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
ecd953e
Fix anymodel pruning
danielkorzekwa Mar 6, 2026
ee8f538
Fix build docs issue.
danielkorzekwa Mar 6, 2026
c9b76a1
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
6e3af61
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 6, 2026
0ad6d92
Merging build_library_and_stats
danielkorzekwa Mar 6, 2026
995eb1a
Merging anymodel: calc_one_block_scores
danielkorzekwa Mar 6, 2026
34081c9
Merging any_model: calc_one_block_scores
danielkorzekwa Mar 6, 2026
ed5c00f
merge any_model: mip_and_realize_models
danielkorzekwa Mar 6, 2026
993b5ec
Add all anymodel models but gptoss
danielkorzekwa Mar 6, 2026
6e9f03b
Make nemotron-nano-12b-v2 to work (set trust_remote_code=true)
danielkorzekwa Mar 9, 2026
e8b7a7d
merge anymodel for nemotron-3-nano-30b-a3b-base-bf16
danielkorzekwa Mar 9, 2026
47414d5
Clarify readme and avoid reusing the same reference in llama_converter.
danielkorzekwa Mar 9, 2026
a8305d8
Fix tied-embedding handling before writing the safetensors index.
danielkorzekwa Mar 9, 2026
68421a5
Fix NaN ranking currently selects NaNs as “best” experts by default.
danielkorzekwa Mar 9, 2026
d6b8028
Code clean up.
danielkorzekwa Mar 9, 2026
ecd2341
Code clean up.
danielkorzekwa Mar 10, 2026
f9d845d
code clean up
danielkorzekwa Mar 10, 2026
d171b01
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 10, 2026
722da90
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
934ab2f
code clean up
danielkorzekwa Mar 10, 2026
0f14ec3
Merge branch 'dkorzekwa/anymodel_pruning' into dkorzekwa/anymodel_bui…
danielkorzekwa Mar 10, 2026
dcb9e02
remove not needed comment
danielkorzekwa Mar 10, 2026
0c9ea5d
Merge branch 'dkorzekwa/anymodel_build_library_and_stats' into dkorze…
danielkorzekwa Mar 10, 2026
5b310e2
Merge branch 'dkorzekwa/any_model_calc_one_block_scores' into dkorzek…
danielkorzekwa Mar 10, 2026
4f82b1c
Merge branch 'dkorzekwa/mip_and_realize_models' into dkorzekwa/any_mo…
danielkorzekwa Mar 10, 2026
176a435
Fix a broken test_puzzletron test on 2 gpus.
danielkorzekwa Mar 10, 2026
02e2c9b
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
92c4419
Merge branch 'dkorzekwa/anymodel_pruning' into dkorzekwa/anymodel_bui…
danielkorzekwa Mar 10, 2026
aa1eb3e
Merge branch 'dkorzekwa/anymodel_build_library_and_stats' into dkorze…
danielkorzekwa Mar 10, 2026
2b84a96
Merge branch 'dkorzekwa/any_model_calc_one_block_scores' into dkorzek…
danielkorzekwa Mar 10, 2026
fb838c0
Merge branch 'dkorzekwa/mip_and_realize_models' into dkorzekwa/any_mo…
danielkorzekwa Mar 10, 2026
13378ff
Add gpt-oss model
danielkorzekwa Mar 11, 2026
47ca0e3
Add comments about a broken test
danielkorzekwa Mar 11, 2026
96112f7
Fix a broken gptoss test
danielkorzekwa Mar 12, 2026
cb6b182
Add mamba to puzzletron dependencies.
danielkorzekwa Mar 12, 2026
670bb34
Update mamba-ssm and causal-conv1d dependencies (remove pinpoint versi…
danielkorzekwa Mar 13, 2026
0e1b591
Install mamba-ssm and causal-conv1d in testenv:cuda13-gpu-puzzletron
danielkorzekwa Mar 13, 2026
ca845ec
Fix installing dependencies in testenv:cuda13-gpu-puzzletron
danielkorzekwa Mar 13, 2026
be825bc
Fix anymodel for qwen3 8B in 2 gpus
danielkorzekwa Mar 13, 2026
7fd1afa
Fix pipeline parallelism issue for wen3-vl-30b-a3b-instruct-qwen3_vl-…
danielkorzekwa Mar 13, 2026
7d7b609
Fix multi-gpu issue for nemotron-nano-12b-v2
danielkorzekwa Mar 13, 2026
249af9d
Fix no_op in any_model
danielkorzekwa Mar 13, 2026
b80583c
Merge branch 'feature/puzzletron' into dkorzekwa/any_model_other_models
danielkorzekwa Mar 13, 2026
88b1b13
Merge any_model tutorial
danielkorzekwa Mar 13, 2026
c0da9c0
Merge mbridge distillation for any_model
danielkorzekwa Mar 13, 2026
1dd742e
Fix nemotron_h_model_descriptor.
danielkorzekwa Mar 14, 2026
4a6ebbe
Fix tox -e build-docs
danielkorzekwa Mar 14, 2026
585f0ed
pin mamba/causal-conv1d versions to fix failing assertion for test_pu…
danielkorzekwa Mar 14, 2026
7fb5d9a
Fix for installing mamba-ssm
danielkorzekwa Mar 14, 2026
75d3d69
Fix broken test for nemotron-3-nano-30b-a3b-base-bf16
danielkorzekwa Mar 14, 2026
0e5722d
code clean up
danielkorzekwa Mar 14, 2026
2dd9735
Make test_puzzletron test deterministic
danielkorzekwa Mar 15, 2026
3561de5
Comment out all models but nemotron-3-nano-30b-a3b-base-bf16 to check…
danielkorzekwa Mar 15, 2026
27866de
Implement Qwen3VLRemoveExpertsIndependentHook
danielkorzekwa Mar 15, 2026
f5fbbcf
MR branch for the remaining difference between dkorzekwa/any_model an…
danielkorzekwa Mar 16, 2026
a012fe6
Remove not needed nvidia licence header
danielkorzekwa Mar 16, 2026
52922a4
# Initialize weights to ensure all parameters are properly initialized
danielkorzekwa Mar 16, 2026
c234fb4
Fix non-deterministic test_puzzletron test
danielkorzekwa Mar 16, 2026
53dcd10
Fix for unsetting CUDA_VISIBLE_DEVICES
danielkorzekwa Mar 16, 2026
69d9648
increase numeric tolerance for test_puzzletron.py
danielkorzekwa Mar 17, 2026
4a692dc
Disable lm_loss assertion for nemotron-3-nano-30b-a3b-base-bf16 (not …
danielkorzekwa Mar 17, 2026
e795f0c
Removing incorrect licence file. gpt_oss_pruned_to_mxfp4.py was not a…
danielkorzekwa Mar 17, 2026
631306c
Fix hardcoded trust_remote_code
danielkorzekwa Mar 17, 2026
dc77be2
Merge branch 'dkorzekwa/any_model_other_models' into dkorzekwa/anymod…
danielkorzekwa Mar 17, 2026
b76e0ef
Merge branch 'dkorzekwa/anymodel_gptoss' into dkorzekwa/anymodel_tuto…
danielkorzekwa Mar 17, 2026
109b185
Merge branch 'dkorzekwa/anymodel_tutorial' into dkorzekwa/anymodel_mb…
danielkorzekwa Mar 17, 2026
b0972e4
Merge branch 'dkorzekwa/anymodel_mbridgedist' into dkorzekwa/remainin…
danielkorzekwa Mar 17, 2026
5cadc65
Merge branch 'feature/puzzletron' into dkorzekwa/anymodel_gptoss
danielkorzekwa Mar 17, 2026
151081c
Delete not needed yaml files for test_puzzletron.
danielkorzekwa Mar 17, 2026
36daa6d
Delete not needed mypy exclusion for removed hf_configs files.
danielkorzekwa Mar 17, 2026
960b8ce
Merge branch 'dkorzekwa/anymodel_gptoss' into dkorzekwa/anymodel_tuto…
danielkorzekwa Mar 17, 2026
854d96b
Merge branch 'dkorzekwa/anymodel_tutorial' into dkorzekwa/anymodel_mb…
danielkorzekwa Mar 17, 2026
cf06997
Merge branch 'dkorzekwa/anymodel_mbridgedist' into dkorzekwa/remainin…
danielkorzekwa Mar 17, 2026
b47f846
Merge branch 'feature/puzzletron' into dkorzekwa/anymodel_tutorial
danielkorzekwa Mar 17, 2026
13f5edc
Merge branch 'dkorzekwa/anymodel_tutorial' into dkorzekwa/anymodel_mb…
danielkorzekwa Mar 17, 2026
b4c71cc
Merge branch 'dkorzekwa/anymodel_mbridgedist' into dkorzekwa/remainin…
danielkorzekwa Mar 17, 2026
a3e20fc
Merge branch 'feature/puzzletron' into dkorzekwa/remainings_from_dkor…
danielkorzekwa Mar 20, 2026
ba85c29
remove not needed licence exception
danielkorzekwa Mar 20, 2026
36cb150
code clean up
danielkorzekwa Mar 20, 2026
808ad3f
Apply suggestion from @kevalmorabia97
kevalmorabia97 Mar 20, 2026
@@ -20,7 +20,7 @@
 import warnings
 from typing import Any

-from transformers.utils import is_flash_attn_2_available, is_torch_sdpa_available
+from transformers.utils import is_flash_attn_2_available # , is_torch_sdpa_available

 from .block_config import BlockConfig
 from .transformers_4_44_2__configuration_llama import LlamaConfig
@@ -119,12 +119,8 @@ def _delete_per_layer_attributes(self):
     def _choose_llama4_attn_implementation(self, llama4_attn_implementation):
         self.llama4_attn_implementation = llama4_attn_implementation
         if self.llama4_attn_implementation is None:
-            if is_torch_sdpa_available():
-                _print_once("auto-setting llama4_attn_implementation to sdpa")
-                self.llama4_attn_implementation = "sdpa"
-            else:
-                _print_once("auto-setting llama4_attn_implementation to eager")
-                self.llama4_attn_implementation = "eager"
+            _print_once("auto-setting llama4_attn_implementation to sdpa")
+            self.llama4_attn_implementation = "sdpa"

     def _choose_llama3_attn_implementation(self, kwargs: dict[str, Any]) -> str:
         attn_implementation = kwargs.pop("attn_implementation", None)
4 changes: 3 additions & 1 deletion modelopt/torch/puzzletron/mip/mip_and_realize_models.py
@@ -38,7 +38,7 @@ def launch_realize_model(cfg: DictConfig):
validate_puzzle_solutions(args=cfg.realize_model)


-def launch_mip_and_realize_model(cfg: DictConfig):
+def launch_mip_and_realize_model(cfg: DictConfig) -> list[str]:
     # Determine device for distributed operations (NCCL requires CUDA tensors)
     device = "cpu"
     if dist.size() > 1:
@@ -69,3 +69,5 @@ def launch_mip_and_realize_model(cfg: DictConfig):
cfg.realize_model.solutions_path = Path(solution_path)
launch_realize_model(cfg)
dist.barrier()
+
+    return solution_paths
297 changes: 297 additions & 0 deletions modelopt/torch/puzzletron/mip/sweep.py
@@ -0,0 +1,297 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""MIP sweep functionality for exploring multiple memory compression rates."""

import json
from pathlib import Path

import modelopt.torch.puzzletron.mip.mip_and_realize_models as mip_and_realize_models
import modelopt.torch.utils.distributed as dist
from modelopt.torch.puzzletron.tools.logger import mprint


def get_teacher_memory_from_subblock_stats(hydra_cfg) -> float:
    """Calculate teacher model memory from subblock_stats.json.

    Replicates the MIP solver's memory calculation logic:
    - Loads subblock_stats.json which contains memory measurements for all subblock configs
    - Finds the teacher FFN subblock (with full intermediate_size)
    - Finds the teacher Attention subblock (full attention, not no_op)
    - Calculates: non_block_memory + (ffn_memory + attention_memory) * num_layers

    This matches how the MIP solver computes total model memory via _get_block_stats().

    Args:
        hydra_cfg: Hydra configuration object

    Returns:
        Total teacher memory in MiB
    """
    puzzle_dir = Path(hydra_cfg.puzzle_dir)

    # Read config.json directly from the teacher model path
    teacher_dir = Path(hydra_cfg.teacher_dir)
    config_file = teacher_dir / "config.json"

    with open(config_file) as f:
        config_dict = json.load(f)

    num_layers = config_dict["num_hidden_layers"]
    teacher_ffn_intermediate = config_dict["intermediate_size"]
    teacher_num_kv_heads = config_dict["num_key_value_heads"]

    # Get the MIP configuration
    mip_subblock_args = hydra_cfg.mip.subblock_stats_args[0]
    batch_size = mip_subblock_args["batch_size"]
    weights_dtype = str(mip_subblock_args["weights_dtype"])
    activations_dtype = str(mip_subblock_args["activations_dtype"])
    kv_cache_dtype = str(mip_subblock_args["kv_cache_dtype"])

    # Load subblock_stats.json
    subblock_stats_path = puzzle_dir / "subblock_stats.json"
    if not subblock_stats_path.exists():
        raise FileNotFoundError(
            f"subblock_stats.json not found at {subblock_stats_path}. "
            "Please run the full pipeline first without --mip-only flag."
        )

    with open(subblock_stats_path) as f:
        subblock_stats_list = json.load(f)

    # Find the entry matching our MIP configuration and teacher's n_embd
    matching_stats = None
    for stats_entry in subblock_stats_list:
        args = stats_entry["args"]
        if (
            args["batch_size"] == batch_size
            and args["weights_dtype"] == weights_dtype
            and args["activations_dtype"] == activations_dtype
            and args["kv_cache_dtype"] == kv_cache_dtype
            and args.get("n_embd") == config_dict["hidden_size"]
        ):
            matching_stats = stats_entry
            break

    if matching_stats is None:
        raise ValueError(
            f"No subblock_stats entry found for batch_size={batch_size}, "
            f"dtypes=({weights_dtype}, {activations_dtype}, {kv_cache_dtype}), "
            f"n_embd={config_dict['hidden_size']}"
        )

    # Get non-block memory (embeddings, LM head, etc.)
    total_memory = matching_stats.get("non_block", {}).get("memory_mib", 0.0)

    # Find the teacher FFN and Attention subblocks
    # Note: Each subblock is EITHER attention OR ffn, not both
    # We need to find BOTH and add their memory together
    teacher_ffn_subblock = None
    teacher_attention_subblock = None

    for subblock in matching_stats.get("subblocks", []):
        subblock_class = subblock.get("subblock_config_class", "")
        subblock_config = subblock.get("subblock_config", {})

        # Check for FFN subblocks with teacher's intermediate_size
        if "FFN" in subblock_class:
            ffn_size = subblock_config.get("intermediate_size")
            if ffn_size == teacher_ffn_intermediate and not subblock_config.get("no_op", False):
                teacher_ffn_subblock = subblock

        # Check for Attention subblocks with teacher's num_key_value_heads
        elif "Attention" in subblock_class:
            kv_heads = subblock_config.get("num_key_value_heads")
            if kv_heads == teacher_num_kv_heads and not subblock_config.get("no_op", False):
                teacher_attention_subblock = subblock

    if teacher_ffn_subblock is None:
        raise ValueError(
            f"Could not find teacher FFN subblock with intermediate_size={teacher_ffn_intermediate}"
        )

    if teacher_attention_subblock is None:
        raise ValueError(
            f"Could not find teacher Attention subblock with num_key_value_heads={teacher_num_kv_heads}"
        )

    # Calculate total teacher memory: non_block + (ffn_memory + attention_memory) * num_layers
    per_layer_memory = teacher_ffn_subblock["memory_mib"] + teacher_attention_subblock["memory_mib"]
    total_memory += per_layer_memory * num_layers

    return total_memory


def extract_solution_results(
    solution_path: Path,
    target_memory_mib: float,
    teacher_memory_mib: float,
    compression_rate: float,
) -> dict:
    """Extract results from a completed MIP solution.

    Args:
        solution_path: Path to the solutions.json file (not the directory!)
        target_memory_mib: Target memory constraint used for MIP
        teacher_memory_mib: Teacher model memory in MiB
        compression_rate: Compression rate applied

    Returns:
        Dictionary containing extracted metrics
    """
    result = {
        "compression_rate": compression_rate,
        "target_memory_mib": target_memory_mib,
        "teacher_memory_mib": teacher_memory_mib,
    }

    # solution_path is the path to solutions.json file, get parent directory
    solution_dir = solution_path.parent

    # Load solutions.json for actual memory and parameters
    solutions_file = solution_dir / "solutions.json"
    with open(solutions_file) as f:
        solutions_data = json.load(f)
        solution = solutions_data[0]  # First solution
        total_costs = solution.get("total_costs", {})
        result["actual_memory_mib"] = total_costs.get("stats.memory_mib", None)
        result["num_params"] = total_costs.get("stats.num_params", None)

    # Load solution_0.json for accuracy metrics
    validation_dir = solution_dir / "solutions--validation"
    # TODO: There could be multiple solutions, but we only need the first one. Is it the best solution?
    solution_0_file = validation_dir / "solution_0.json"

    with open(solution_0_file) as f:
        validation_data = json.load(f)
        result["lm_loss"] = validation_data.get("lm_loss", {}).get("avg", None)
        result["token_accuracy_top_1"] = validation_data.get("token_accuracy_top_1", {}).get(
            "avg", None
        )
        result["token_accuracy_top_5"] = validation_data.get("token_accuracy_top_5", {}).get(
            "avg", None
        )
        result["token_accuracy_top_10"] = validation_data.get("token_accuracy_top_10", {}).get(
            "avg", None
        )

    return result


def write_results_to_csv(results: list, output_csv: str):
    """Write sweep results to CSV file.

    Args:
        results: List of result dictionaries
        output_csv: Path to output CSV file
    """
    import csv

    # Define CSV columns in desired order
    columns = [
        "compression_rate",
        "target_memory_mib",
        "actual_memory_mib",
        "teacher_memory_mib",
        "num_params",
        "lm_loss",
        "token_accuracy_top_1",
        "token_accuracy_top_5",
        "token_accuracy_top_10",
    ]

    # Write CSV
    output_path = Path(output_csv)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(results)

    mprint(f"Results written to: {output_path}")


def run_mip_sweep(hydra_cfg):
    """Run MIP for multiple memory compression rates and generate CSV with results.

    This function is called when mip.sweep.enabled is True in the config.

    Args:
        hydra_cfg: Hydra configuration object with mip.sweep settings
    """
    mprint("=" * 80)
    mprint("MIP Sweep Mode Enabled")
    mprint("=" * 80)

    # Get sweep configuration
    sweep_cfg = hydra_cfg.mip.sweep
    compression_rates = sweep_cfg.memory_compression_rates
    output_csv = sweep_cfg.output_csv
    puzzle_dir = Path(hydra_cfg.puzzle_dir)

    mprint(f"Compression rates: {compression_rates}")
    mprint(f"Output CSV: {output_csv}")
    mprint(f"Puzzle directory: {puzzle_dir}")

    # Calculate teacher memory from subblock_stats
    teacher_memory = get_teacher_memory_from_subblock_stats(hydra_cfg)
    mprint(
        f"Teacher memory (from subblock_stats): {teacher_memory:.1f} MiB ({teacher_memory / 1024:.1f} GiB)"
    )

    # Collect results
    all_results = []

    # Run MIP for each compression rate
    for compression_rate in compression_rates:
        target_memory_mib = teacher_memory * compression_rate
        mprint("\n" + "=" * 80)
        mprint(
            f"Running MIP for compression_rate={compression_rate:.2f} "
            f"(target={target_memory_mib:.1f} MiB = {target_memory_mib / 1024:.1f} GiB)"
        )
        mprint("=" * 80)

        # Modify config dynamically
        hydra_cfg.mip.human_constraints.target_memory = target_memory_mib

        # Run MIP and realize models (reuse existing distributed logic!)
        solution_paths = mip_and_realize_models.launch_mip_and_realize_model(hydra_cfg)

        # Extract results (only on master rank)
        if dist.is_master():
            for solution_path in solution_paths:
                result = extract_solution_results(
                    solution_path=Path(solution_path),
                    target_memory_mib=target_memory_mib,
                    teacher_memory_mib=teacher_memory,
                    compression_rate=compression_rate,
                )
                all_results.append(result)

                mprint(
                    f"✓ Results: actual_memory={result['actual_memory_mib']:.1f} MiB, "
                    f"lm_loss={result['lm_loss']:.4f}"
                )

    # Write results to CSV (only on master rank)
    if dist.is_master():
        mprint("\n" + "=" * 80)
        mprint("MIP Sweep Complete - Writing Results")
        mprint("=" * 80)
        write_results_to_csv(all_results, output_csv)
Comment on lines +227 to +295

Contributor


⚠️ Potential issue | 🟠 Major

Reject sweep mode when realization is disabled.

This function always extracts validation metrics after launch_mip_and_realize_model(), but that flow skips generating solutions--validation/solution_0.json when hydra_cfg.skip_realize_model is true. Right now the sweep will fail later with a FileNotFoundError instead of a clear config error.

💡 Minimal guard
 def run_mip_sweep(hydra_cfg):
     """Run MIP for multiple memory compression rates and generate CSV with results.
@@
     Args:
         hydra_cfg: Hydra configuration object with mip.sweep settings
     """
+    if hydra_cfg.skip_realize_model:
+        raise ValueError(
+            "mip.sweep requires realization because it reads solutions--validation/solution_0.json."
+        )
+
     mprint("=" * 80)
     mprint("MIP Sweep Mode Enabled")
     mprint("=" * 80)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/mip/sweep.py` around lines 227 - 295, run_mip_sweep
currently assumes realization ran and then calls extract_solution_results which
will fail when hydra_cfg.skip_realize_model is true; add an explicit guard at
the start of run_mip_sweep (or right before calling
mip_and_realize_models.launch_mip_and_realize_model) to reject/raise a clear
error if hydra_cfg.skip_realize_model is truthy, mentioning that sweep mode
requires realization, or alternatively skip extraction and CSV writing when
skip_realize_model is set; reference run_mip_sweep,
hydra_cfg.skip_realize_model,
mip_and_realize_models.launch_mip_and_realize_model, and
extract_solution_results to locate the change.
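
The guard proposed in this comment can be exercised in isolation. The following is a minimal sketch, assuming the config exposes `skip_realize_model` as an attribute; `check_sweep_preconditions` is a hypothetical helper name (in the PR the check would sit at the top of `run_mip_sweep`), and `SimpleNamespace` only stands in for the Hydra config object.

```python
from types import SimpleNamespace


def check_sweep_preconditions(cfg) -> None:
    """Fail fast if sweep mode is combined with skipped realization.

    Hypothetical helper for illustration. Using getattr() with a default is
    an assumption, so that a config without the flag counts as
    "realization enabled".
    """
    if getattr(cfg, "skip_realize_model", False):
        raise ValueError(
            "mip.sweep requires realization because it reads "
            "solutions--validation/solution_0.json."
        )


# Stand-in configs (SimpleNamespace replaces the Hydra DictConfig here).
ok_cfg = SimpleNamespace(skip_realize_model=False)
bad_cfg = SimpleNamespace(skip_realize_model=True)

check_sweep_preconditions(ok_cfg)  # passes silently

try:
    check_sweep_preconditions(bad_cfg)
    guard_triggered = False
except ValueError:
    guard_triggered = True
```

Raising before any MIP run starts means the misconfiguration surfaces immediately as a config error rather than later as a missing-file error.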

        mprint(f"Completed {len(all_results)} sweep runs")
        mprint("=" * 80)
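
The formula stated in the `get_teacher_memory_from_subblock_stats` docstring, non_block_memory + (ffn_memory + attention_memory) * num_layers, can be checked by hand on a toy stats entry. The sketch below mirrors the key names used in the diff (`non_block`, `subblocks`, `subblock_config_class`, `subblock_config`, `memory_mib`); the class names `FFNConfig` and `AttentionConfig` and all numbers are made up for illustration, and `teacher_memory_mib` is a hypothetical condensed helper, not the function from the PR.

```python
def teacher_memory_mib(
    stats_entry: dict, num_layers: int, intermediate_size: int, num_kv_heads: int
) -> float:
    """Return non_block + (ffn + attention) * num_layers, all in MiB."""
    total = stats_entry.get("non_block", {}).get("memory_mib", 0.0)
    ffn_mib = attn_mib = None
    for subblock in stats_entry.get("subblocks", []):
        cls = subblock.get("subblock_config_class", "")
        cfg = subblock.get("subblock_config", {})
        if cfg.get("no_op", False):
            continue  # no_op variants never represent the teacher
        if "FFN" in cls and cfg.get("intermediate_size") == intermediate_size:
            ffn_mib = subblock["memory_mib"]
        elif "Attention" in cls and cfg.get("num_key_value_heads") == num_kv_heads:
            attn_mib = subblock["memory_mib"]
    if ffn_mib is None or attn_mib is None:
        raise ValueError("teacher FFN or Attention subblock not found")
    return total + (ffn_mib + attn_mib) * num_layers


# Toy entry with illustrative values only.
toy_entry = {
    "non_block": {"memory_mib": 100.0},
    "subblocks": [
        {
            "subblock_config_class": "FFNConfig",
            "subblock_config": {"intermediate_size": 512},
            "memory_mib": 30.0,
        },
        {
            "subblock_config_class": "FFNConfig",
            "subblock_config": {"intermediate_size": 256},  # pruned variant, ignored
            "memory_mib": 15.0,
        },
        {
            "subblock_config_class": "AttentionConfig",
            "subblock_config": {"num_key_value_heads": 8},
            "memory_mib": 10.0,
        },
        {
            "subblock_config_class": "AttentionConfig",
            "subblock_config": {"no_op": True},  # skipped
            "memory_mib": 0.0,
        },
    ],
}

# 100 + (30 + 10) * 2 = 180 MiB for the teacher.
teacher_mib = teacher_memory_mib(toy_entry, num_layers=2, intermediate_size=512, num_kv_heads=8)

# A compression rate of 0.5 then gives the MIP target, as in run_mip_sweep.
target_mib = teacher_mib * 0.5
```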
2 changes: 1 addition & 1 deletion modelopt/torch/puzzletron/utils/checkpoint_manager.py
@@ -135,7 +135,7 @@ def load_hook_states(self, activation_hooks) -> bool:
loaded_count = 0
for module_name, hook in activation_hooks.items():
if module_name in hook_states:
-hook.load_state(hook_states[module_name])
+hook.load_state_dict(hook_states[module_name])
loaded_count += 1

# Log progress info if available (only for a few hooks to avoid spam)
2 changes: 1 addition & 1 deletion modelopt/torch/puzzletron/utils/data/dataset.py
@@ -287,7 +287,7 @@ def permute(


# this is expensive so we cache it
-@functools.cache
+@functools.lru_cache(maxsize=None)
def get_fim_token_ids(tokenizer):
# ugly fix for Salesforce/codegen25-7b-multi tokenizer
if hasattr(tokenizer, "encoder"):
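
The decorator swap in this hunk should not change caching behavior: `functools.cache` (added in Python 3.9) is documented as equivalent to `functools.lru_cache(maxsize=None)`. The source does not state the motivation for the change; compatibility with older Python versions is one possible reason. A minimal sketch of the unbounded-cache behavior, using a hypothetical `slow_square` function in place of `get_fim_token_ids`:

```python
import functools

call_count = 0


@functools.lru_cache(maxsize=None)  # unbounded cache, same behavior as functools.cache
def slow_square(x: int) -> int:
    """Hypothetical stand-in for an expensive, cacheable computation."""
    global call_count
    call_count += 1
    return x * x


first = slow_square(4)   # computed
second = slow_square(4)  # served from the cache, function body not re-run
info = slow_square.cache_info()
```

Both decorators key the cache on the call arguments, so the argument (here an `int`, in the diff a tokenizer object) must be hashable.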
6 changes: 3 additions & 3 deletions modelopt/torch/puzzletron/utils/parsing.py
@@ -150,9 +150,9 @@ def _format_attention_config(attention_config) -> str:
     if attention_config.no_op:
         return "❌ no_op"

-    n_heads = attention_config.n_heads_in_group
-    if n_heads is not None:
-        return f"{n_heads} heads in group"
+    num_kv_heads = attention_config.num_key_value_heads
+    if num_kv_heads is not None:
+        return f"{num_kv_heads} kv heads"
Comment on lines +153 to +155

Contributor


⚠️ Potential issue | 🟡 Minor

Guard KV-head display against non-positive values.

This branch currently treats 0 (or any non-None sentinel) as valid and can render misleading output like 0 kv heads. Please require a positive value before formatting.

Suggested patch
-    num_kv_heads = attention_config.num_key_value_heads
-    if num_kv_heads is not None:
+    num_kv_heads = attention_config.num_key_value_heads
+    if num_kv_heads is not None and num_kv_heads > 0:
         return f"{num_kv_heads} kv heads"
📝 Committable suggestion


Suggested change
-    num_kv_heads = attention_config.num_key_value_heads
-    if num_kv_heads is not None:
-        return f"{num_kv_heads} kv heads"
+    num_kv_heads = attention_config.num_key_value_heads
+    if num_kv_heads is not None and num_kv_heads > 0:
+        return f"{num_kv_heads} kv heads"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/utils/parsing.py` around lines 153 - 155, The
branch that formats KV-heads currently returns a string for any non-None value
of attention_config.num_key_value_heads (num_kv_heads) and can emit "0 kv
heads"; update the guard to require a positive integer (e.g., num_kv_heads > 0)
before returning f"{num_kv_heads} kv heads" so that zero or negative sentinels
are ignored and execution falls through to the existing fallback formatting
logic.
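
The effect of the stricter guard can be seen on a small stand-in. This is a sketch only: `AttentionConfigStub` is a hypothetical dataclass carrying just the three fields visible in the diff, and the final `"default attention"` fallback is an assumption, since the rest of `_format_attention_config` is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionConfigStub:
    """Hypothetical stand-in with only the fields used in the diff."""

    no_op: bool = False
    num_key_value_heads: Optional[int] = None
    replace_with_linear: bool = False


def format_attention_config(cfg: AttentionConfigStub) -> str:
    if cfg.no_op:
        return "❌ no_op"

    num_kv_heads = cfg.num_key_value_heads
    # Require a positive value so a 0 sentinel falls through to later branches.
    if num_kv_heads is not None and num_kv_heads > 0:
        return f"{num_kv_heads} kv heads"

    if cfg.replace_with_linear:
        return "linear replacement"

    return "default attention"  # assumed fallback, not taken from the PR


with_heads = format_attention_config(AttentionConfigStub(num_key_value_heads=8))
zero_heads = format_attention_config(
    AttentionConfigStub(num_key_value_heads=0, replace_with_linear=True)
)
```

Without the `> 0` check, the second call would return `"0 kv heads"` and never reach the `replace_with_linear` branch.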


     if attention_config.replace_with_linear:
         return "linear replacement"
4 changes: 2 additions & 2 deletions tests/_test_utils/torch/puzzletron/utils.py
@@ -129,14 +129,14 @@ def create_and_save_small_hf_model(
config.vocab_size = vocab_size
config.hidden_size = 256
config.intermediate_size = 512
-config.num_hidden_layers = 2
+config.num_hidden_layers = max(2, dist.size())

Collaborator


If the number of layers is either 1 or 2, do our assertions work for both cases?

Contributor Author


yes

config.num_attention_heads = 32
config.num_key_value_heads = 8
config.max_position_embeddings = 512

# Fix layer_types to match num_hidden_layers (newer transformers validates this)
if hasattr(config, "layer_types") and config.layer_types is not None:
-config.layer_types = config.layer_types[:2]
+config.layer_types = config.layer_types[: config.num_hidden_layers]

# Fix rope_scaling to be consistent with max_position_embeddings
if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
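
The two changes in this hunk work together: the layer count now scales with the number of ranks, and `layer_types` is truncated to that same count instead of a hard-coded 2. The sketch below isolates that arithmetic. `small_model_layer_settings` is a hypothetical helper, `world_size` stands in for `dist.size()`, and the layer-type strings are illustrative. The source does not state why the count depends on the world size; giving every rank at least one layer is one plausible reading.

```python
from typing import Optional


def small_model_layer_settings(
    world_size: int, layer_types: Optional[list]
) -> tuple:
    """Return (num_hidden_layers, truncated layer_types) for the small test model."""
    num_hidden_layers = max(2, world_size)  # never fewer than 2 layers
    if layer_types is not None:
        # Keep layer_types consistent with num_hidden_layers.
        layer_types = layer_types[:num_hidden_layers]
    return num_hidden_layers, layer_types


# Illustrative layer-type names only.
types = ["full_attention", "sliding_attention"] * 3  # 6 entries

single_gpu = small_model_layer_settings(1, types)
four_gpus = small_model_layer_settings(4, types)
no_types = small_model_layer_settings(2, None)
```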