Merged
4 changes: 4 additions & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
@@ -36,10 +36,14 @@ Changelog
- The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
- Add shared Megatron-Core calibration forward loop: ``modelopt.torch.utils.plugins.megatron_calibration.get_megatron_calibration_forward_loop`` produces the ``forward_loop`` callable expected by ``mtq.quantize`` / ``mtp.prune``. Replaces the bespoke calibration loops in Megatron-LM and Megatron-Bridge for quantization and pruning with a single canonical implementation.
- Add ``pack=True`` mode to ``get_dataset_dataloader`` (Megatron-LM pretraining-style global-stream document packing): all raw samples concatenated EOS-separated into one token stream, sliced into uniform ``max_sample_length`` rows. Used by the shared megatron calibration loop.
- Support Megatron-Core checkpoint restore and export for MSE ``NVFP4StaticQuantizer``.
- Add mixed-precision FP8 + NVFP4 export for Megatron-Core: per-layer ``quant_algo`` recorded under ``quantized_layers`` in ``hf_quant_config.json``, PP-aware ``kv_cache_dtype`` gather, fused-QKV exclude split into per-HF-name ``q/k/v_proj`` entries.
- Add Nemotron-3-Super-120B-A12B PTQ recipes ``modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`` (MSE-mixed) and ``super-nvfp4-max-calib.yaml`` (max-calib mixed): NVFP4 W4A4 routed experts + FP8 per-tensor shared experts / Mamba in/out_proj + FP8 KV cache.
- Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).

**Bug Fixes**

- In Megatron-Core only do EP amax sync for routed expert weights if ``sync_expert_weight_amax=True``. Previously EP amax sync would sync routed expert weights across EP ranks even when ``sync_expert_weight_amax`` was False.
- Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.

0.44 (2026-05-14)
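The ``pack=True`` changelog entry describes concatenating all samples into one EOS-separated token stream and slicing it into uniform ``max_sample_length`` rows. A minimal sketch of that idea follows. The function name ``pack_token_stream`` is hypothetical, and two details are assumptions not stated in the changelog: that an EOS token follows every document, and that an incomplete trailing row is dropped rather than padded. It is not the ``get_dataset_dataloader`` implementation.

```python
from __future__ import annotations

from typing import Iterable


def pack_token_stream(
    docs: Iterable[list[int]], eos_id: int, max_sample_length: int
) -> list[list[int]]:
    """Concatenate tokenized documents into one EOS-separated stream and
    slice it into rows of exactly ``max_sample_length`` tokens."""
    stream: list[int] = []
    for tokens in docs:
        stream.extend(tokens)
        stream.append(eos_id)  # assumption: EOS follows every document
    # Assumption: an incomplete trailing row is dropped, not padded.
    n_rows = len(stream) // max_sample_length
    return [
        stream[i * max_sample_length : (i + 1) * max_sample_length]
        for i in range(n_rows)
    ]


# Three documents of different lengths packed into rows of 4 tokens.
rows = pack_token_stream([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, max_sample_length=4)
print(rows)
```

Note that a row may span a document boundary (the second row above starts in one document and ends in the next), which is what distinguishes global-stream packing from per-sample padding.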
6 changes: 1 addition & 5 deletions examples/specdec_bench/specdec_bench/datasets/speed.py
@@ -730,11 +730,7 @@ def _load_dataset(self, config_name_or_dataset_path: config_type | str) -> "Data
 # Strip HF metadata from the schema to avoid Feature parsing errors
 schema = table.schema
 if schema.metadata and b"huggingface" in schema.metadata:
-    new_meta = {
-        k: v
Contributor Author

just a linter change

-        for k, v in schema.metadata.items()
-        if k != b"huggingface"
-    }
+    new_meta = {k: v for k, v in schema.metadata.items() if k != b"huggingface"}
     table = table.replace_schema_metadata(new_meta or None)
 dataset = HFDataset(table)
 if self.num_samples is not None and self.num_samples < len(dataset):
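For readers unfamiliar with the ``new_meta or None`` idiom in the hunk above, here is a small standalone sketch of the same filtering on a plain ``dict``. The helper name ``strip_hf_metadata`` is hypothetical, and a plain dictionary stands in for the pyarrow schema metadata mapping so the example runs without pyarrow.

```python
from __future__ import annotations


def strip_hf_metadata(metadata: dict[bytes, bytes] | None) -> dict[bytes, bytes] | None:
    """Drop the b"huggingface" key; return None when nothing else remains."""
    if not metadata or b"huggingface" not in metadata:
        return metadata  # nothing to strip, leave the metadata untouched
    new_meta = {k: v for k, v in metadata.items() if k != b"huggingface"}
    # An empty dict is falsy, so "or None" turns "no keys left" into None.
    return new_meta or None


mixed = strip_hf_metadata({b"huggingface": b"{}", b"pandas": b"x"})
only_hf = strip_hf_metadata({b"huggingface": b"{}"})
print(mixed, only_hf)
```

The one-line comprehension and the five-line form removed by the diff are behaviorally identical, which matches the author's note that this is only a linter change.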
69 changes: 61 additions & 8 deletions modelopt/torch/export/plugins/hf_checkpoint_utils.py
@@ -22,9 +22,21 @@

 import torch
 from huggingface_hub import snapshot_download
+from huggingface_hub.errors import LocalEntryNotFoundError
 from safetensors.torch import safe_open
 from tqdm import tqdm

+_HF_HUB_OFFLINE_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
+
+
+def _is_hf_hub_offline() -> bool:
+    return os.environ.get("HF_HUB_OFFLINE", "").strip().upper() in _HF_HUB_OFFLINE_TRUE_VALUES
+
+
+def _copy_python_files(source_dir: Path, save_dir: Path) -> None:
+    for py_file in source_dir.glob("*.py"):
+        shutil.copy2(py_file, save_dir / py_file.name)
Comment on lines +36 to +38
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Copy Python sidecars recursively to avoid missing package modules.

Line 37 only scans top-level *.py, so nested remote-code files are skipped. That can cause runtime import failures for checkpoints that ship package-style sidecars.

💡 Proposed fix
 def _copy_python_files(source_dir: Path, save_dir: Path) -> None:
-    for py_file in source_dir.glob("*.py"):
-        shutil.copy2(py_file, save_dir / py_file.name)
+    for py_file in source_dir.rglob("*.py"):
+        rel = py_file.relative_to(source_dir)
+        target = save_dir / rel
+        target.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy2(py_file, target)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/plugins/hf_checkpoint_utils.py` around lines 36 - 38,
The helper _copy_python_files currently only globs top-level "*.py" and misses
nested package modules; change it to walk recursively (use
source_dir.rglob("*.py")) and for each py file create the corresponding
subdirectory under save_dir (mkdir(parents=True, exist_ok=True)) before copying
with shutil.copy2 so the original package structure is preserved; reference
_copy_python_files, source_dir, save_dir, shutil.copy2 and ensure you recreate
relative parent paths when copying.
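To make the difference between the top-level and recursive scans concrete, the following standalone sketch builds a throwaway directory with one nested module and compares ``glob`` against an ``rglob``-based copy. The helper name ``copy_python_tree`` and the file names are hypothetical; this illustrates the reviewer's proposal and is not the code merged in this PR.

```python
import shutil
import tempfile
from pathlib import Path


def copy_python_tree(source_dir: Path, save_dir: Path) -> list[Path]:
    """Copy every *.py under source_dir into save_dir, keeping relative paths."""
    copied = []
    for py_file in sorted(source_dir.rglob("*.py")):
        rel = py_file.relative_to(source_dir)
        target = save_dir / rel
        # Recreate the package directory before copying into it.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(py_file, target)
        copied.append(rel)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "src", Path(tmp) / "dst"
    (src / "pkg").mkdir(parents=True)
    (src / "modeling.py").write_text("X = 1\n")
    (src / "pkg" / "helpers.py").write_text("Y = 2\n")
    (src / "config.json").write_text("{}")

    # glob("*.py") sees only the top level; the nested module is missed.
    top_level_only = sorted(p.name for p in src.glob("*.py"))
    copied = copy_python_tree(src, dst)
    nested_exists = (dst / "pkg" / "helpers.py").is_file()

print(top_level_only, [p.as_posix() for p in copied], nested_exists)
```

One caveat with the recursive variant: if ``save_dir`` lives inside ``source_dir``, previously copied files would be picked up again on a second run, so the two directories should be disjoint.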



 def copy_hf_ckpt_remote_code(
     pretrained_model_path: str | os.PathLike, save_directory: str | os.PathLike
@@ -36,7 +48,10 @@ def copy_hf_ckpt_remote_code(
     frameworks.

     If ``pretrained_model_path`` is a local directory, Python files are copied directly.
-    If it's a HF Hub model ID (e.g. ``nvidia/NVIDIA-Nemotron-Nano-12B-v2``), files are downloaded from the Hub.
+    If it's a HF Hub model ID (e.g. ``nvidia/NVIDIA-Nemotron-Nano-12B-v2``), the Hub
+    snapshot is resolved first and Python files are copied from that snapshot. When
+    ``HF_HUB_OFFLINE`` is set, the snapshot must already be available in the local
+    Hugging Face cache.

     Args:
         pretrained_model_path: Local path to the pretrained model or HuggingFace Hub model ID.
@@ -47,14 +62,28 @@ def copy_hf_ckpt_remote_code(
     save_dir.mkdir(parents=True, exist_ok=True)

     if hf_checkpoint_path.is_dir():
-        for py_file in hf_checkpoint_path.glob("*.py"):
-            shutil.copy2(py_file, save_dir / py_file.name)
+        _copy_python_files(hf_checkpoint_path, save_dir)
     else:
-        snapshot_download(
-            repo_id=str(pretrained_model_path),
-            local_dir=str(save_dir),
-            allow_patterns=["*.py"],
-        )
+        local_files_only = _is_hf_hub_offline()
+        try:
+            source_dir = Path(
+                snapshot_download(
+                    repo_id=str(pretrained_model_path),
+                    allow_patterns=["*.py"],
+                    local_files_only=local_files_only,
+                )
+            )
+        except LocalEntryNotFoundError as exc:
+            if local_files_only:
+                raise RuntimeError(
+                    f"Could not copy Python sidecar files for {pretrained_model_path!r} because "
+                    "HF_HUB_OFFLINE is enabled and the files are not available in the local "
+                    "Hugging Face cache. Populate the cache with the model's *.py files or pass "
+                    "a local pretrained model directory."
+                ) from exc
+            raise
+
+        _copy_python_files(source_dir, save_dir)


 def load_multimodal_components(
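The offline branch above hinges on how ``HF_HUB_OFFLINE`` is parsed. The sketch below mirrors the parsing in ``_is_hf_hub_offline`` and shows which values count as "offline". The ``environ`` parameter is a hypothetical addition that keeps the example testable without mutating the real process environment; the helper in the diff reads ``os.environ`` directly.

```python
import os
from typing import Mapping

_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def is_hf_hub_offline(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when HF_HUB_OFFLINE holds a truthy value (case-insensitive)."""
    return environ.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES


# Whitespace and case are normalized before the lookup; anything outside
# the truthy set, including an unset variable, counts as online.
examples = {
    "unset": is_hf_hub_offline({}),
    "1": is_hf_hub_offline({"HF_HUB_OFFLINE": "1"}),
    " yes ": is_hf_hub_offline({"HF_HUB_OFFLINE": " yes "}),
    "true": is_hf_hub_offline({"HF_HUB_OFFLINE": "true"}),
    "0": is_hf_hub_offline({"HF_HUB_OFFLINE": "0"}),
    "off": is_hf_hub_offline({"HF_HUB_OFFLINE": "off"}),
}
print(examples)
```

In the diff, the resulting boolean is passed straight to ``snapshot_download(..., local_files_only=...)``, so a truthy value restricts resolution to the local Hugging Face cache.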
@@ -123,3 +152,27 @@ def load_multimodal_components(

     print(f"Successfully loaded {len(multimodal_state_dict)} multimodal tensors")
     return multimodal_state_dict
+
+
+def copy_non_safetensor_files_from_ckpt(src: str | os.PathLike, dst: str | os.PathLike):
Collaborator

[SUGGESTION] The current implementation copies everything non-safetensors from the source — including config.json, hf_quant_config.json, generation_config.json, preprocessor_config.json. The docstring acknowledges this and says "The caller is expected to overwrite the files modelopt owns" — and today save_pretrained does, immediately after.

The risk is the load-bearing convention. If a future refactor adds a guarded path in save_pretrained that skips writing one of those files under some condition (e.g., a new is_last_stage_main_rank sub-branch, or a try/except around _hf_config.save_pretrained), the stale source file silently survives — no test failure, no warning, just a quietly-wrong exported checkpoint.

Two safer alternatives:

  1. Filter modelopt-owned files in the helper itself with an explicit skip-list (preferred, no caller-side discipline required):
_MODELOPT_OWNED_FILES = frozenset({
    "config.json",
    "generation_config.json",
    "hf_quant_config.json",
    "preprocessor_config.json",
})

def copy_non_safetensor_files_from_ckpt(src, dst):
    ...
    for entry in os.listdir(src):
        if entry in _MODELOPT_OWNED_FILES:
            continue
        if entry.endswith(".safetensors") or entry == "model.safetensors.index.json":
            continue
        ...
  2. Or add a post-condition assert at the end of save_pretrained that the modelopt-owned files were rewritten (timestamp / contents check).

Option 1 removes the silent-failure mode entirely without changing today's behavior.
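A runnable version of Option 1 might look like the sketch below. It fills in the elided parts of the suggestion with the loop body from the helper under review. The function name ``copy_passthrough_files``, the returned list, and the sample file names are hypothetical additions for illustration only.

```python
import os
import shutil
import tempfile

# Files the exporter is expected to write itself (list taken from the docstring).
_MODELOPT_OWNED_FILES = frozenset({
    "config.json",
    "generation_config.json",
    "hf_quant_config.json",
    "preprocessor_config.json",
})


def copy_passthrough_files(src: str, dst: str) -> list[str]:
    """Copy regular files from src to dst, skipping weights and owned configs."""
    if not os.path.isdir(src):
        raise ValueError(f"Invalid source path: {src}. It should be a directory.")
    os.makedirs(dst, exist_ok=True)
    copied = []
    for entry in sorted(os.listdir(src)):
        sp = os.path.join(src, entry)
        if not os.path.isfile(sp):
            continue
        if entry in _MODELOPT_OWNED_FILES:
            continue  # never carry a stale copy of a file the exporter owns
        if entry.endswith(".safetensors") or entry == "model.safetensors.index.json":
            continue
        shutil.copy2(sp, dst)
        copied.append(entry)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
    os.makedirs(src)
    for name in ["tokenizer.json", "config.json", "model.safetensors",
                 "modeling_x.py", "README.md"]:
        with open(os.path.join(src, name), "w") as fh:
            fh.write("placeholder")
    copied = copy_passthrough_files(src, dst)
    stale_config_present = os.path.exists(os.path.join(dst, "config.json"))

print(copied, stale_config_present)
```

With the skip-list in the helper, a missing ``config.json`` in the export directory becomes a visible failure instead of a silently stale file.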

+    """Copy every non-safetensors file from a local HF checkpoint dir verbatim.
+
+    Use as a baseline so tokenizer files, remote_code ``*.py``, README, LICENSE, etc.
+    are preserved from the source. The caller is expected to overwrite the files
+    modelopt owns (``config.json``, ``generation_config.json``, ``hf_quant_config.json``,
+    ``preprocessor_config.json``) after this step.
+
+    Args:
+        src: Source HF checkpoint directory. Must be a local path.
+        dst: Destination directory; created if missing.
+    """
+    if not os.path.isdir(src):
+        raise ValueError(f"Invalid source path: {src}. It should be a directory.")
+    os.makedirs(dst, exist_ok=True)
+    for entry in os.listdir(src):
+        sp = os.path.join(src, entry)
+        if not os.path.isfile(sp):
+            continue
+        if entry.endswith(".safetensors") or entry == "model.safetensors.index.json":
+            continue
+        shutil.copy2(sp, dst)
5 changes: 4 additions & 1 deletion modelopt/torch/export/plugins/mcore_nemotron.py
@@ -131,7 +131,10 @@
     "input_layernorm": NameRemapping("backbone.layers.{}.norm."),
     "linear_qkv": QKVSlicing("backbone.layers.{}.mixer."),
     "linear_proj": NameRemapping("backbone.layers.{}.mixer.o_proj."),
-    "core_attention": SelfAttentionScaling("backbone.layers.{}.mixer."),
+    "core_attention": SelfAttentionScaling(
+        "backbone.layers.{}.mixer.",
+        func_kwargs={"k_scale_name": "k_proj.k_scale", "v_scale_name": "v_proj.v_scale"},
+    ),
     # MLP
     "pre_mlp_layernorm": NameRemapping("backbone.layers.{}.norm."),
     "linear_fc1": NameRemapping("backbone.layers.{}.mixer.up_proj."),
20 changes: 18 additions & 2 deletions modelopt/torch/export/quant_utils.py
@@ -288,9 +288,25 @@ def _ensure_weight_quantizer_calibrated(
         module_name: Optional module name for better warning messages
     """
     if isinstance(weight_quantizer, NVFP4StaticQuantizer):
-        need_per_block = not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None
+
+        def _amax_is_invalid(t: torch.Tensor | None) -> bool:
+            # MCore distcp may register but not fill amax — treat missing/non-finite/negative as recompute.
+            if t is None:
+                return True
+            t = t.detach()
+            if not torch.is_floating_point(t):
+                return False
+            return bool((~torch.isfinite(t) | (t < 0)).any().item())
+
+        need_per_block = (
+            not hasattr(weight_quantizer, "_amax")
+            or weight_quantizer._amax is None
+            or _amax_is_invalid(weight_quantizer._amax)
+        )
         need_global = (
-            not hasattr(weight_quantizer, "_global_amax") or weight_quantizer.global_amax is None
+            not hasattr(weight_quantizer, "_global_amax")
+            or weight_quantizer.global_amax is None
+            or _amax_is_invalid(weight_quantizer.global_amax)
         )
         if not (need_per_block or need_global):
             return
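The validity rule added in this hunk can be read as "recompute when amax is missing, or when any entry is non-finite or negative". The sketch below restates that rule in plain Python so it runs without PyTorch. It is an analogue, not the code in the diff: lists of floats stand in for tensors, and the diff's early return for non-floating-point tensors is not modeled.

```python
from __future__ import annotations

import math
from typing import Sequence


def amax_is_invalid(values: Sequence[float] | None) -> bool:
    """Plain-Python analogue: True means the amax should be recomputed."""
    if values is None:
        return True  # registered but never filled
    # NaN and +/-inf fail math.isfinite; a negative amax is never meaningful.
    return any((not math.isfinite(v)) or v < 0 for v in values)


checks = {
    "missing": amax_is_invalid(None),
    "healthy": amax_is_invalid([0.5, 2.0]),
    "zero": amax_is_invalid([0.0]),
    "nan": amax_is_invalid([0.5, float("nan")]),
    "inf": amax_is_invalid([float("inf")]),
    "negative": amax_is_invalid([-1.0]),
}
print(checks)
```

A single bad entry is enough to mark the whole amax as invalid, matching the ``.any()`` reduction in the diff. An amax of exactly zero is accepted by this rule.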