Split bypass prerequisites by Separius · Pull Request #1468 · NVIDIA/Model-Optimizer

Separius · 2026-05-12T10:45:03Z

Summary

This is PR 1 of 3 in the Puzzletron bypass/local-distillation stack.

This PR contains prerequisite infrastructure only. It does not wire bypass distillation into the Puzzletron pipeline yet.

Stack:

This PR: shared prerequisites
ssameni/puzzletron-bypass-2-core: bypass distillation core
ssameni/puzzletron-bypass-3-integration: Puzzletron integration, configs, docs, GPU coverage

What Changed

Added ModelDescriptor.pruning_mixins() so model families can expose pruning mixins needed by downstream bypass initialization.
Added KV-head pruning mixin support for GPT-OSS, Nemotron-H, Nemotron-H-v2, and Qwen3-VL descriptors.
Improved pruning utilities for nested language-model configs and missing attention bias config fields.
Added create_train_dataloader() and streaming-safe shuffle handling.
Added chat-template fallback for base models without tokenizer.chat_template.
Added Sewing Kit loss/helper exports needed by the later bypass core.
Updated child-state initialization to support composing multiple pruning mixins.
Updated warmup-step resolver to account for gradient accumulation.
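
As an illustration of the grad-accumulation-aware warmup resolver listed above, here is a minimal sketch. The function name `warmup_steps_sketch`, its parameter names, and the formula are assumptions made for this example, not the code in `hydra_utils.py`. It assumes one optimizer step consumes `block * mbs * grad_accum` tokens.

```python
def warmup_steps_sketch(tokens: int, block: int, mbs: int, grad_accum: int, pct: float) -> int:
    """Hypothetical warmup-step calculation; not the Model-Optimizer implementation."""
    # Validate before dividing so bad configs fail with a clear message.
    if block <= 0 or mbs <= 0:
        raise ValueError(f"block and mbs must be > 0, got block={block}, mbs={mbs}")
    if grad_accum < 1:
        raise ValueError(f"grad_accum must be >= 1, got {grad_accum}")
    if tokens < 0 or not 0.0 <= pct <= 1.0:
        raise ValueError(f"tokens must be >= 0 and pct in [0, 1], got {tokens}, {pct}")

    # Micro-batches needed to consume the token budget.
    micro_batches = tokens // (block * mbs)
    # Assumption: the scheduler advances once per optimizer step,
    # i.e. once every `grad_accum` micro-batches.
    optimizer_steps = micro_batches // grad_accum
    return round(pct * optimizer_steps)
```

Under this assumption, ignoring `grad_accum` would overstate the warmup length by a factor of `grad_accum`, which is the kind of mismatch the resolver change is meant to avoid.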

Why

The bypass distillation MR needs these reusable pieces, but they are independently reviewable and useful without adding the bypass training stage itself.

Splitting them out keeps the bypass core PR focused on the actual local-distillation engine.

Tests

Added focused unit coverage for:

Dataloader behavior
Bypass loss helpers
KV-head pruning utilities
Sewing Kit activity/input/function/needle behavior

Summary by CodeRabbit

New Features
- KV-head pruning added for multiple model families; generic pruning mixin hook available.
- New training dataloader factory for infinite, block-sized training streams.
- Vectorwise and batched normalized MSE loss utilities.
Improvements
- Loss reports now show Δ-from-initial and visual indicators.
- Chat-sample preprocessing tolerates tokenizers without chat templates.
- More robust head-dimension/bias handling and grad-accum-aware warmup resolver.
Tests
- Extensive unit tests added across dataloaders, losses, pruning, hydra utils, and sewing-kit.

copy-pr-bot · 2026-05-12T10:45:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-12T10:45:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

This PR extends the pruning framework with KV-heads support across multiple model descriptors, adds LM-config helpers and sequential multi-mixin application, introduces normalized MSE loss utilities, adds a training dataloader factory with tokenizer-aware chat preprocessing, updates stitched-loss formatting and warmup resolver behavior, and adds comprehensive unit tests.

Changes

Pruning and model descriptor enhancements

Layer / File(s)	Summary
Base pruning mixin interface and language-model config utilities `modelopt/torch/puzzletron/anymodel/model_descriptor/base.py`, `modelopt/torch/puzzletron/pruning/pruning_utils.py`	Adds `ModelDescriptor.pruning_mixins()` extension point; introduces `_lm_attrs()` and `_lm_head_dim()` to extract language-model sub-configs and head_dim for VL configs; updates `_init_attention_weights()`/`_init_attention_biases()` to use LM metadata with robust bias-key probing; adds `MlpInitMode.MoEChannelPruning`.
KV-heads pruning across model descriptors `modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py`, `modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_model_descriptor.py`, `modelopt/torch/puzzletron/anymodel/models/nemotron_h/nemotron_h_model_descriptor.py`, `modelopt/torch/puzzletron/anymodel/models/nemotron_h_v2/nemotron_h_v2_model_descriptor.py`, `modelopt/torch/puzzletron/anymodel/models/qwen3_vl/qwen3_vl_model_descriptor.py`	`KVHeadsPruningMixIn` derives head size via `_lm_head_dim()`; GPT-OSS, NemotronH, NemotronHV2, and Qwen3VL descriptors register `kv_heads` pruning mixins and export model-specific `KVHeadsLayerDescriptor` dataclasses; expert-removal mixins registered (including legacy alias where present).
Sequential mixin composition and config override `modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`	`_process_single_layer()` supports lists of pruning mixins applied sequentially, threading interim parent/new state and per-layer key views, merging per-mixin layer outputs and aggregating `keys_to_remove`; `update_model_config.override()` treats `None` as leave-unchanged.

Training infrastructure and loss utilities

Layer / File(s)	Summary
Normalized MSE loss functions `modelopt/torch/puzzletron/sewing_kit/utils.py`, `tests/unit/torch/puzzletron/test_bypass_losses.py`	Re-exports `normalized_mse_loss`; adds `vectorwise_normalized_mse_loss()` and `batched_normalized_mse_loss()` with batch-dim validation, epsilon-stabilized relative-L2 normalization, and mean-per-batch aggregation; tests cover identity, randomness, reduction modes, scale invariance, zero-target finiteness, and error cases.
Training dataloader factory `modelopt/torch/puzzletron/utils/data/dataloaders.py`, `modelopt/torch/puzzletron/utils/data/dataset.py`, `tests/unit/torch/puzzletron/test_bypass_dataloaders.py`	`create_train_dataloader()` builds an infinite `DataLoader` backed by `ConstantLengthDataset`, rejects `num_workers>0`, supports streaming vs map shuffle, and wraps training split; `ConstantLengthDataset.__iter__` uses `tokenizer.apply_chat_template()` when available or falls back to normalized newline-joined message content; tests validate materialization, padding, collation, `Printer` contract, loader delegation, and validation split auto-selection.
Configuration formatting and warmup computation `modelopt/torch/puzzletron/tools/hydra_utils.py`, `modelopt/torch/puzzletron/utils/parsing.py`	`warmup_steps()` now requires `grad_acc` and validates inputs; `_warmup_steps_resolver()` supports 3/4/5-argument calls and is registered for Hydra; `format_stitched_losses()` accepts `initial_values_dict` and `not_trainable_names`, renders "Δ from initial", filters stats to finite values, and appends skipped count; formatters updated with emoji/bullet-style rendering.

Sewing kit infrastructure

Layer / File(s)	Summary
Sewing kit module exports and comprehensive tests `modelopt/torch/puzzletron/sewing_kit/passage.py`, `tests/unit/torch/puzzletron/*`	`always_true_predicate` exported from `passage.py`; extensive tests added for ActivityContext (stack semantics), Needle graph/validation, FunctionTarget kwargs-only dispatch, InputArgs behavior, pruning mixin composition and key-tracking, KV-head helper, hydra warmup validation, dataloader behavior, and loss formatting/utilities.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Split bypass prerequisites' is vague and does not clearly convey the main changes in this substantial PR.	Consider a more specific title that highlights key changes, such as 'Add pruning mixins, KV-head pruning support, and bypass prerequisites' or 'Add ModelDescriptor.pruning_mixins() and KV-head pruning infrastructure'.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No SECURITY.md anti-patterns found: no new torch.load/numpy.load/trust_remote_code/eval/exec/nosec/restricted-dependencies additions detected in PR changes.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

github-actions · 2026-05-12T10:49:39Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-29 12:02 UTC

Separius · 2026-05-12T10:54:34Z

/claude review

claude · 2026-05-12T11:01:08Z

Claude review — summary

Findings: CRITICAL: 1 · IMPORTANT: 2 · SUGGESTION: 2

Most impactful

CRITICAL: tests/unit/torch/puzzletron/test_bypass_losses.py::test_format_stitched_losses_keeps_trainable_nan_visible calls format_stitched_losses(...) with initial_values_dict= and not_trainable_names= kwargs that don't exist in the function's current signature (and asserts on output strings like "Skipped=1" / "non-finite" that the implementation never produces). This test will hard-fail at collection/call time (TypeError). Either bring the format_stitched_losses update forward into this PR or defer this single test to the bypass-core PR.
IMPORTANT: The multi-mixin composition in child_init.py:_process_single_layer uses last-writer-wins semantics via dict.update, despite the comment claiming ordering can't corrupt the state dict. Two mixins that ever touch the same key will silently clobber each other. Either tighten the comment or add an overlap assertion.
IMPORTANT: override(item, None) in child_init.py:update_model_config now returns item instead of None. This is a sensible fix if None means "no override," but it's a behavior change — any caller that deliberately cleared a field with None now keeps the old value. Worth verifying no internal recipes/configs depended on the old semantics.
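
The overlap assertion suggested in the second finding could look roughly like the sketch below. The helper name `merge_mixin_outputs` and the mixin call signature are hypothetical and are not taken from `child_init.py`. The sketch only shows how a merge can fail loudly instead of letting `dict.update` apply last-writer-wins.

```python
from typing import Any, Callable

# Hypothetical mixin signature: takes the parent state dict, returns new entries.
Mixin = Callable[[dict[str, Any]], dict[str, Any]]


def merge_mixin_outputs(parent_state: dict[str, Any], mixins: list[Mixin]) -> dict[str, Any]:
    """Merge per-mixin outputs, refusing to overwrite a key written by an earlier mixin."""
    merged: dict[str, Any] = {}
    for mixin in mixins:
        out = mixin(parent_state)
        # A plain merged.update(out) would silently keep only the last writer.
        overlap = merged.keys() & out.keys()
        if overlap:
            raise ValueError(f"pruning mixins wrote the same keys: {sorted(overlap)}")
        merged.update(out)
    return merged
```

If overlapping writes are intended in some configurations, the same check could log a warning instead of raising. That is a design choice the sketch does not settle.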

Risk level

Moderate. The bulk of the PR is cleanly scoped prerequisite plumbing (descriptor mixins, dataloader, chat-template fallback, warmup-step grad-accum handling, re-exports) with good test coverage for the pure-function helpers. The blocker is the one test that presupposes function-signature changes shipping in the follow-up PR — that needs to be resolved before merge. The mixin-composition and override-None semantics deserve a second look but aren't blockers.
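
The override-None behavior change mentioned above can be made concrete with a small sketch. Both functions are hypothetical stand-ins written for this example. They mirror the semantics described in the review, not the actual body of `update_model_config.override()`.

```python
from typing import Any, Optional


def override_before(item: Any, new_value: Optional[Any]) -> Any:
    """Assumed old semantics: the override always wins, so None clears the field."""
    return new_value


def override_after(item: Any, new_value: Optional[Any]) -> Any:
    """Assumed new semantics: None means "no override", so the existing value is kept."""
    return item if new_value is None else new_value
```

A config that previously passed `None` to clear a field would therefore keep the old value under the new semantics, which is why the review asks for a check of existing recipes.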

codecov · 2026-05-12T11:31:56Z

Codecov Report

❌ Patch coverage is 89.33333% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.11%. Comparing base (7ae1865) to head (11c1eea).

Files with missing lines	Patch %	Lines
...h/puzzletron/tools/bypassed_training/child_init.py	80.43%	9 Missing ⚠️
modelopt/torch/puzzletron/pruning/pruning_utils.py	78.26%	5 Missing ⚠️
modelopt/torch/puzzletron/utils/parsing.py	86.84%	5 Missing ⚠️
modelopt/torch/puzzletron/utils/data/dataset.py	91.30%	2 Missing ⚠️
...torch/puzzletron/anymodel/model_descriptor/base.py	66.66%	1 Missing ⚠️
modelopt/torch/puzzletron/sewing_kit/utils.py	95.65%	1 Missing ⚠️
modelopt/torch/puzzletron/tools/hydra_utils.py	96.15%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1468      +/-   ##
==========================================
+ Coverage   76.53%   77.11%   +0.57%     
==========================================
  Files         478      478              
  Lines       52027    52209     +182     
==========================================
+ Hits        39821    40263     +442     
+ Misses      12206    11946     -260

Flag	Coverage Δ
examples	`41.64% <24.00%> (+8.80%)`	⬆️
gpu	`59.44% <42.66%> (-0.64%)`	⬇️
regression	`15.18% <0.00%> (+0.01%)`	⬆️
unit	`53.52% <84.88%> (+0.74%)`	⬆️

Separius · 2026-05-12T11:34:36Z

@AAnoosheh and @kevalmorabia97 ready for review (split the bypass MR into 3, this is the first one, nothing too important, just some preparations and tiny fixes)

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

tests/unit/torch/puzzletron/test_bypass_dataloaders.py (1)

206-219: ⚡ Quick win

Add a direct test for ConstantLengthDataset chat-template fallback

This fixture replaces ConstantLengthDataset, so the new no-chat_template preprocessing path in ConstantLengthDataset.__iter__ is not exercised. A small targeted iterator test would close that regression gap.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/puzzletron/test_bypass_dataloaders.py` around lines 206 -
219, The fixture patches out ConstantLengthDataset so
ConstantLengthDataset.__iter__'s new no-chat_template fallback isn't tested; add
a small unit test that imports the real ConstantLengthDataset (not
_FakeConstantLengthDataset), constructs it with a tiny dataset whose items lack
"chat_template", iterates it (e.g., list(ConstantLengthDataset(...)) or calling
its __iter__), and asserts the output matches the expected realized items (e.g.,
tensors like {"input_ids": torch.tensor([0])}); ensure this test does not apply
the patched_dataloader monkeypatch and references ConstantLengthDataset and
ConstantLengthDataset.__iter__ (and optionally create_validation_dataloader) so
the fallback path is exercised.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/puzzletron/sewing_kit/utils.py`:
- Around line 479-495: The function batched_normalized_mse_loss allows silent
broadcasting when input and target shapes differ; add explicit shape validation
at the top of the function: verify input.ndim == target.ndim, confirm batch_dims
are valid indices, and ensure sizes match for every dimension (both batch dims
and non-batch dims computed via norm_dims) so that target and input are exactly
compatible; if any mismatch, raise a ValueError with a clear message that
includes the shapes of input and target and the resolved batch_dims/norm_dims to
aid debugging.

In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`:
- Around line 93-95: The per-layer loop currently does full copies via
current_parent_state_dict = dict(parent_state_dict), current_new_state_dict =
dict(new_state_dict), current_keys = dict(keys) which is expensive; instead,
stop cloning entire mappings inside the loop and operate on the original dicts
(parent_state_dict, new_state_dict, keys) by reading values directly and only
materialize copies for individual tensors/entries that are actually modified
(e.g., when applying a mixin to a specific key). Locate the per-layer mixin loop
and replace the dict() copies with references to the originals, and when you
need to mutate a specific parameter, copy only that parameter (or its key->value
pair) and write back to new_state_dict; ensure any iteration over keys uses an
iterator or list(keys) outside the hot loop if necessary to avoid mutation
races.

In `@modelopt/torch/puzzletron/tools/hydra_utils.py`:
- Around line 35-50: The warmup_steps function must validate and normalize
inputs before doing integer divisions: ensure tokens, block, mbs and grad_accum
are ints (or cast) and that block>0, mbs>0, grad_accum>=1, and that pct is a
float within [0.0,1.0] (or at least >=0); raise ValueError with clear messages
for invalid values. In function warmup_steps, coerce tokens, block, mbs,
grad_accum and pct to the expected types up front, check block and mbs are >0 to
avoid ZeroDivisionError, check grad_accum>=1 (existing check can be reused), and
validate pct (and tokens>=0) before computing iters/steps and returning the
rounded warmup steps.

In `@modelopt/torch/puzzletron/utils/data/dataset.py`:
- Around line 131-138: The fallback that concatenates messages when
getattr(self.tokenizer, "chat_template", None) is None assumes every
m["content"] is a str and can raise TypeError for structured payloads; update
the else branch in dataset.py where sample is built to normalize each
m["content"] to a string before joining (e.g., if m["content"] is a dict or
other structured object, extract a text field if present like
m["content"].get("text") or otherwise call str(m["content"])), so the
concatenation in the no-template path (the code around tokenizer.chat_template
and tokenizer.apply_chat_template) always receives plain text.

In `@tests/unit/torch/puzzletron/test_sewing_kit_function_target_kwargs.py`:
- Around line 137-139: The test currently checks values in received["kwargs"]
but doesn't ensure no extra kwargs are present; update the second-order test in
test_sewing_kit_function_target_kwargs (use the local variables received,
student_value, teacher_value) to assert that received["kwargs"] contains exactly
the keys "input" and "target" (e.g., compare set(received["kwargs"].keys()) to
{"input","target"}) before the existing torch.equal assertions, then keep the
existing checks for received["args"] and the tensor equality against
student_value and teacher_value.
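
The chat-template fallback discussed for `dataset.py` can be sketched as follows. The helper names `_content_to_text` and `render_messages` are hypothetical, and the sketch assumes each message is a dict with a `"content"` entry. Only `chat_template` and `apply_chat_template(..., tokenize=False)` are taken from the Hugging Face tokenizer API.

```python
from typing import Any


def _content_to_text(content: Any) -> str:
    """Normalize a message payload to plain text (hypothetical helper)."""
    if isinstance(content, str):
        return content
    # Structured payloads: prefer an explicit "text" field when one exists.
    if isinstance(content, dict) and "text" in content:
        return str(content["text"])
    return str(content)


def render_messages(tokenizer: Any, messages: list[dict[str, Any]]) -> str:
    """Use the tokenizer's chat template when present, else join message contents."""
    if getattr(tokenizer, "chat_template", None) is not None:
        return tokenizer.apply_chat_template(messages, tokenize=False)
    # Base models without a chat template: newline-joined, normalized content.
    return "\n".join(_content_to_text(m["content"]) for m in messages)
```

Normalizing before the join keeps the no-template path from raising `TypeError` when a message carries a dict instead of a string.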

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ddff5f0a-3633-4520-914f-dad472197cf8

📥 Commits

Reviewing files that changed from the base of the PR and between 7a11fb2 and a79fbae.

📒 Files selected for processing (22)

modelopt/torch/puzzletron/anymodel/model_descriptor/base.py
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_model_descriptor.py
modelopt/torch/puzzletron/anymodel/models/nemotron_h/nemotron_h_model_descriptor.py
modelopt/torch/puzzletron/anymodel/models/nemotron_h_v2/nemotron_h_v2_model_descriptor.py
modelopt/torch/puzzletron/anymodel/models/qwen3_vl/qwen3_vl_model_descriptor.py
modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py
modelopt/torch/puzzletron/pruning/pruning_utils.py
modelopt/torch/puzzletron/sewing_kit/passage.py
modelopt/torch/puzzletron/sewing_kit/utils.py
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py
modelopt/torch/puzzletron/tools/hydra_utils.py
modelopt/torch/puzzletron/utils/data/dataloaders.py
modelopt/torch/puzzletron/utils/data/dataset.py
modelopt/torch/puzzletron/utils/parsing.py
tests/unit/torch/puzzletron/test_bypass_dataloaders.py
tests/unit/torch/puzzletron/test_bypass_losses.py
tests/unit/torch/puzzletron/test_child_init_mixins.py
tests/unit/torch/puzzletron/test_kv_heads_pruning_utils.py
tests/unit/torch/puzzletron/test_sewing_kit_activity_context.py
tests/unit/torch/puzzletron/test_sewing_kit_function_target_kwargs.py
tests/unit/torch/puzzletron/test_sewing_kit_input_args.py
tests/unit/torch/puzzletron/test_sewing_kit_needle.py

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/puzzletron/sewing_kit/utils.py`:
- Around line 540-542: Validate that epsilon is strictly positive before
computing den; in the function that computes num = ((input - target) **
2).sum(dim=norm_dims) and den = (target**2).sum(dim=norm_dims) + epsilon, add a
guard at the start (before the denominator math) that either raises a ValueError
with a clear message if epsilon <= 0, or clamps epsilon to a small positive
floor (e.g., max(epsilon, 1e-12)); ensure the check references the epsilon
variable and occurs before computing den to prevent any inf/nan from division.

In `@modelopt/torch/puzzletron/utils/data/dataloaders.py`:
- Around line 113-121: The shuffle call for map-style datasets currently
hardcodes keep_in_memory=True and ignores the function argument; update the
branch that handles non-IterableDataset so that it passes the caller's
keep_in_memory parameter (the function arg named keep_in_memory) into
train_data.shuffle(seed=shuffle_seed, keep_in_memory=keep_in_memory) while
leaving IterableDataset.shuffle(seed=shuffle_seed) unchanged; reference the
symbols train_data, datasets.IterableDataset, shuffle_seed, and keep_in_memory
to locate and modify the code.
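
The epsilon guard requested for `sewing_kit/utils.py` can be illustrated with a dependency-free sketch. It works on nested Python lists instead of tensors, and the name `batched_normalized_mse_sketch` is hypothetical. It follows the numerator/denominator form quoted in the comment: squared error over squared target norm plus `epsilon`, averaged over the batch.

```python
from typing import Sequence


def batched_normalized_mse_sketch(
    inputs: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    epsilon: float = 1e-6,
) -> float:
    """Mean over the batch of ||x - t||^2 / (||t||^2 + epsilon). Illustrative only."""
    # Guard first: a non-positive epsilon allows division by zero on all-zero targets.
    if epsilon <= 0:
        raise ValueError(f"epsilon must be > 0, got {epsilon}")
    if len(inputs) != len(targets) or len(inputs) == 0:
        raise ValueError(f"batch sizes differ or are empty: {len(inputs)} vs {len(targets)}")

    per_sample = []
    for x, t in zip(inputs, targets):
        # Explicit shape check instead of relying on silent broadcasting.
        if len(x) != len(t):
            raise ValueError(f"sample lengths differ: {len(x)} vs {len(t)}")
        num = sum((xi - ti) ** 2 for xi, ti in zip(x, t))
        den = sum(ti**2 for ti in t) + epsilon
        per_sample.append(num / den)
    return sum(per_sample) / len(per_sample)
```

Raising on `epsilon <= 0` is one of the two options named in the comment. Clamping to a small positive floor is the other, and the sketch does not take a position on which the PR should use.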

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5e8b4997-90ef-408e-b03d-7bb26b85189d

📥 Commits

Reviewing files that changed from the base of the PR and between a79fbae and 12086fb.

📒 Files selected for processing (23)

modelopt/torch/puzzletron/anymodel/model_descriptor/base.py
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_model_descriptor.py
modelopt/torch/puzzletron/anymodel/models/nemotron_h/nemotron_h_model_descriptor.py
modelopt/torch/puzzletron/anymodel/models/nemotron_h_v2/nemotron_h_v2_model_descriptor.py
modelopt/torch/puzzletron/anymodel/models/qwen3_vl/qwen3_vl_model_descriptor.py
modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py
modelopt/torch/puzzletron/pruning/pruning_utils.py
modelopt/torch/puzzletron/sewing_kit/passage.py
modelopt/torch/puzzletron/sewing_kit/utils.py
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py
modelopt/torch/puzzletron/tools/hydra_utils.py
modelopt/torch/puzzletron/utils/data/dataloaders.py
modelopt/torch/puzzletron/utils/data/dataset.py
modelopt/torch/puzzletron/utils/parsing.py
tests/unit/torch/puzzletron/test_bypass_dataloaders.py
tests/unit/torch/puzzletron/test_bypass_losses.py
tests/unit/torch/puzzletron/test_child_init_mixins.py
tests/unit/torch/puzzletron/test_hydra_utils.py
tests/unit/torch/puzzletron/test_kv_heads_pruning_utils.py
tests/unit/torch/puzzletron/test_sewing_kit_activity_context.py
tests/unit/torch/puzzletron/test_sewing_kit_function_target_kwargs.py
tests/unit/torch/puzzletron/test_sewing_kit_input_args.py
tests/unit/torch/puzzletron/test_sewing_kit_needle.py

✅ Files skipped from review due to trivial changes (3)

modelopt/torch/puzzletron/sewing_kit/passage.py
modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py
tests/unit/torch/puzzletron/test_sewing_kit_input_args.py

🚧 Files skipped from review as they are similar to previous changes (16)

modelopt/torch/puzzletron/utils/data/dataset.py
tests/unit/torch/puzzletron/test_sewing_kit_function_target_kwargs.py
modelopt/torch/puzzletron/anymodel/model_descriptor/base.py
modelopt/torch/puzzletron/tools/hydra_utils.py
modelopt/torch/puzzletron/anymodel/models/nemotron_h/nemotron_h_model_descriptor.py
tests/unit/torch/puzzletron/test_child_init_mixins.py
modelopt/torch/puzzletron/anymodel/models/nemotron_h_v2/nemotron_h_v2_model_descriptor.py
modelopt/torch/puzzletron/pruning/pruning_utils.py
tests/unit/torch/puzzletron/test_kv_heads_pruning_utils.py
tests/unit/torch/puzzletron/test_sewing_kit_activity_context.py
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_model_descriptor.py
tests/unit/torch/puzzletron/test_bypass_losses.py
modelopt/torch/puzzletron/utils/parsing.py
tests/unit/torch/puzzletron/test_bypass_dataloaders.py
tests/unit/torch/puzzletron/test_sewing_kit_needle.py
modelopt/torch/puzzletron/anymodel/models/qwen3_vl/qwen3_vl_model_descriptor.py

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Separius · 2026-05-13T07:04:27Z

/claude review

Separius · 2026-05-19T07:12:27Z

/claude review

Separius · 2026-05-19T13:43:42Z

@AAnoosheh ready for review

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Separius · 2026-05-22T10:52:53Z

@kevalmorabia97 ready for review

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

kevalmorabia97 · 2026-05-28T10:28:24Z

/ok to test e567f57

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Separius · 2026-05-28T11:13:46Z

/ok to test 3084194

Separius · 2026-05-29T06:35:40Z

/ok to test dccf464

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Separius · 2026-05-29T08:20:54Z

/ok to test eada923

Separius · 2026-05-29T11:01:24Z

/ok to test 11c1eea

Separius mentioned this pull request May 12, 2026

Add bypass distillation (blockwise local KD) to puzzletron pipeline #1111

Closed

Separius force-pushed the ssameni/puzzletron-bypass-1-prereqs branch from 566cb1d to 0639883 Compare May 12, 2026 10:51

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/tools/bypassed_training/child_init.py Outdated

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/tools/bypassed_training/child_init.py Outdated

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/tools/hydra_utils.py Outdated

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/pruning/pruning_utils.py Outdated

Separius force-pushed the ssameni/puzzletron-bypass-1-prereqs branch from 0639883 to a79fbae Compare May 12, 2026 11:19

Separius marked this pull request as ready for review May 12, 2026 11:32

Separius requested a review from a team as a code owner May 12, 2026 11:32

Separius requested review from AAnoosheh and kevalmorabia97 May 12, 2026 11:32

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

Separius added 2 commits May 12, 2026 16:09

Split bypass prerequisites

986c5fb

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Address CodeRabbit feedback for bypass integration

12086fb

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Separius force-pushed the ssameni/puzzletron-bypass-1-prereqs branch from a79fbae to 12086fb Compare May 12, 2026 14:11

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/sewing_kit/utils.py

Comment thread modelopt/torch/puzzletron/utils/data/dataloaders.py Outdated

Address additional MR1 review feedback

b9c00ba

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Separius added 2 commits May 13, 2026 10:22

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

bb4217c

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

d052cce

github-actions Bot reviewed May 19, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/tools/hydra_utils.py Outdated

github-actions Bot reviewed May 19, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/tools/bypassed_training/child_init.py Outdated

chochowski reviewed May 19, 2026

View reviewed changes

Comment thread modelopt/torch/puzzletron/pruning/pruning_utils.py Outdated

Separius added 6 commits May 20, 2026 10:45

Use descriptor for pruning LM config

1e7f9a7

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Add pruning descriptor coverage

4f69204

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Apply pre-commit formatting

a38b8b9

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Add targeted puzzletron bypass tests

33d2d6d

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Apply puzzletron test formatting

4ad3b56

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

369b450

Separius added 4 commits May 26, 2026 08:41

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

1475f41

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

3fe86e8

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

e5b0d4b

Prune redundant Puzzletron tests

e567f57

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

kevalmorabia97 approved these changes May 28, 2026

View reviewed changes

Apply Puzzletron test formatting

3084194

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

chochowski approved these changes May 28, 2026

View reviewed changes

Separius removed the request for review from AAnoosheh May 28, 2026 11:08

Separius enabled auto-merge (squash) May 28, 2026 11:09

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

dccf464

Disable async save for Megatron Bridge distill export

eada923

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>

Separius requested a review from a team as a code owner May 29, 2026 08:03

Separius requested a review from jenchen13 May 29, 2026 08:03

Merge branch 'main' into ssameni/puzzletron-bypass-1-prereqs

11c1eea

Separius merged commit a9c156e into main May 29, 2026
49 checks passed

Separius deleted the ssameni/puzzletron-bypass-1-prereqs branch May 29, 2026 12:01
