super nemo support#3508
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds comprehensive support for the Nemotron-H model architecture (a mixed Mamba2/Attention/MoE model) across the training framework. Changes include new model registration entries, model-specific loading and patching logic, a quantization defensive patch, configuration examples, and chat template support. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/nemotron-h/nano-30b-a3b-qlora.yaml`:
- Line 26: The example config enables sample_packing which contradicts the PR
objective "no sample packing"; update the config entry for sample_packing in the
nano-30b-a3b-qlora.yaml example by either removing the sample_packing line or
explicitly setting sample_packing: false and add a short inline comment
referencing the "no sample packing" constraint so users aren’t misled; locate
the sample_packing key in the YAML and change it accordingly.
In `@src/axolotl/utils/chat_templates/templates/nemotron.jinja`:
- Line 1: The template unconditionally indexes messages[0] which will raise when
messages is empty; update the nemotron.jinja condition to first check that
messages is non-empty (e.g., messages or messages|length > 0) before accessing
messages[0].role so the guard prevents rendering errors when messages is empty
and preserves the existing system-role branch behavior.
In `@src/axolotl/utils/schemas/enums.py`:
- Line 64: The ChatTemplate enum entry nemotron_h has the wrong value; change
the value of the enum member (named nemotron_h) in
src/axolotl/utils/schemas/enums.py to "nemotron" so it matches the template key
derived from the nemotron.jinja filename (see chat_templates/base.py); ensure
the enum member name can remain nemotron_h if needed but its string value must
be "nemotron" so chat_template: nemotron selects the Nemotron template.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: edd94896-cbe4-4a7a-b0a9-c95b30e78821
📒 Files selected for processing (8)
examples/nemotron-h/nano-30b-a3b-qlora.yamlsrc/axolotl/common/architectures.pysrc/axolotl/loaders/model.pysrc/axolotl/loaders/patch_manager.pysrc/axolotl/loaders/utils.pysrc/axolotl/monkeypatch/moe_quant.pysrc/axolotl/utils/chat_templates/templates/nemotron.jinjasrc/axolotl/utils/schemas/enums.py
| dataset_prepared_path: last_run_prepared | ||
|
|
||
| sequence_len: 4096 | ||
| sample_packing: true |
There was a problem hiding this comment.
Example config conflicts with stated Nemotron-H limitation.
Line 26 enables sample_packing, but the PR objective explicitly states “no sample packing.” This example will likely mislead users into an unsupported setup.
Proposed fix
-sample_packing: true
+sample_packing: false📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sample_packing: true | |
| sample_packing: false |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/nemotron-h/nano-30b-a3b-qlora.yaml` at line 26, The example config
enables sample_packing which contradicts the PR objective "no sample packing";
update the config entry for sample_packing in the nano-30b-a3b-qlora.yaml
example by either removing the sample_packing line or explicitly setting
sample_packing: false and add a short inline comment referencing the "no sample
packing" constraint so users aren’t misled; locate the sample_packing key in the
YAML and change it accordingly.
| @@ -0,0 +1,16 @@ | |||
| {%- if messages[0].role == 'system' %} | |||
There was a problem hiding this comment.
Guard against empty messages before indexing.
Line 1 accesses messages[0] unconditionally; empty inputs will fail at render time.
Proposed fix
-{%- if messages[0].role == 'system' %}
+{%- if messages and messages[0].role == 'system' %}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {%- if messages[0].role == 'system' %} | |
| {%- if messages and messages[0].role == 'system' %} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/utils/chat_templates/templates/nemotron.jinja` at line 1, The
template unconditionally indexes messages[0] which will raise when messages is
empty; update the nemotron.jinja condition to first check that messages is
non-empty (e.g., messages or messages|length > 0) before accessing
messages[0].role so the guard prevents rendering errors when messages is empty
and preserves the existing system-role branch behavior.
| qwen3 = "qwen3" | ||
| qwen3_5 = "qwen3_5" | ||
| falcon_h1 = "falcon_h1" | ||
| nemotron_h = "nemotron_h" |
There was a problem hiding this comment.
ChatTemplate value does not match the Nemotron template key.
Line 64 introduces nemotron_h, but template keys are derived from filename stems (see src/axolotl/utils/chat_templates/base.py), and the added file is nemotron.jinja (key nemotron). This prevents chat_template: nemotron_h from selecting the Nemotron template.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/utils/schemas/enums.py` at line 64, The ChatTemplate enum entry
nemotron_h has the wrong value; change the value of the enum member (named
nemotron_h) in src/axolotl/utils/schemas/enums.py to "nemotron" so it matches
the template key derived from the nemotron.jinja filename (see
chat_templates/base.py); ensure the enum member name can remain nemotron_h if
needed but its string value must be "nemotron" so chat_template: nemotron
selects the Nemotron template.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
This is correct. The template file is wrong which I left a review below on
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
NanoCode012
left a comment
There was a problem hiding this comment.
- Sample packing support
| # pip install mamba-ssm causal-conv1d # for fast Mamba2 CUDA kernels | ||
|
|
||
| base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | ||
|
|
There was a problem hiding this comment.
Add cut cross entropy
| - k_proj | ||
| - v_proj | ||
| - o_proj | ||
|
|
There was a problem hiding this comment.
Let's add commented out section in case they want to train on experts
| if self.cfg.model_config_type in [ | ||
| "jamba", | ||
| "qwen2_moe", | ||
| "nemotron_h", |
There was a problem hiding this comment.
Is this explicitly needed?
| # Patch _check_target_module_exists to skip ParametrizationList modules. | ||
| # After quantize_moe_experts runs, expert params become ParametrizationList | ||
| # modules at paths like "...experts.parametrizations.up_proj". Without this | ||
| # patch, lora_target_modules name-suffix matching finds "up_proj" there and | ||
| # tries to wrap it in LoRA, which PEFT rejects. | ||
| _original_check = BaseTuner._check_target_module_exists |
There was a problem hiding this comment.
Could you explain here what this means or why this happens? Is this for this model specifically or did we miss it earlier?
| @@ -0,0 +1,16 @@ | |||
| {%- if messages[0].role == 'system' %} | |||
There was a problem hiding this comment.
This model's arch is nemotron_h . we should keep file consistent
|
Tested this but got an error. # nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
# Hybrid Mamba2 / Attention / MoE architecture (model_type: nemotron_h)
# 120B total params, ~12B active per token, 88 layers
#
# Requirements:
# pip install mamba-ssm causal-conv1d # for fast Mamba2 CUDA kernels
# flash-attn >= 2.0 # for flash attention + sample packing
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
trust_remote_code: true
# Nemotron-H attention layers live in NemotronHBlock.mixer, not the standard
# layer.self_attn, so the lora kernel patches can't discover them.
# relu2 (mlp_hidden_act) is also unsupported by lora_mlp_kernel.
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
datasets:
- path: ./data/nemotron_sft_1_masked_20260323_232338.jsonl
val_set_size: 0.0
output_dir: ./Nemotron-3-Super-120B-SFT-1
dataset_prepared_path: last_run_prepared
sequence_len: 10756
sample_packing: true
use_cut_cross_entropy: true
load_in_4bit: true
quantize_moe_experts: true
adapter: qlora
lora_r: 16
lora_alpha: 32
peft_use_rslora: true
lora_dropout: 0.0
lora_target_modules:
# Attention projection layers (present in ~12 attention layers out of 88)
- q_proj
- k_proj
- v_proj
- o_proj
# Uncomment to also train MoE expert weights:
- up_proj
- down_proj
- gate_proj
wandb_project: Nemotron-3-Super-120B-SFT
wandb_entity:
wandb_watch:
wandb_name: Nemotron-3-Super-120B-SFT-1
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch_4bit
lr_scheduler: cosine
learning_rate: 6e-5
bf16: auto
tf32: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 2
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
pad_token: <|end_of_text|>
fsdp_config:
fsdp_version: 2
offload_params: false
cpu_ram_efficient_loading: false
auto_wrap_policy: TRANSFORMER_BASED_WRAP
transformer_layer_cls_to_wrap: NemotronHBlock
state_dict_type: FULL_STATE_DICT
sharding_strategy: FULL_SHARD
reshard_after_forward: true
activation_checkpointing: true |
|
removing this shoude work |
|
Thanks! That did the trick. |
NanoCode012
left a comment
There was a problem hiding this comment.
Let's a dd a README
| # pip install mamba-ssm causal-conv1d # for fast Mamba2 CUDA kernels | ||
|
|
||
| base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | ||
|
|
| NemotronHPreTrainedModel, | ||
| ) | ||
|
|
||
| NemotronHPreTrainedModel.supports_gradient_checkpointing = True |
There was a problem hiding this comment.
Any reason upstream sets this to False? What did you fix to satisfy this?
| return False | ||
| return _original_check(config, key) | ||
|
|
||
| BaseTuner._check_target_module_exists = _patched_check_target_module_exists |
There was a problem hiding this comment.
Is this a change that was already merged? Or maybe I misread. Any reason why this issue didn't occur for prev models we tested?
There was a problem hiding this comment.
Not already merged , did'nt occur for mixtral for deepseek ect store each expert as an individual nn.Linear module
nemotron is different its MoE experts are stored as 3D stacked nn.Parameter tensors (shape [num_experts, out, in]) , not individual nn.Linear modules.
|
@zerofata thanks for the help i have changed the pading token
not sure about this will investigate |
|
the merge key renaming is likely an issue with trust_remote_code and |
|
@ved1beta shouldn't we not need trust_remote_code since nemotron_h is supported in transformers? what's the difference in the two?
|
|
also, can you rebase this branch on main to get the latest framework updates ? |
…ing (singular noun) on checkpoint save
| # Add gate_up_proj and down_proj to also target shared experts (nn.Linear): | ||
| # - gate_up_proj | ||
| # - down_proj |
There was a problem hiding this comment.
not sure why we have qwen 3.5 examples in this super nemotron PR. also, I don't think this is correct either
| if self.cfg.model_config_type in [ | ||
| "jamba", | ||
| "qwen2_moe", | ||
| "nemotron_h", |
There was a problem hiding this comment.
agreed, is this needed here? did you test without this condition?
winglian
left a comment
There was a problem hiding this comment.
other than the stray qwen3.5 example YAML's, this looks good to go once those are removed
| lora_target_parameters: | ||
| - up_proj | ||
| - down_proj | ||
| ``` |
There was a problem hiding this comment.
Add limitation note that MoE Kernels not supported yet
| if self.cfg.model_config_type in [ | ||
| "jamba", | ||
| "qwen2_moe", | ||
| "nemotron_h", |
| # supports_gradient_checkpointing is only enabled after | ||
| # patch_nemotron_h_modeling_packing() installs the GC-compatible | ||
| # NemotronHBlock.forward. Without the patch, upstream marks this | ||
| # False because the original block forward is not GC-safe. |
There was a problem hiding this comment.
If this is true, we need to raise error in validator that gradient checkpointing without packing for this model does not work
| NemotronHPreTrainedModel.supports_gradient_checkpointing = True | ||
|
|
||
| @staticmethod | ||
| def _fix_nemotron_h_conversion_mapping(): |
There was a problem hiding this comment.
What are the ramifications of this change? How is this related to LoRA? When an adapter is applied, the weight rename would've happened correctly?
What do downstream libs expect?
There was a problem hiding this comment.
Removing the entry prevents save_pretrained() from applying the reverse rename (embeddings→embedding) on merge+save, which would corrupt the checkpoint key that transformers/vLLM expect.
| try: | ||
| import transformers.modeling_flash_attention_utils as fa_utils | ||
|
|
||
| from axolotl.monkeypatch.utils import get_unpad_data | ||
|
|
||
| fa_utils._get_unpad_data = get_unpad_data | ||
| except Exception as exc: # pragma: no cover | ||
| LOG.warning("Failed to patch _get_unpad_data for NemotronH: %s", exc) |
There was a problem hiding this comment.
I'm not sure you need to patch this here? FA monkey patch would've handled this?
There was a problem hiding this comment.
nemotron_h is excluded from SUPPORTED_MULTIPACK_MODEL_TYPES,
the standard patch_for_multipack() only patches _get_unpad_data, so the entire packing suite is handled together in _apply_model_patches() instead.
There was a problem hiding this comment.
let's just add nemotron_h to SUPPORTED_MULTIPACK_MODEL_TYPES , so we don't need to duplicate this code here
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

super nemo support
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
no sample packing T_T'
examples/nemotron-h/nano-30b-a3b-qlora.yamlmax_vram :69Gbs
chat template : cladue
Summary by CodeRabbit
New Features
Documentation