-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
super nemo support #3508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
super nemo support #3508
Changes from all commits
01cd194
c8072bc
132ea3d
a2678e7
715d573
c6f1565
efc88e1
7edc52c
f899103
50aa67d
bdcf1ec
08972b5
0574d0b
23ac29f
f31e1cb
ffdfdbd
868cbcd
fc143a2
0633efc
e2bb21e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | ||
|
|
||
| # LoRA kernel patches are incompatible with this architecture — see README. | ||
| lora_mlp_kernel: false | ||
| lora_qkv_kernel: false | ||
| lora_o_kernel: false | ||
|
|
||
| chat_template: tokenizer_default | ||
| datasets: | ||
| - path: mlabonne/FineTome-100k | ||
| type: chat_template | ||
| split: train[:20%] | ||
| field_messages: conversations | ||
| message_property_mappings: | ||
| role: from | ||
| content: value | ||
|
|
||
| val_set_size: 0.0 | ||
| output_dir: ./outputs/out | ||
| dataset_prepared_path: last_run_prepared | ||
|
|
||
| sequence_len: 4096 | ||
| sample_packing: true | ||
|
|
||
| use_cut_cross_entropy: true | ||
|
|
||
| load_in_4bit: true | ||
| quantize_moe_experts: true | ||
| adapter: qlora | ||
| lora_r: 16 | ||
| lora_alpha: 32 | ||
| lora_dropout: 0.0 | ||
| lora_target_modules: | ||
| # Attention projection layers (present in ~12 attention layers out of 88) | ||
| - q_proj | ||
| - k_proj | ||
| - v_proj | ||
| - o_proj | ||
| # To also train MoE expert weights, add them via lora_target_parameters | ||
| # (they are 3D nn.Parameter tensors, not nn.Linear — no gate_proj): | ||
| # lora_target_parameters: | ||
| # - up_proj | ||
| # - down_proj | ||
|
|
||
| wandb_project: | ||
| wandb_entity: | ||
| wandb_watch: | ||
| wandb_name: | ||
| wandb_log_model: | ||
|
|
||
| gradient_accumulation_steps: 4 | ||
| micro_batch_size: 1 | ||
| num_epochs: 1 | ||
| optimizer: adamw_torch_4bit | ||
| lr_scheduler: cosine | ||
| learning_rate: 0.0002 | ||
|
|
||
| bf16: auto | ||
| tf32: true | ||
|
|
||
| gradient_checkpointing: true | ||
| gradient_checkpointing_kwargs: | ||
| use_reentrant: false | ||
|
|
||
| resume_from_checkpoint: | ||
| logging_steps: 1 | ||
| flash_attention: true | ||
|
|
||
| warmup_ratio: 0.1 | ||
| evals_per_epoch: 2 | ||
| saves_per_epoch: 1 | ||
| weight_decay: 0.0 | ||
|
|
||
| special_tokens: |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| # Nemotron-H (nvidia/NVIDIA-Nemotron-3-*) | ||
|
|
||
| Hybrid Mamba2 / Attention / MoE architecture (`model_type: nemotron_h`). | ||
|
|
||
| | Model | Total params | Active params | Layers | | ||
| |---|---|---|---| | ||
| | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | 120B | ~12B | 88 | | ||
| | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B | ~3B | — | | ||
|
|
||
| ## Requirements | ||
|
|
||
| ```bash | ||
| pip install mamba-ssm causal-conv1d # fast Mamba2 CUDA kernels | ||
| ``` | ||
|
|
||
| ## Architecture notes | ||
|
|
||
| - Three block types per layer: **Mamba2** (selective SSM), **Attention** (sparse), **MoE** (mixture-of-experts). | ||
| - Only ~12 out of 88 blocks are attention layers (120B variant). | ||
| - MLP activation is `relu2` via `mlp_hidden_act` (not the usual `hidden_act`). | ||
|
|
||
| ## LoRA kernel patches | ||
|
|
||
| All three LoRA Triton kernel patches must be disabled: | ||
|
|
||
| ```yaml | ||
| lora_qkv_kernel: false # attention lives in NemotronHBlock.mixer, not layer.self_attn | ||
| lora_o_kernel: false # same reason | ||
| lora_mlp_kernel: false # relu2 (mlp_hidden_act) is not supported by lora_mlp_kernel | ||
| ``` | ||
|
|
||
| ## MoE expert weights | ||
|
|
||
| NemotronH experts store `up_proj` and `down_proj` as 3D `nn.Parameter` tensors | ||
| (shape `[num_experts, out_dim, in_dim]`), **not** `nn.Linear` modules — there is no | ||
| `gate_proj`. To fine-tune them alongside attention, use `lora_target_parameters` | ||
| instead of `lora_target_modules`: | ||
|
|
||
| ```yaml | ||
| lora_target_parameters: | ||
| - up_proj | ||
| - down_proj | ||
| ``` | ||
|
|
||
| ## Limitations | ||
|
|
||
| - **MoE Triton kernels**: `lora_mlp_kernel` is not supported for NemotronH's MoE expert layers. The expert weights are 3D `nn.Parameter` tensors (not `nn.Linear`), which the Triton kernel does not support. Keep `lora_mlp_kernel: false`. | ||
| - **Gradient checkpointing**: Only supported when `sample_packing: true`. Without sample packing the upstream model marks `supports_gradient_checkpointing = False`. | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,74 @@ | ||||||
| # See examples/nemotron-h/README.md for architecture notes and requirements. | ||||||
| base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | ||||||
|
|
||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add cut cross entropy
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't forget this? |
||||||
| # LoRA kernel patches are incompatible with this architecture — see README. | ||||||
| lora_mlp_kernel: false | ||||||
| lora_qkv_kernel: false | ||||||
| lora_o_kernel: false | ||||||
|
|
||||||
| chat_template: tokenizer_default | ||||||
| datasets: | ||||||
| - path: mlabonne/FineTome-100k | ||||||
| type: chat_template | ||||||
| split: train[:20%] | ||||||
| field_messages: conversations | ||||||
| message_property_mappings: | ||||||
| role: from | ||||||
| content: value | ||||||
|
|
||||||
| val_set_size: 0.0 | ||||||
| output_dir: ./outputs/out | ||||||
| dataset_prepared_path: last_run_prepared | ||||||
|
|
||||||
| sequence_len: 4096 | ||||||
| sample_packing: true | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Example config conflicts with stated Nemotron-H limitation. Line 26 enables Proposed fix-sample_packing: true
+sample_packing: false📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents |
||||||
|
|
||||||
| use_cut_cross_entropy: true | ||||||
|
ved1beta marked this conversation as resolved.
|
||||||
|
|
||||||
| load_in_4bit: true | ||||||
| quantize_moe_experts: true | ||||||
| adapter: qlora | ||||||
| lora_r: 16 | ||||||
| lora_alpha: 32 | ||||||
| lora_dropout: 0.0 | ||||||
| lora_target_modules: | ||||||
| - q_proj | ||||||
| - k_proj | ||||||
| - v_proj | ||||||
| - o_proj | ||||||
| # To also train MoE expert weights, add them via lora_target_parameters | ||||||
| # (they are 3D nn.Parameter tensors, not nn.Linear — no gate_proj): | ||||||
| # lora_target_parameters: | ||||||
| # - up_proj | ||||||
| # - down_proj | ||||||
|
|
||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's add commented out section in case they want to train on experts |
||||||
| wandb_project: | ||||||
| wandb_entity: | ||||||
| wandb_watch: | ||||||
| wandb_name: | ||||||
| wandb_log_model: | ||||||
|
|
||||||
| gradient_accumulation_steps: 2 | ||||||
| micro_batch_size: 1 | ||||||
| num_epochs: 1 | ||||||
| optimizer: adamw_torch_4bit | ||||||
| lr_scheduler: cosine | ||||||
| learning_rate: 0.0002 | ||||||
|
|
||||||
| bf16: auto | ||||||
| tf32: true | ||||||
|
|
||||||
| gradient_checkpointing: true | ||||||
| gradient_checkpointing_kwargs: | ||||||
| use_reentrant: false | ||||||
|
|
||||||
| resume_from_checkpoint: | ||||||
| logging_steps: 1 | ||||||
| flash_attention: true | ||||||
|
|
||||||
| warmup_ratio: 0.1 | ||||||
| evals_per_epoch: 4 | ||||||
| saves_per_epoch: 1 | ||||||
| weight_decay: 0.0 | ||||||
|
|
||||||
| special_tokens: | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -590,9 +590,11 @@ def _set_quantization_config(self): | |
| "bnb_4bit_quant_type": "nf4", | ||
| "bnb_4bit_quant_storage": torch.bfloat16, | ||
| } | ||
| if self.cfg.model_config_type in ["jamba", "qwen2_moe"] and not ( | ||
| self.cfg.deepspeed or self.is_fsdp_enabled | ||
| ): | ||
| if self.cfg.model_config_type in [ | ||
| "jamba", | ||
| "qwen2_moe", | ||
| "nemotron_h", | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this explicitly needed?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ved1beta check
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. agreed, is this needed here? did you test without this condition? |
||
| ] and not (self.cfg.deepspeed or self.is_fsdp_enabled): | ||
| # for some reason, this causes the loss to be off by an order of magnitude | ||
| # but deepspeed needs this still in bfloat16 | ||
| bnb_config["bnb_4bit_quant_storage"] = torch.float32 | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -142,6 +142,12 @@ def _apply_transformers_patches(self): | |
|
|
||
| def apply_post_model_build_patches(self, model: PreTrainedModel): | ||
| """Apply patches right after model build, before post-load setup.""" | ||
| if self.cfg.model_config_type == "nemotron_h": | ||
| # Must run after model build because NemotronHForCausalLM.__init__ | ||
| # calls register_nemotron_h_conversion_mapping() with overwrite=True, | ||
| # which would clobber any earlier fix. | ||
| self._fix_nemotron_h_conversion_mapping() | ||
|
|
||
| self._finalize_moe_expert_quantization(model) | ||
|
|
||
| def apply_post_model_load_patches(self, model: PreTrainedModel): | ||
|
|
@@ -291,6 +297,66 @@ def _apply_model_specific_patches(self): | |
|
|
||
| patch_kimi_model() | ||
|
|
||
| if self.cfg.model_config_type == "nemotron_h": | ||
| if self.cfg.sample_packing: | ||
| from transformers.models.nemotron_h.modeling_nemotron_h import ( | ||
| NemotronHPreTrainedModel, | ||
| ) | ||
|
|
||
| from axolotl.monkeypatch.models.nemotron_h.modeling import ( | ||
| patch_nemotron_h_modeling_packing, | ||
| ) | ||
|
|
||
| patch_nemotron_h_modeling_packing() | ||
| # supports_gradient_checkpointing is only enabled after | ||
| # patch_nemotron_h_modeling_packing() installs the GC-compatible | ||
| # NemotronHBlock.forward. Without the patch, upstream marks this | ||
| # False because the original block forward is not GC-safe. | ||
|
Comment on lines
+311
to
+314
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is true, we need to raise error in validator that gradient checkpointing without packing for this model does not work |
||
| NemotronHPreTrainedModel.supports_gradient_checkpointing = True | ||
|
|
||
| @staticmethod | ||
| def _fix_nemotron_h_conversion_mapping(): | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What are the ramifications of this change? How is this related to LoRA? When an adapter is applied, the weight rename would've happened correctly? What do downstream libs expect?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removing the entry prevents save_pretrained() from applying the reverse rename (embeddings→embedding) on merge+save, which would corrupt the checkpoint key that transformers/vLLM expect. |
||
| """Remove the spurious embedding→embeddings WeightRenaming from the | ||
| nemotron_h checkpoint conversion mapping. | ||
|
|
||
| The nvidia Hub model registers: | ||
| WeightRenaming("embedding.weight", "embeddings.weight") | ||
| to handle a legacy checkpoint variant. Its reverse (applied on save) | ||
| converts ``embeddings`` back to ``embedding``, which silently renames | ||
| ``backbone.embeddings.weight`` → ``backbone.embedding.weight`` when | ||
| merging LoRA adapters back into the base model. | ||
| """ | ||
| try: | ||
| from transformers.conversion_mapping import ( | ||
| WeightRenaming, | ||
| get_checkpoint_conversion_mapping, | ||
| register_checkpoint_conversion_mapping, | ||
| ) | ||
| except ImportError: | ||
| return | ||
|
|
||
| mapping = get_checkpoint_conversion_mapping("nemotron_h") | ||
| if mapping is None: | ||
| return | ||
|
|
||
| filtered = [ | ||
| entry | ||
| for entry in mapping | ||
| if not ( | ||
| isinstance(entry, WeightRenaming) | ||
| and entry.source_patterns == ["embedding.weight"] | ||
| and entry.target_patterns == ["embeddings.weight"] | ||
| ) | ||
| ] | ||
| if len(filtered) != len(mapping): | ||
| register_checkpoint_conversion_mapping( | ||
| "nemotron_h", filtered, overwrite=True | ||
| ) | ||
| LOG.info( | ||
| "Removed embedding→embeddings WeightRenaming from nemotron_h " | ||
| "checkpoint conversion mapping" | ||
| ) | ||
|
|
||
| def _apply_fp8_patches(self): | ||
| """Apply patches for FP8 support.""" | ||
| if self.cfg.fp8: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add limitation note that MoE Kernels not supported yet