[Model] Nemotron-H 3.5: quantized LM head and MTP compressed-tensors fix #43342
Open
askliar wants to merge 1 commit into
- NemotronHForCausalLM, NemotronHMTP: pass quant_config to ParallelLMHead so quantized LM heads load correctly.
- NemotronHMTP: extend compressed_tensors_config.ignore with the per-expert MTP linears (gate_proj / up_proj / down_proj). Those are BF16 in the released checkpoints and would otherwise fail the quant-method dispatch.

Test plan
- End-to-end serve Nemotron-H 3.5 (modelopt NVFP4 W4A4): verifies the quantized LM head loads.
- End-to-end serve Nemotron-H 3.5 (compressed-tensors `nvfp4-pack-quantized` W4A16): verifies the MTP ignore-list extension.

AI assistance was used for drafting; every changed line was reviewed locally by the submitter.

Co-authored-by: Claude
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
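The ignore-list extension described in the commit message can be illustrated with a minimal, self-contained sketch. It does not use vLLM: `ToyCompressedTensorsConfig`, `extend_ignore_with_mtp_experts`, and the `mtp.layers.0.mixer.experts.<id>` module-name layout are all hypothetical stand-ins chosen for illustration, and the real checkpoint names may differ. Only the three projection names come from the PR text.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCompressedTensorsConfig:
    """Hypothetical stand-in for a compressed-tensors quantization config."""
    ignore: list[str] = field(default_factory=list)


def extend_ignore_with_mtp_experts(
    config: ToyCompressedTensorsConfig,
    layer_prefix: str,
    num_experts: int,
) -> None:
    """Append per-expert MTP linear names to config.ignore, skipping duplicates."""
    seen = set(config.ignore)
    for expert_id in range(num_experts):
        # Projection names taken from the PR description.
        for proj in ("gate_proj", "up_proj", "down_proj"):
            # The prefix layout below is an assumption, not the real naming.
            name = f"{layer_prefix}.experts.{expert_id}.{proj}"
            if name not in seen:
                config.ignore.append(name)
                seen.add(name)


config = ToyCompressedTensorsConfig(ignore=["lm_head"])
extend_ignore_with_mtp_experts(config, "mtp.layers.0.mixer", num_experts=2)
print(config.ignore)
```

In this sketch the extension is idempotent, so running it twice leaves the list unchanged. Per the PR description, the real change performs the extension before the layers are constructed, so that the quant-method dispatch already sees the BF16 experts as ignored.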
Code Review
This pull request updates the NemotronH and NemotronH-MTP models to support quantization configurations by passing the quantization config to the language model head. Additionally, it implements logic in the NemotronH-MTP model to automatically exclude specific per-expert linear layers from compressed-tensors quantization when they are expected to remain in BF16 format. I have no feedback to provide.
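The first change summarized in this review, forwarding the quantization config to the language model head, can be sketched with toy classes. Everything here (`ToyQuantConfig`, `ToyLMHead`, `ToyCausalLM`, the `load_path` attribute) is hypothetical and says nothing about how vLLM's `ParallelLMHead` is implemented; it only shows why a head that never receives the config falls back to an unquantized path.

```python
from typing import Optional


class ToyQuantConfig:
    """Hypothetical quantization config that only carries a method name."""

    def __init__(self, method: str) -> None:
        self.method = method


class ToyLMHead:
    """Hypothetical LM head that selects a load path from quant_config."""

    def __init__(
        self,
        vocab_size: int,
        hidden_size: int,
        quant_config: Optional[ToyQuantConfig] = None,
    ) -> None:
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        # With no config, the head falls back to the unquantized path.
        self.load_path = quant_config.method if quant_config else "unquantized"


class ToyCausalLM:
    """Hypothetical model wrapper that owns the LM head."""

    def __init__(
        self,
        vocab_size: int,
        hidden_size: int,
        quant_config: Optional[ToyQuantConfig] = None,
    ) -> None:
        self.quant_config = quant_config
        # Analogue of the one-line change: forward the config to the head.
        self.lm_head = ToyLMHead(
            vocab_size, hidden_size, quant_config=self.quant_config
        )


quantized = ToyCausalLM(32, 8, ToyQuantConfig("nvfp4"))
plain = ToyCausalLM(32, 8)
print(quantized.lm_head.load_path, plain.lm_head.load_path)
```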
Purpose
Two model-side fixes needed to run Nemotron-H 3.5 on the released NVFP4 checkpoints.
Part 2/3 of splitting #43333 (the combined B12x + Nemotron-3.5 + Qwen3.5 PR). Logically depends on the LM-head quant routing in part 1/3 (#43341): once that routing exists, `quant_config=self.quant_config` on `ParallelLMHead` actually does something useful.

Changes
- `NemotronHForCausalLM`, `NemotronHMTP`: pass `quant_config` to `ParallelLMHead` so quantized LM heads load correctly.
- `NemotronHMTP`: extend `compressed_tensors_config.ignore` with the per-expert MTP linears (`gate_proj` / `up_proj` / `down_proj`) before constructing `self.layers`. Those linears are BF16 in the released `nvidia/Nemotron-H-3.5-*` compressed-tensors checkpoints; without this, the FusedMoE quant-method dispatch fails with `KeyError` during weight load (same failure mode as [Bugfix] Extend compressed-tensors ignore for Qwen3.5 MTP experts #41994 for Qwen3.5 MTP, different model).

Relation to existing open PRs (overlap disclosure)
- #42124 (Add LM head quantization support for ModelOpt) already includes the one-line `quant_config=self.quant_config` for `NemotronHForCausalLM.lm_head`. My PR includes the same one-liner plus the matching one for `NemotronHMTP.lm_head`. If #42124 lands first, drop the `NemotronHForCausalLM` hunk on rebase.

Unique to this PR:
- `NemotronHMTP` `quant_config` plumbing for `lm_head` (#42124 does not touch `nemotron_h_mtp.py`).
- `NemotronHMTP` compressed-tensors `ignore` extension for per-expert MTP linears, analogous to #41994 but for Nemotron-H. No open PR I found covers this for Nemotron.

Test Plan
- End-to-end serve Nemotron-H 3.5 (modelopt NVFP4 W4A4): verifies the quantized LM head loads.
- End-to-end serve Nemotron-H 3.5 (compressed-tensors `nvfp4-pack-quantized` W4A16): exercises the MTP ignore-list extension.

AI assistance
This change was developed with AI assistance (Claude). Every changed line was reviewed locally by the submitter.
🤖 Generated with Claude Code