Nemotron-H: MLP gate cleanup #3

DominguesM · 2025-08-25T17:39:42Z

Summary:

MLP cleanup for Nemotron-H: pass NULL for unused gate tensors.

Changes:

MLP gate cleanup (src/llama-model.cpp:14337): in llm_build_nemotronh::build_ffn_layer, pass NULL for the gate tensors to make it explicit that the MLP has no gate.
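
To illustrate what "no gate" means for this MLP, here is a minimal standalone sketch of an ungated feed-forward block with a squared-ReLU activation, which is what LLM_FFN_RELU_SQR refers to. It does not use ggml or the real build_ffn helper: `Matrix`, `matvec`, and `ffn_relu_sqr` are hypothetical names invented for this example, and the gated branch is included only for contrast.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense row-major matrix (NOT a ggml tensor).
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data; // rows * cols entries

    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// y = W * x
static std::vector<float> matvec(const Matrix & w, const std::vector<float> & x) {
    assert(w.cols == x.size());
    std::vector<float> y(w.rows, 0.0f);
    for (std::size_t r = 0; r < w.rows; ++r) {
        for (std::size_t c = 0; c < w.cols; ++c) {
            y[r] += w.at(r, c) * x[c];
        }
    }
    return y;
}

// Squared ReLU: max(0, v)^2
static float relu_sqr(float v) {
    return v > 0.0f ? v * v : 0.0f;
}

// Hypothetical FFN sketch. Passing gate == nullptr selects the ungated path:
//   y = down * relu_sqr(up * x)
// With a gate (illustrative contrast only):
//   y = down * (relu_sqr(gate * x) .* (up * x))
static std::vector<float> ffn_relu_sqr(const Matrix & up,
                                       const Matrix * gate,
                                       const Matrix & down,
                                       const std::vector<float> & x) {
    std::vector<float> h = matvec(up, x);
    if (gate != nullptr) {
        const std::vector<float> g = matvec(*gate, x);
        assert(g.size() == h.size());
        for (std::size_t i = 0; i < h.size(); ++i) {
            h[i] = relu_sqr(g[i]) * h[i];
        }
    } else {
        for (float & v : h) {
            v = relu_sqr(v);
        }
    }
    return matvec(down, h);
}
```

Passing a null gate keeps the call site honest about the architecture: a reader does not have to check whether a gate tensor was loaded to know that the activation is applied directly to the up projection.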

Deferred (out of scope here):

Hybrid GGUF metadata, the block_type array, and hybrid cache plumbing.
Causal Conv1D padding tweaks beyond the default d_conv - 1 (kept as-is).
MLP activation as an hparam (Nemotron-H remains LLM_FFN_RELU_SQR).
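
For readers unfamiliar with the d_conv - 1 default mentioned in the deferred list, the following is a minimal single-channel sketch of a causal 1D convolution that left-pads the input with d_conv - 1 zeros, so each output position only sees the current and earlier inputs. It is a standalone illustration, not the ggml convolution used by llama.cpp; `causal_conv1d` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Causal 1D convolution over one channel (illustrative sketch only).
// x: input sequence of length T, w: kernel of length d_conv.
// The input is left-padded with d_conv - 1 zeros, so y[t] depends only on
// x[t - d_conv + 1 .. t]; w.back() multiplies the current sample x[t].
static std::vector<float> causal_conv1d(const std::vector<float> & x,
                                        const std::vector<float> & w) {
    assert(!w.empty());
    const std::size_t d_conv = w.size();
    const std::size_t pad    = d_conv - 1;

    // Left-pad with zeros.
    std::vector<float> padded(pad + x.size(), 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i) {
        padded[pad + i] = x[i];
    }

    // Slide the kernel; output length equals input length.
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t t = 0; t < x.size(); ++t) {
        for (std::size_t k = 0; k < d_conv; ++k) {
            y[t] += w[k] * padded[t + k];
        }
    }
    return y;
}
```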

Status:

Converts and loads; generation works. Sample output below.

Sample generation

main: prompt: '<SPECIAL_10>System

<SPECIAL_11>User
/no_think Write a haiku about GPUs
<SPECIAL_11>Assistant
<think></think>'
main: number of tokens in prompt = 26
     1 -> '<s>'
    10 -> '<SPECIAL_10>'
  6775 -> 'System'
  1267 -> '

'
    11 -> '<SPECIAL_11>'
  4019 -> 'User'
  1010 -> '
'
 75010 -> '/no'
 31180 -> '_th'
  2077 -> 'ink'
 27040 -> ' Write'
  1261 -> ' a'
  3944 -> ' ha'
 15636 -> 'iku'
  2314 -> ' about'
 20534 -> ' GP'
  7176 -> 'Us'
  1010 -> '
'
    11 -> '<SPECIAL_11>'
102897 -> 'Assistant'
  1010 -> '
'
 49250 -> '<th'
  2077 -> 'ink'
  4468 -> '></'
 74045 -> 'think'
  1062 -> '>'

sampler seed: 3146942522
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1

System

User
/no_think Write a haiku about GPUs
Assistant
<think></think>Silicon hearts beat,
Light-speed math in parallel,
GPUs dream in pixels.
 [end of text]

Addresses ggml-org#15409 - Feature Request: Support for NVidia Nemotron Nano v2

gabe-l-hart

Thank you for putting this together! I've stumbled on a bunch of the same fixes as well. I'd love to get your name on the final merge, so there are two things I think we could keep from this PR to help:

1. Some basic cleanup (e.g. using NULL for unused gate tensors in the MLP)
2. Possibly the inclusion of the plumbing to use ssm_dt_rank. I need to test this one more to see what difference it actually makes.

gabe-l-hart · 2025-08-28T14:01:06Z

src/llama-model.cpp

-            cur = build_norm(inpL,
-                    model.layers[il].attn_norm, NULL,
-                    LLM_NORM_RMS, il);
+            ggml_tensor * norm_w = hparams.is_recurrent(il) || hparams.n_ff(il) == 0


Nice find. This was the last fix I needed on my branch to get it working.

DominguesM · 2025-08-28T15:13:03Z

> Thank you for putting this together! I've stumbled on a bunch of the same fixes as well. I'd love to get your name on the final merge, so there are two things I think we could keep from this PR to help:
>
> 1. Some basic cleanup (e.g. using `NULL` for unused gate tensors in the MLP)
> 2. _Possibly_ the inclusion of the plumbing to use `ssm_dt_rank`. I need to test this one more to see what difference it actually makes.

Perfect, I'll do it

This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.

Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16).
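
The selection rule in that commit message can be sketched as a small standalone function. This is an illustration under assumptions, not the code from the PR: `ssm_hparams_sketch` and `select_dt_dim` are hypothetical names, and the struct models only the two fields the rule needs, with `ssm_dt_rank == 0` standing for "not provided by the GGUF metadata".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical subset of the model hyperparameters (not the real llama_hparams).
struct ssm_hparams_sketch {
    int64_t n_embd;      // embedding width
    int64_t ssm_dt_rank; // time_step_rank from GGUF; 0 if absent (assumption)
};

// Prefer the GGUF-provided rank when it is positive,
// otherwise fall back to max(64, n_embd / 16).
static int64_t select_dt_dim(const ssm_hparams_sketch & hp) {
    assert(hp.n_embd > 0);
    if (hp.ssm_dt_rank > 0) {
        return hp.ssm_dt_rank;
    }
    return std::max<int64_t>(64, hp.n_embd / 16);
}
```

For example, with n_embd = 4096 and no rank in the metadata the fallback gives 256, while n_embd = 512 gives 64 because 512 / 16 = 32 falls below the floor.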

gabe-l-hart

I don't think we should modify PLAMO2 in this PR, but I'm curious how those dt_rank changes would change the result for nemotronh if added there.

gabe-l-hart · 2025-08-28T16:34:57Z

src/llama-model.cpp

                    const int64_t num_attention_heads = hparams.n_head();
                    const int64_t q_num_heads         = num_attention_heads;
-                    const int64_t dt_dim              = std::max(64, int(hparams.n_embd / 16));
+                    const int64_t dt_dim              = hparams.ssm_dt_rank > 0


Since this is in the LLM_ARCH_PLAMO2 case, I don't think it is actually being exercised in this PR. I'd be curious what the result would be if you made the same change for the nemotronh model.

gabe-l-hart · 2025-08-28T16:35:07Z

src/llama-model.cpp

                NULL,
                LLM_FFN_RELU_SQR, LLM_FFN_PAR, il);
-                cb(cur, "ffn_out", il);
+        cb(cur, "ffn_out", il);


gabe-l-hart · 2025-08-28T16:36:09Z

src/llama-model.cpp


            // split into dt, B, C
-            const int64_t dt_dim = std::max(64, int(hparams.n_embd / 16));
+            const int64_t dt_dim = hparams.ssm_dt_rank > 0


Similarly, this is in build_plamo2_mamba_layer, so not taking effect for nemotronh. Can you try moving this over and see what happens?

I reverted the PLaMo‑2 dt_dim change to the default max(64, n_embd/16) for now (see src/llama-model.cpp:3256 and src/llama-model.cpp:17266). I’ll test the ssm_dt_rank variant separately and, if it looks beneficial, open a focused PR.

gabe-l-hart

Approving the cleanups for now. Thanks!


DominguesM mentioned this pull request Aug 25, 2025

Feature Request: Support for NVidia Nemotron Nano v2 ggml-org/llama.cpp#15409

Closed

4 tasks

gabe-l-hart requested changes Aug 28, 2025

gabe-l-hart force-pushed the gabe-l-hart/nvidia-nemotron-nano-15409 branch from c34651d to 3132915 Compare August 28, 2025 14:24

gabe-l-hart reviewed Aug 28, 2025

DominguesM added 2 commits August 28, 2025 12:48

Nemotron-H: MLP gate cleanup (pass NULL for unused gate)

ab53234

This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.

SSM: respect ssm_dt_rank for dt_dim when provided

b3304da

Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16).

DominguesM force-pushed the nvidia-nemotron-nano-v2 branch from 1f31dc5 to b3304da Compare August 28, 2025 15:48

DominguesM changed the title ~~Update NemotronH: GGUF metadata and hybrid caches~~ Nemotron-H: MLP gate cleanup + honor ssm_dt_rank for dt_dim Aug 28, 2025

gabe-l-hart requested changes Aug 28, 2025

fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)

4223a1f

gabe-l-hart marked this pull request as ready for review August 28, 2025 16:40

gabe-l-hart approved these changes Aug 28, 2025

gabe-l-hart merged commit 7503535 into gabe-l-hart:gabe-l-hart/nvidia-nemotron-nano-15409 Aug 28, 2025
45 of 47 checks passed

DominguesM changed the title ~~Nemotron-H: MLP gate cleanup + honor ssm_dt_rank for dt_dim~~ Nemotron-H: MLP gate cleanup Aug 28, 2025
