
Feature Request: Modified llama.cpp to support Llama-3_1-Nemotron-51B #10648

@ymcki

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I modified llama.cpp (b4139) to support Llama-3_1-Nemotron-51B:
https://github.com/ymcki/llama.cpp-b4139

GGUFs:
https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF

I tested the code with DeciLM-7B and Mistral-7B-v0.3 to ensure backward compatibility with llama-architecture models. It would be great if these modifications could be merged into the main branch.

Motivation

Llama-3_1-Nemotron-51B offers a significant reduction in model size compared to Llama-3.1-Nemotron-70B.

Possible Implementation

The changes to src/llama.cpp are minimal. The logic is to have three blocks of code in build_llama
to handle the three types of layers in Llama-3_1-Nemotron-51B (a small sketch of the branching logic follows the list):

  1. Normal attention layer - when n_head > 0 and n_head_kv > 0
  • Just like the original llama code, but with support for variable grouped-query attention; essentially n_head_kv is replaced with n_head_kv(il).
  2. Linear attention layer - when n_head > 0 and n_head_kv == 0
  • Based on my understanding of modeling_decilm.py and Grok's similar "linear attention", the linear attention weight is essentially a {n_embd,n_embd} W_O matrix, so all that is needed is to mat_mul it with cur.
  3. Attention-free layer - when n_head == 0 and n_head_kv == 0
  • No attn_norm or other attention weights. Also, instead of adding inpSA to cur after the self-attention block, ffn_inp should simply be set to cur in this case.
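
To make the branching conditions concrete, here is a minimal, self-contained Python sketch of the same dispatch. It is only an illustration of the conditions listed above; the actual change lives in the C++ build_llama in src/llama.cpp, and the function name here is made up for the sketch.

# Illustration only: the real dispatch is C++ code inside build_llama.
# Classify a layer by its per-layer head counts, mirroring the three cases above.
def classify_layer(n_head: int, n_head_kv: int) -> str:
    if n_head > 0 and n_head_kv > 0:
        return "attention"          # normal (grouped-query) attention layer
    if n_head > 0 and n_head_kv == 0:
        return "linear_attention"   # single {n_embd,n_embd} W_O mat_mul with cur
    if n_head == 0 and n_head_kv == 0:
        return "attention_free"     # no attn_norm/attn weights; ffn_inp = cur
    raise ValueError(f"unexpected head counts: n_head={n_head}, n_head_kv={n_head_kv}")

if __name__ == "__main__":
    # One hypothetical example of each layer type.
    for nh, nkv in [(64, 8), (64, 0), (0, 0)]:
        print(nh, nkv, "->", classify_layer(nh, nkv))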

The changes to convert_hf_to_gguf.py are mainly to read the block_configs parameters at the init phase.
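
For illustration, a rough, self-contained Python sketch of that kind of parsing is below. The block_configs field names used here (no_op, replace_with_linear, n_heads_in_group) are assumptions based on the DeciLM-style config.json; this is not a copy of the actual convert_hf_to_gguf.py change.

import json

# Sketch only: derive per-layer head counts from a DeciLM-style block_configs.
# The attention field names are assumptions about the model's config.json.
def per_layer_heads(config_path: str) -> tuple[list[int], list[int]]:
    with open(config_path) as f:
        cfg = json.load(f)

    n_head = cfg["num_attention_heads"]
    heads: list[int] = []
    kv_heads: list[int] = []
    for block in cfg["block_configs"]:
        attn = block["attention"]
        if attn.get("no_op"):                  # attention-free layer
            heads.append(0)
            kv_heads.append(0)
        elif attn.get("replace_with_linear"):  # linear attention layer
            heads.append(n_head)
            kv_heads.append(0)
        else:                                  # normal grouped-query attention
            heads.append(n_head)
            kv_heads.append(n_head // attn["n_heads_in_group"])
    return heads, kv_heads

The two lists can then be written out as per-layer head-count metadata so that n_head(il) and n_head_kv(il) are available for each layer.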

One line is added to gguf-py/gguf/tensor_mapping.py to map this model's specific tensor name
"model.layers.{bid}.self_attn.linear_attn", # deci
to MODEL_TENSOR.ATTN_OUT.
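
As a toy, standalone illustration of the "{bid}" name templating that this mapping performs (the GGUF-side name blk.{bid}.attn_output is the attention output tensor; the dictionary and helper below are hypothetical, not the gguf-py API):

# Hypothetical mini version of the name templating done by tensor_mapping.py.
TENSOR_MAP = {
    "model.layers.{bid}.self_attn.linear_attn": "blk.{bid}.attn_output",  # deci
}

def map_tensor(name: str, n_blocks: int) -> str | None:
    # Expand the {bid} placeholder for every block and look for a match.
    for bid in range(n_blocks):
        for src, dst in TENSOR_MAP.items():
            if name == src.format(bid=bid):
                return dst.format(bid=bid)
    return None

print(map_tensor("model.layers.3.self_attn.linear_attn", 80))  # -> blk.3.attn_output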

A more controversial change was made in gguf-py/gguf/vocab.py.
The correct special tokens for this model should be

bos 128000
eos 128001
eom 128008
eot 128009

However, line 2055 of the original tokenizer_config.json for this model has

"eos_token": "<|eot_id|>",

while config.json corrects this error with

"eos_token_id": [
    128001,
    128008,
    128009
  ],

This seems to work with transformers but not with llama.cpp, because overriding a token id is not allowed there, so I removed the two lines that disallow the override and added functionality to read the eos_token_id array.
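
The shape of that added logic is roughly the following standalone Python sketch (an illustration, not the actual vocab.py code; treating the first id as eos and the remaining ids as additional end-of-generation tokens is an assumption here):

import json

# Sketch only: normalize eos_token_id from config.json, which may be either a
# single int or a list of ints, into a primary eos id plus additional ids.
def read_eos_ids(config_path: str) -> tuple[int, list[int]]:
    with open(config_path) as f:
        cfg = json.load(f)

    eos = cfg.get("eos_token_id")
    if isinstance(eos, int):
        return eos, []
    if isinstance(eos, list) and eos and all(isinstance(t, int) for t in eos):
        return eos[0], eos[1:]   # e.g. 128001 as eos, 128008/128009 as extras
    raise ValueError("config.json has no usable eos_token_id")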

I am not sure whether this modification to vocab.py can break other things. Maybe it is possible to do this exclusively in convert_hf_to_gguf.py without touching vocab.py?
