Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I modified llama.cpp-b4139 to support Llama-3_1-Nemotron-51B:
https://github.com/ymcki/llama.cpp-b4139
GGUFs:
https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF
I tested the code with DeciLM-7B and Mistral-7B-v0.3 to ensure backward compatibility with llama-architecture models. It would be great if these modifications could be merged into the main branch.
Motivation
Significant reduction in model size compared to Llama-3.1-Nemotron-70B.
Possible Implementation
Changes to src/llama.cpp are minimal. The logic is to have three blocks of code in build_llama to handle the three types of layers in Llama-3_1-Nemotron-51B (a sketch follows the list below):
- Normal attention layer - when n_head > 0 and n_head_kv > 0
  - Just like the original llama code, but with support for variable query attention; essentially n_head_kv is replaced with n_head_kv(il).
- Linear attention layer - when n_head > 0 and n_head_kv == 0
  - Based on my understanding of modeling_decilm.py and Grok's similar "linear attention", I concluded that the linear attention weight is essentially an {n_embd, n_embd} W_O matrix, so all I need to do is mat_mul it with cur.
- Attention-free layer - when n_head == 0 and n_head_kv == 0
  - No attn_norm or other attention weights. Also, instead of adding inpSA to cur after the self-attention block, ffn_inp should simply be set to cur in this case.
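To make the branching concrete, here is a minimal Python sketch of the per-layer dispatch. The real change lives in build_llama in src/llama.cpp; rms_norm, attention_fn, w["attn_norm"] and w["wo"] are placeholder names of mine, and the assumption that linear attention layers keep their attention norm is mine as well.

```python
import numpy as np

def rms_norm(x, w, eps=1e-5):
    # Standard RMS norm, standing in for the attn_norm step.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * w

def layer_forward(inpSA, w, n_head, n_head_kv, attention_fn):
    # Dispatch on the per-layer head counts, mirroring the three cases above.
    if n_head > 0 and n_head_kv > 0:
        # Normal attention layer: the usual llama path, except n_head_kv is
        # the per-layer value n_head_kv(il).
        cur = rms_norm(inpSA, w["attn_norm"])
        cur = attention_fn(cur, w, n_head, n_head_kv)
        ffn_inp = inpSA + cur                  # residual add, as usual
    elif n_head > 0 and n_head_kv == 0:
        # Linear attention layer: the only attention weight is an
        # {n_embd, n_embd} W_O matrix, so attention reduces to one mat_mul.
        cur = rms_norm(inpSA, w["attn_norm"])  # assumed: attn norm is kept
        cur = cur @ w["wo"]
        ffn_inp = inpSA + cur
    else:
        # Attention-free layer: no attn_norm, no attention weights, and no
        # residual add here; ffn_inp is simply the layer input (cur).
        ffn_inp = inpSA
    # ... the FFN block then starts from ffn_inp as usual ...
    return ffn_inp
```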
Changes to convert_hf_to_gguf.py are mainly to read the block_configs parameters at the init phase.
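For illustration, deriving the per-layer head counts from block_configs could look roughly like the sketch below; the field names no_op, replace_with_linear and n_heads_in_group are taken from my reading of the model's config.json and may not match the actual patch.

```python
def per_layer_head_counts(hparams: dict) -> tuple[list[int], list[int]]:
    # Map each entry of block_configs to (n_head, n_head_kv) so that the three
    # layer types above can be told apart at graph-build time.
    n_head = hparams["num_attention_heads"]
    n_head_list: list[int] = []
    n_head_kv_list: list[int] = []
    for block in hparams["block_configs"]:
        attn = block["attention"]
        if attn.get("no_op"):
            # Attention-free layer.
            n_head_list.append(0)
            n_head_kv_list.append(0)
        elif attn.get("replace_with_linear"):
            # Linear attention layer: heads exist, but no KV heads.
            n_head_list.append(n_head)
            n_head_kv_list.append(0)
        else:
            # Normal attention layer with a per-layer GQA group size.
            n_head_list.append(n_head)
            n_head_kv_list.append(n_head // attn["n_heads_in_group"])
    return n_head_list, n_head_kv_list
```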
One line is added to gguf-py/gguf/tensor_mapping.py to link this model's specific tensor name
"model.layers.{bid}.self_attn.linear_attn", # deci
to MODEL_TENSOR.ATTN_OUT.
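For context, a heavily abbreviated excerpt of where that entry would sit in TensorNameMap (the neighbouring mappings are omitted and may differ in the actual file):

```python
from gguf.constants import MODEL_TENSOR

block_mappings_cfg = {
    MODEL_TENSOR.ATTN_OUT: (
        "model.layers.{bid}.self_attn.o_proj",       # llama-hf
        "model.layers.{bid}.self_attn.linear_attn",  # deci (the new line)
    ),
}
```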
A more controversial change was made in gguf-py/gguf/vocab.py. The correct special tokens for this model should be:
- bos: 128000
- eos: 128001
- eom: 128008
- eot: 128009
However, line 2055 of the original tokenizer_config.json for this model has
"eos_token": "<|eot_id|>",
while config.json corrects this error with
"eos_token_id": [
128001,
128008,
128009
],
This works with transformers but not with llama.cpp, because the token_id override is not allowed, so I removed the two lines that disallow the override and added functionality to read the eos_token_id array.
I am not sure whether this modification to vocab.py could break other things. Maybe it is possible to do this exclusively in convert_hf_to_gguf.py without touching vocab.py?
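Whichever file ends up hosting it, the core of the added functionality is just normalising eos_token_id to a list; a minimal sketch (the function name is mine, not from the actual patch):

```python
def eos_ids_from_config(config: dict) -> list[int]:
    # config.json may give eos_token_id either as a single int or, as in this
    # model, as a list such as [128001, 128008, 128009]; normalise to a list.
    eos = config.get("eos_token_id")
    if eos is None:
        return []
    if isinstance(eos, (list, tuple)):
        return [int(t) for t in eos]
    return [int(eos)]
```

The first entry (128001) would then serve as the main eos token, with 128008 and 128009 matching the eom and eot tokens listed above.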