
Feature Request: Modified llama.cpp to support Llama-3_1-Nemotron-51B #10648

@ymcki

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I modified llama.cpp (b4139) to support Llama-3_1-Nemotron-51B:
https://github.com/ymcki/llama.cpp-b4139

GGUFs:
https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF

I tested the code with DeciLM-7B and Mistral-7B-v0.3 to ensure backward compatibility with llama-architecture models. It would be great if these modifications could be merged into the main branch.

Motivation

Llama-3_1-Nemotron-51B offers a significant reduction in model size compared to Llama-3.1-Nemotron-70B.

Possible Implementation

The changes to src/llama.cpp are minimal. The logic is to have three blocks of code in build_llama
to handle the three types of layers in Llama-3_1-Nemotron-51B (a small sketch of the branching logic follows the list):

  1. Normal attention layer - when n_head > 0 and n_head_kv > 0
  • Just like the original llama code, but with support for variable grouped-query attention; essentially n_head_kv is replaced with n_head_kv(il).
  2. Linear attention layer - when n_head > 0 and n_head_kv == 0
  • Based on my understanding of modeling_decilm.py and Grok's similar "linear attention", the linear attention weight is essentially a {n_embd,n_embd} W_O matrix, so all that is needed is to mat_mul it with cur.
  3. Attention-free layer - when n_head == 0 and n_head_kv == 0
  • No attn_norm or other attention weights. Also, instead of adding inpSA to cur after the self-attention block, ffn_inp should simply be set to cur in this case.
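
To make the branching conditions concrete, here is a minimal, self-contained Python sketch of the same dispatch. It is only an illustration of the conditions listed above; the actual change lives in the C++ build_llama in src/llama.cpp, and the function name here is made up for the sketch.

# Illustration only: the real dispatch is C++ code inside build_llama.
# Classify a layer by its per-layer head counts, mirroring the three cases above.
def classify_layer(n_head: int, n_head_kv: int) -> str:
    if n_head > 0 and n_head_kv > 0:
        return "attention"          # normal (grouped-query) attention layer
    if n_head > 0 and n_head_kv == 0:
        return "linear_attention"   # single {n_embd,n_embd} W_O mat_mul with cur
    if n_head == 0 and n_head_kv == 0:
        return "attention_free"     # no attn_norm/attn weights; ffn_inp = cur
    raise ValueError(f"unexpected head counts: n_head={n_head}, n_head_kv={n_head_kv}")

if __name__ == "__main__":
    # One hypothetical example of each layer type.
    for nh, nkv in [(64, 8), (64, 0), (0, 0)]:
        print(nh, nkv, "->", classify_layer(nh, nkv))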

The changes to convert_hf_to_gguf.py are mainly to read the block_configs parameters at the init phase.
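
For illustration, a rough, self-contained Python sketch of that kind of parsing is below. The block_configs field names used here (no_op, replace_with_linear, n_heads_in_group) are assumptions based on the DeciLM-style config.json; this is not a copy of the actual convert_hf_to_gguf.py change.

import json

# Sketch only: derive per-layer head counts from a DeciLM-style block_configs.
# The attention field names are assumptions about the model's config.json.
def per_layer_heads(config_path: str) -> tuple[list[int], list[int]]:
    with open(config_path) as f:
        cfg = json.load(f)

    n_head = cfg["num_attention_heads"]
    heads: list[int] = []
    kv_heads: list[int] = []
    for block in cfg["block_configs"]:
        attn = block["attention"]
        if attn.get("no_op"):                  # attention-free layer
            heads.append(0)
            kv_heads.append(0)
        elif attn.get("replace_with_linear"):  # linear attention layer
            heads.append(n_head)
            kv_heads.append(0)
        else:                                  # normal grouped-query attention
            heads.append(n_head)
            kv_heads.append(n_head // attn["n_heads_in_group"])
    return heads, kv_heads

The two lists can then be written out as per-layer head-count metadata so that n_head(il) and n_head_kv(il) are available for each layer.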

One line is added to gguf-py/gguf/tensor_mapping.py to map this model's specific tensor name
"model.layers.{bid}.self_attn.linear_attn", # deci
to MODEL_TENSOR.ATTN_OUT.
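
As a toy, standalone illustration of the "{bid}" name templating that this mapping performs (the GGUF-side name blk.{bid}.attn_output is the attention output tensor; the dictionary and helper below are hypothetical, not the gguf-py API):

# Hypothetical mini version of the name templating done by tensor_mapping.py.
TENSOR_MAP = {
    "model.layers.{bid}.self_attn.linear_attn": "blk.{bid}.attn_output",  # deci
}

def map_tensor(name: str, n_blocks: int) -> str | None:
    # Expand the {bid} placeholder for every block and look for a match.
    for bid in range(n_blocks):
        for src, dst in TENSOR_MAP.items():
            if name == src.format(bid=bid):
                return dst.format(bid=bid)
    return None

print(map_tensor("model.layers.3.self_attn.linear_attn", 80))  # -> blk.3.attn_output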

A more controversial change was made in gguf-py/gguf/vocab.py.
The correct special tokens for this model should be

bos 128000
eos 128001
eom 128008
eot 128009

However, line 2055 of the original tokenizer_config.json for this model has

"eos_token": "<|eot_id|>",

while config.json corrects this error with

"eos_token_id": [
    128001,
    128008,
    128009
  ],

This seems to work with transformers but not with llama.cpp, because overriding a token id is not allowed there, so I removed the two lines that disallow the override and added functionality to read the eos_token_id array.
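
The shape of that added logic is roughly the following standalone Python sketch (an illustration, not the actual vocab.py code; treating the first id as eos and the remaining ids as additional end-of-generation tokens is an assumption here):

import json

# Sketch only: normalize eos_token_id from config.json, which may be either a
# single int or a list of ints, into a primary eos id plus additional ids.
def read_eos_ids(config_path: str) -> tuple[int, list[int]]:
    with open(config_path) as f:
        cfg = json.load(f)

    eos = cfg.get("eos_token_id")
    if isinstance(eos, int):
        return eos, []
    if isinstance(eos, list) and eos and all(isinstance(t, int) for t in eos):
        return eos[0], eos[1:]   # e.g. 128001 as eos, 128008/128009 as extras
    raise ValueError("config.json has no usable eos_token_id")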

I am not sure whether this modification to vocab.py can break other things. Maybe it is possible to do this exclusively in convert_hf_to_gguf.py without touching vocab.py?
