Description
This PR adds support for converting float16/float32 GGUF models to optimized and quantized ONNX models via the model builder tool.
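For context, driving the model builder with a GGUF checkpoint might look roughly like the sketch below. The builder script name and the flag used to pass the .gguf file are assumptions rather than the tool's confirmed CLI; consult the builder's --help output for the actual options.

```python
# Hypothetical sketch of invoking the model builder on a float16 GGUF checkpoint.
# The flag names (especially the one pointing at the .gguf file) are assumptions.
import subprocess

subprocess.run(
    [
        "python", "builder.py",
        "-m", "mistralai/Mistral-7B-v0.1",   # original Hugging Face model id
        "-i", "mistral-7b.fp16.gguf",        # assumed flag for the GGUF input file
        "-o", "mistral_int4_cpu",            # output folder for the ONNX model
        "-p", "int4",                        # target precision (quantized)
        "-e", "cpu",                         # execution provider
    ],
    check=True,
)
```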
Motivation and Context
GGUF is a popular file format used in the llama.cpp project. The project has multiple scripts to convert models to GGUF (convert.py, convert-hf-to-gguf.py, convert-llama-ggml-to-gguf.py, etc.). Each conversion script applies to specific model architectures only. For the architectures currently supported in the model builder tool, the corresponding conversion scripts are listed below.
| Architecture | Conversion script |
|--------------|-------------------|
| LLaMA | convert.py |
| Mistral | convert.py |
| Phi-2 | convert-hf-to-gguf.py |
| Gemma | convert-hf-to-gguf.py |
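For reference, producing float16 GGUF files with these scripts typically looks something like the sketch below; the exact flags (--outtype, --outfile) should be checked against the llama.cpp version in use.

```python
# Rough sketch of running llama.cpp's conversion scripts to produce f16 GGUF files.
# Flag names are based on common llama.cpp usage and may differ between versions.
import subprocess

# LLaMA-style checkpoints go through convert.py (which permutes Q/K weights, see below).
subprocess.run(
    ["python", "convert.py", "path/to/llama-hf-model",
     "--outtype", "f16", "--outfile", "llama.f16.gguf"],
    check=True,
)

# Other supported architectures go through convert-hf-to-gguf.py (original weight order).
subprocess.run(
    ["python", "convert-hf-to-gguf.py", "path/to/gemma-hf-model",
     "--outtype", "f16", "--outfile", "gemma.f16.gguf"],
    check=True,
)
```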
Depending on the conversion script, the weights are also stored differently:

- convert.py permutes the Q projection and K projection weights before storing them (see the sketch after this list).
- convert-hf-to-gguf.py stores the weights in their original order.

New model architectures added to the project appear to use convert-hf-to-gguf.py for conversion now.
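To illustrate the permutation that convert.py applies (and that therefore has to be undone when reading Q/K weights back out of such a GGUF file), here is a minimal sketch modeled on the permute helper in llama.cpp's convert.py; the unpermute function is an assumed inverse for illustration, not code taken from this PR.

```python
import numpy as np

def permute(weights, n_head, n_head_kv=None):
    # Mirrors the Q/K permutation applied by llama.cpp's convert.py before
    # the weights are written to the GGUF file.
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

def unpermute(weights, n_head, n_head_kv=None):
    # Assumed inverse: restores the original Hugging Face ordering so the
    # GGUF weights match what the PyTorch checkpoint stores.
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    return (weights.reshape(n_head, weights.shape[0] // n_head // 2, 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

# Quick sanity check on a toy Q projection (8 heads, hidden size 32).
q = np.random.rand(32, 32).astype(np.float32)
assert np.array_equal(unpermute(permute(q, 8), 8), q)
```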
Notes About Gemma Models

There are two ways to obtain GGUF versions of Gemma: 1) download the PyTorch model from Hugging Face and convert it with convert-hf-to-gguf.py, or 2) download Google's released GGUF versions from Hugging Face.

Converting Gemma from Hugging Face to GGUF
For the Gemma GGUF models created through conversion, a parity mismatch was discovered in the LayerNorm weights when comparing the converted GGUF models against the PyTorch models on Hugging Face. For more details on this error and the fix for the parity mismatch, please refer to this PR in the llama.cpp project. Users should re-run convert-hf-to-gguf.py to obtain the correct LayerNorm weights in their Gemma GGUF models.
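One way such a mismatch can surface is by diffing the norm weights in the GGUF file against the Hugging Face checkpoint. The sketch below is illustrative only: the file path, model id, and llama.cpp-style tensor names are assumptions, and some Gemma converters may bake a +1 offset into the norm weights, so a uniform offset is not by itself a parity error.

```python
# Illustrative parity check between GGUF norm weights and the Hugging Face
# checkpoint. File path, model id, and tensor names are assumptions.
import numpy as np
import torch
from gguf import GGUFReader
from transformers import AutoModelForCausalLM

reader = GGUFReader("gemma-2b.f16.gguf")                  # assumed converted GGUF file
gguf_tensors = {t.name: np.asarray(t.data) for t in reader.tensors}

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.float32)
hf_tensors = {name: param.detach().numpy() for name, param in model.named_parameters()}

# Compare the first decoder layer's input LayerNorm (llama.cpp-style name on the left).
gguf_norm = gguf_tensors["blk.0.attn_norm.weight"].astype(np.float32)
hf_norm = hf_tensors["model.layers.0.input_layernorm.weight"]

# Note: if the converter stores the norm weight with a +1 baked in, a constant
# offset of 1.0 here would be expected rather than a mismatch.
print("max abs diff:", np.abs(gguf_norm - hf_norm).max())
```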
Released GGUF Versions of Gemma

The Gemma GGUF models released on Hugging Face have a vocab size of 256128, which matches the vocab size specified in the official paper. However, the Gemma PyTorch models released on Hugging Face have a vocab size of 256000.
This difference affects the size of the embeddings. Upon further examination, the embeddings in the released GGUF models are padded. When the padding is removed, the embeddings in both the released GGUF models and the released PyTorch models have the same size and have parity.
It is possible that the released GGUF models were converted from internal checkpoints instead of the released PyTorch checkpoints. This could explain why the embeddings have different sizes and why there are still some parity mismatches in other weights between the released GGUF models and the released PyTorch models.
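As a toy illustration of the padding (256128 - 256000 = 128 extra rows), the sketch below uses small stand-in matrices rather than the real checkpoints; the hidden size is deliberately tiny to keep the example cheap to run.

```python
import numpy as np

hf_vocab, gguf_vocab, hidden = 256000, 256128, 8   # toy hidden size, real vocab sizes

# Stand-ins for the real embedding matrices: the GGUF one carries 128 padded rows
# appended to the same values the PyTorch embedding holds.
hf_embed = np.random.rand(hf_vocab, hidden).astype(np.float32)
gguf_embed = np.vstack([hf_embed, np.zeros((gguf_vocab - hf_vocab, hidden), np.float32)])

# Trimming the padded rows recovers a matrix with the same shape and values
# as the PyTorch embedding.
assert gguf_embed.shape[0] - hf_embed.shape[0] == 128
assert np.array_equal(gguf_embed[:hf_vocab], hf_embed)
```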