Add GGUF to model builder tool #138

Merged: 3 commits into main on Mar 1, 2024

Conversation

kunal-vaishnavi (Contributor)

Description

This PR adds support for converting float16/float32 GGUF models to optimized and quantized ONNX models via the model builder tool.
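
For context, the GGUF tensors that the builder consumes can be inspected with the `gguf` Python package published from the llama.cpp repo. A minimal sketch (illustrative only, not this PR's implementation; the file path is hypothetical):

```python
from gguf import GGUFReader  # pip install gguf (published from the llama.cpp repo)

# Illustrative only: enumerate the tensors in a float16/float32 GGUF file.
# Weights like these are what the model builder maps onto ONNX initializers.
reader = GGUFReader("model.gguf")  # hypothetical path
for tensor in reader.tensors:
    # tensor.data is a NumPy view over the memory-mapped file
    print(tensor.name, tensor.tensor_type.name, tuple(tensor.shape))
```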

Motivation and Context

GGUF is a popular file format used in the llama.cpp project. The project provides multiple scripts for converting models to GGUF (convert.py, convert-hf-to-gguf.py, convert-llama-ggml-to-gguf.py, etc.).

Each conversion script applies only to specific model architectures. For the architectures currently supported in the model builder tool, the corresponding conversion scripts are listed below, with example invocations sketched after the list.

  • LLaMA: convert.py
  • Mistral: convert.py
  • Phi-2: convert-hf-to-gguf.py
  • Gemma: convert-hf-to-gguf.py
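
Hedged example invocations for both scripts (the `--outtype`/`--outfile` flags reflect the scripts' usage at the time of writing; check each script's `--help` before relying on them):

```sh
# LLaMA and Mistral use convert.py
python convert.py ./llama-2-7b --outtype f16 --outfile llama-2-7b-f16.gguf

# Phi-2 and Gemma use convert-hf-to-gguf.py
python convert-hf-to-gguf.py ./phi-2 --outtype f16 --outfile phi-2-f16.gguf
```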

Depending on which conversion script is used, the weights are also stored differently (one well-known example is sketched below).

New model architectures added to the project now appear to use convert-hf-to-gguf.py for conversion.
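
As a sketch of one such layout difference (an illustration under assumed shapes, not the builder's code): convert.py permutes the rows of the LLaMA/Mistral Q/K projection weights for its rotary-embedding layout, so a consumer expecting the Hugging Face layout has to undo the permutation.

```python
import numpy as np

# Sketch: invert the per-head row interleaving that convert.py applies to the
# Q/K projection weights (shapes are assumptions for illustration).
def unpermute(weights: np.ndarray, n_heads: int) -> np.ndarray:
    dim0, dim1 = weights.shape
    return (
        weights.reshape(n_heads, dim0 // n_heads // 2, 2, dim1)
        .swapaxes(1, 2)
        .reshape(dim0, dim1)
    )
```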

Notes About Gemma Models

There are two ways to obtain GGUF versions of Gemma: 1) download the PyTorch model from Hugging Face and convert it with convert-hf-to-gguf.py, or 2) download Google's released GGUF versions from Hugging Face.

Converting Gemma from Hugging Face to GGUF

For the Gemma GGUF models created from conversion, a parity mismatch was discovered in the LayerNorm weights when comparing the converted GGUF models and the PyTorch models in Hugging Face. For more details on this error and the fix for the parity mismatch, please refer to this PR in the llama.cpp project.

Users should re-run convert-hf-to-gguf.py to regenerate Gemma GGUF models with the correct LayerNorm weights.
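
A hedged sketch of the root cause, following the published Gemma reference implementation rather than llama.cpp's exact code: Gemma's RMSNorm scales by (1 + weight), so a conversion that copies the stored weight tensor verbatim drops the +1 offset.

```python
import torch

# Gemma's RMSNorm applies (1 + weight) as the scale. A converter that copies
# `weight` verbatim into a format whose runtime multiplies by `weight` alone
# loses the +1 offset, producing the parity mismatch described above.
def gemma_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x * (1.0 + weight)

# The conversion-side fix is to fold the offset into the stored tensor:
#     stored_weight = hf_weight + 1.0
```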

Released GGUF Versions of Gemma

The Gemma GGUF models released on Hugging Face have a vocab size of 256128, which matches the vocab size specified in the official paper. However, the Gemma PyTorch models released on Hugging Face have a vocab size of 256000.

This difference affects the size of the embeddings. Upon further examination, the embeddings in the released GGUF models turn out to be padded: once the padding is removed, the embeddings in the released GGUF models and the released PyTorch models match in both size and values.
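
A quick way to verify this, as a sketch that assumes both embedding matrices have already been loaded as float32 NumPy arrays from the respective checkpoints:

```python
import numpy as np

# gguf_embed: (256128, hidden_size) from the released GGUF model.
# pt_embed:   (256000, hidden_size) from the released PyTorch model.
def embeddings_match(gguf_embed: np.ndarray, pt_embed: np.ndarray) -> bool:
    trimmed = gguf_embed[: pt_embed.shape[0], :]  # drop the padded rows
    return trimmed.shape == pt_embed.shape and bool(np.allclose(trimmed, pt_embed, atol=1e-5))
```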

It is possible that the released GGUF models were converted from internal checkpoints instead of the released PyTorch checkpoints. This could explain why the embeddings have different sizes and why there are still some parity mismatches in other weights between the released GGUF models and the released PyTorch models.

kunal-vaishnavi merged commit adec01f into main on Mar 1, 2024.
11 checks passed.
kunal-vaishnavi deleted the kvaishnavi/gguf branch on March 1, 2024 at 22:39.