Fix Gemma parity issue #5810
Merged
Conversation
ggerganov approved these changes on Mar 1, 2024
kunal-vaishnavi added a commit to microsoft/onnxruntime-genai that referenced this pull request on Mar 1, 2024
### Description

This PR adds support for converting float16/float32 GGUF models to optimized and quantized ONNX models via the model builder tool.

### Motivation and Context

[GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) is a popular file format used in the [`llama.cpp`](https://github.com/ggerganov/llama.cpp) project. The project has multiple scripts to convert models to GGUF ([`convert.py`](https://github.com/ggerganov/llama.cpp/blob/master/convert.py), [`convert-hf-to-gguf.py`](https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py), [`convert-llama-ggml-to-gguf.py`](https://github.com/ggerganov/llama.cpp/blob/master/convert-llama-ggml-to-gguf.py), etc.). Each conversion script applies to specific model architectures only. For the architectures currently supported in the model builder tool, the corresponding conversion scripts are:

- LLaMA: `convert.py`
- Mistral: `convert.py`
- Phi-2: `convert-hf-to-gguf.py`
- Gemma: `convert-hf-to-gguf.py`

Depending on the conversion script, the weights are also stored differently:

- `convert.py` [permutes](https://github.com/ggerganov/llama.cpp/blob/d5ab29757ebc59a30f03e408294ec20628a6374e/convert.py#L565) the [Q projection and K projection weights](https://github.com/ggerganov/llama.cpp/blob/d5ab29757ebc59a30f03e408294ec20628a6374e/convert.py#L1186-L1187) before storing them (see the sketch after this description)
- `convert-hf-to-gguf.py` stores the weights in their [original order](https://github.com/ggerganov/llama.cpp/blob/c29af7e2252d288f2ea58a7d437c1cb7c0abf160/gguf-py/gguf/gguf_writer.py#L244)

New model architectures added to the project now appear to use `convert-hf-to-gguf.py` for conversion.

### Notes About Gemma Models

There are two ways to obtain GGUF versions of Gemma: 1) download the PyTorch model from Hugging Face and use `convert-hf-to-gguf.py` to convert it, or 2) download Google's released GGUF versions from Hugging Face.

#### Converting Gemma from Hugging Face to GGUF

For the Gemma GGUF models created by conversion, a parity mismatch was discovered in the LayerNorm weights when comparing the converted GGUF models with the PyTorch models on Hugging Face. For more details on this error and the fix for the parity mismatch, please refer to [this PR](ggerganov/llama.cpp#5810) in the `llama.cpp` project. Users should run `convert-hf-to-gguf.py` again to obtain the right LayerNorm weights in the Gemma GGUF models.

#### Released GGUF Versions of Gemma

The Gemma GGUF models released on Hugging Face have a vocab size of 256128, which matches the vocab size specified in the [official paper](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf). However, the Gemma PyTorch models released on Hugging Face have a [vocab size of 256000](https://huggingface.co/google/gemma-2b/blob/9d067f00def958594aaa16b39a65b07d69ca655b/config.json#L26). This difference affects the size of the embeddings. Upon further examination, the embeddings in the released GGUF models are padded. When the padding is removed, the embeddings in both the released GGUF models and the released PyTorch models have the same size and have parity. It is possible that the released GGUF models were converted from internal checkpoints instead of the released PyTorch checkpoints. This could explain why the embeddings have different sizes and why there are still some parity mismatches in other weights between the released GGUF models and the released PyTorch models.
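For reference, below is a minimal sketch of the Q/K permutation mentioned above, adapted from `llama.cpp`'s `convert.py`; the function signature and names are illustrative rather than a verbatim copy of the pinned revision.

```python
import numpy as np

def permute(weights: np.ndarray, n_head: int) -> np.ndarray:
    """Reorder the rows of a Q/K projection weight so that the two halves of
    each head's rotary-embedding dimensions are interleaved the way llama.cpp
    expects (a sketch adapted from convert.py, not the verbatim source)."""
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))
```

`convert-hf-to-gguf.py` skips this step and writes the projection weights in their original Hugging Face order.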
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request on Mar 10, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request on Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request on Apr 1, 2024
Description
This PR fixes a parity issue with Google's Gemma models by moving the addition of the unit offset so that it happens after the dtype conversion.
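A minimal sketch of the reordering in the Gemma tensor loop of `convert-hf-to-gguf.py`; the variable names follow the script, but this is illustrative rather than a verbatim diff:

```python
for name, data_torch in self.get_tensors():
    # convert unsupported dtypes (e.g. torch.bfloat16) to float32 first ...
    if data_torch.dtype not in (torch.float16, torch.float32):
        data_torch = data_torch.to(torch.float32)

    # ... and only then add the unit offset that Gemma folds into its
    # norm weights, so the addition happens at float32 precision
    if name.endswith("norm.weight"):
        data_torch = data_torch + 1
```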
Motivation and Context
The Gemma models from Hugging Face are loaded with `torch.bfloat16` precision by default. When a unit add is performed on a `torch.bfloat16` tensor, the value returned is 4.1250 instead of 4.1406, and it stays at this value even when the result is later converted to `torch.float16` or `torch.float32`. If the unit add is performed after the dtype conversion, the expected value is returned.
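A minimal sketch that reproduces this rounding behavior; the weight value 3.140625 is a hypothetical example chosen so that the 4.1250 vs. 4.1406 difference shows up:

```python
import torch

w = torch.tensor(3.140625, dtype=torch.bfloat16)  # hypothetical norm weight value

# unit add in bfloat16, then convert: 4.140625 is not representable in
# bfloat16, so the sum rounds to 4.1250 and the error survives the cast
before_fix = (w + 1).to(torch.float16)   # tensor(4.1250, dtype=torch.float16)

# convert first, then add: float16 can represent 4.140625 exactly
after_fix = w.to(torch.float16) + 1      # tensor(4.1406, dtype=torch.float16)

print(before_fix.item(), after_fix.item())
```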
When comparing the LayerNorm weights from the Hugging Face Gemma 2B model and the GGUF Gemma 2B model produced by `convert-hf-to-gguf.py` before this change, the tensor values are different. Each compared tensor is of size 2048 and in `float16` precision. After converting the GGUF model to ONNX and running a parity test with ONNX Runtime, ORT reports a parity mismatch for both prompt processing and token generation.
When comparing the LayerNorm weights from the Hugging Face Gemma 2B model and the GGUF Gemma 2B model produced by `convert-hf-to-gguf.py` after this change, the tensor values match. Each compared tensor is of size 2048 and in `float16` precision. After converting the new GGUF model to ONNX and running the same parity test with ONNX Runtime, ORT reports that parity is achieved.
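For completeness, a minimal sketch of how such a weight comparison can be reproduced, assuming gguf-py's `GGUFReader` and Hugging Face `transformers`; the GGUF file path and the layer/tensor names are illustrative:

```python
import numpy as np
import torch
from gguf import GGUFReader
from transformers import AutoModelForCausalLM

# LayerNorm weight of the first decoder layer from the Hugging Face checkpoint.
# Gemma's norm computes x * (1 + weight), so the GGUF file stores weight + 1.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.bfloat16)
hf_norm = model.model.layers[0].input_layernorm.weight.detach().float().numpy() + 1

# Corresponding tensor from the converted GGUF file (path is a placeholder).
reader = GGUFReader("gemma-2b-f16.gguf")
gguf_norm = next(np.asarray(t.data, dtype=np.float32)
                 for t in reader.tensors if t.name == "blk.0.attn_norm.weight")

print("max abs diff:", np.abs(hf_norm - gguf_norm).max())
```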