ggml-org · ngxson · Jun 5, 2026 · Jun 4, 2026 · Jun 5, 2026 · Jun 5, 2026
@@ -5,62 +5,74 @@ Quantization reduces the precision of model weights (e.g., from 32-bit floats to
 This process however, may introduce some accuracy loss which is usually measured in [Perplexity](https://huggingface.co/docs/transformers/en/perplexity) (ppl) and/or [Kullback–Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) (kld).
 This can be minimized by using a suitable imatrix file.
 
-You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup.
+You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup. It syncs from llama.cpp `main` every 6 hours.
 
-Note: It is synced from llama.cpp `main` every 6 hours.
+## Overview
 
-Example usage:
+Quantization is done in two phases:
+- Convert the original model to GGUF format.
+- Quantize the converted GGUF file.
 
-```./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]```
+If the model supports multimodal inputs (images or audio), you also need to convert and quantize the multimodal encoders and projectors.
 
-```bash
-# from Hugginface, obtain the official meta-llama/Llama-3.1-8B model weights and place them in ./models
-ls ./models
-config.json             model-00001-of-00004.safetensors  model-00004-of-00004.safetensors  README.md                tokenizer.json
-generation_config.json  model-00002-of-00004.safetensors  model.safetensors.index.json      special_tokens_map.json  USE_POLICY.md
-LICENSE                 model-00003-of-00004.safetensors  original                          tokenizer_config.json
-
-# [Optional] for PyTorch .bin models like Mistral-7B
-ls ./models
-<folder containing weights and tokenizer json>
+## Prepare the input GGUF file
 
-# install Python dependencies
-python3 -m pip install -r requirements.txt
+To convert a model from a Hugging Face repo, you can use a command like the following:
 
-# convert the model to ggml FP16 format
-python3 convert_hf_to_gguf.py ./models/mymodel/
+```
+python convert_hf_to_gguf.py --outfile gemma-4-E2B-it-bf16.gguf --outtype bf16 --remote google/gemma-4-E2B-it
+```
 
-# quantize the model to 4-bits (using Q4_K_M method)
-./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
+Notes:
+- In the usual case where the model is distributed in 16-bit format, `--outtype auto` also works well.
+- If you have already downloaded the model, specify its local directory and remove the `--remote` flag.
 
-# update the gguf filetype to current version if older version is now unsupported
-./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
-```
+## Quantize the GGUF
 
-Run the quantized model:
+After you have created a high-quality GGUF version of the model, use `llama-quantize` to quantize it. For example, to quantize to `Q4_K_M`, use a command like the following:
 
 ```bash
-# start inference on a gguf model
-./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"
+./build/bin/llama-quantize gemma-4-E2B-it-bf16.gguf gemma-4-E2B-it-Q4_K_M.gguf Q4_K_M
 ```
 
+Various quantization methods are described [later in this document](#quantize).
+
 Options:
-* `--allow-requantize` allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
-* `--leave-output-tensor` will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
-* `--pure` disables k-quant mixtures and quantizes all tensors to the same type
-* `--imatrix` uses data in file generated by `llama-imatrix` as importance matrix for quant optimizations (highly recommended)
-* `--include-weights` use an importance matrix for tensor(s) in the list. Cannot be used with `--exclude-weights`
-* `--exclude-weights` use an importance matrix for tensor(s) in the list. Cannot be used with `--include-weights`
+* `--allow-requantize` allow requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
+* `--leave-output-tensor` leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
+* `--pure` disable k-quant mixtures and quantize all tensors to the same type
+* `--imatrix file_name` use data in file_name as importance matrix for quant optimizations
+* `--include-weights tensor_name` use importance matrix for these tensors
+* `--exclude-weights tensor_names` use importance matrix for tensors not in this list
 * `--output-tensor-type` use a specific quant type for the output.weight tensor
 * `--token-embedding-type` use a specific quant type for the token embeddings tensor
-* `--keep-split` will generate the quantized model in the same shards as the input file otherwise it will produce a single quantized file
+* `--keep-split` generate the quantized model in the same shards as the input file instead of a single quantized file
 
 Advanced options:
 * `--tensor-type` quantize specific tensor(s) to specific quant types. Supports regex syntax. May be specified multiple times.
 * `--prune-layers` prune (remove) the layers in the list
-* `--override-kv` option to override model metadata by key in the quantized model. May be specified multiple times
+* `--override-kv` option to override model metadata by key in the quantized model. May be specified multiple times.
+
+## (Optional) Convert the multimodal components
+
+llama.cpp will convert the LLM portion of the source model, which is enough for text-only conversational applications. If the model accepts multimodal inputs and you wish to take advantage of them, you need to create a separate GGUF file. This file is generically known as `mmproj`, for "multimodal projector"; however, it may contain other components, such as vision or audio encoders, in addition to projectors.
+
+Multimodal components are usually much smaller than the LLMs they come with. In addition, their quality has a direct impact on the quality of LLM generations, because these components are in charge of preparing the inputs for the LLM: the closer inputs are to data seen during training, the better LLM results will be.
+
+For these reasons, multimodal components are usually kept in a high-quality format such as `bf16` or `q8_0`. A smaller quant saves very little speed or memory, but it can reduce overall quality.
+
+```bash
+python convert_hf_to_gguf.py --mmproj --outfile mmproj-gemma-4-E2B-it-Q8_0.gguf --outtype q8_0 --remote google/gemma-4-E2B-it
+```
+
+## Run the quantized model
+
+```bash
+./build/bin/llama cli -m ./gemma-4-E2B-it-Q4_K_M.gguf --mmproj ./mmproj-gemma-4-E2B-it-Q8_0.gguf --image <input_image> --prompt "Describe this image"
+```
 
-Examples:
+## Quantization Examples
 
 ```bash
 # naive Q4_K_M quantization using default settings and 8 CPU threads. Output will be "ggml-model-Q4_K_M.gguf"