llama-quant: add support for mmproj #16592
Merged
Conversation
ggerganov approved these changes on Oct 15, 2025
CISC approved these changes on Oct 15, 2025
CISC reviewed on Oct 15, 2025
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request on Oct 15, 2025:

* llama-quant: add support for mmproj
* Update src/llama.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* check prefix instead
* small fix

Co-authored-by: Georgi Gerganov <[email protected]>
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Oct 15, 2025, merging origin/master:

* Add server-driven parameter defaults and syncing (ggml-org#16515)
* metal: optimise `GGML_OP_SUM` (ggml-org#16559)
* server : fix img token logs (ggml-org#16595)
* llama-quant: add support for mmproj (ggml-org#16592)
* CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
* server : fix mtmd checkpoints (ggml-org#16591)
* metal : avoid using Metal's gpuAddress property (ggml-org#16576)
* vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
* CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
* vulkan: Support FA with K/V in F32 (ggml-org#16543)
* vulkan: Improve build time for MSVC (ggml-org#16545)
* CUDA: enable FA for FP32 KV cache (ggml-org#16546)
* CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
* CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
* cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
* server : dynamic token limit for prompt cache (ggml-org#16560)
FMayran pushed a commit to FMayran/llama.cpp that referenced this pull request on Oct 23, 2025:

* llama-quant: add support for mmproj
* Update src/llama.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* check prefix instead
* small fix

Co-authored-by: Georgi Gerganov <[email protected]>
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request on Oct 23, 2025:

* llama-quant: add support for mmproj
* Update src/llama.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* check prefix instead
* small fix

Co-authored-by: Georgi Gerganov <[email protected]>
This PR allows `llama-quantize` to work with mmproj files. It should allow quantizing the mmproj to Qx_K and Qx_0 variants (no imatrix), reducing memory usage on mobile deployments. Tested with https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct.
To quantize the mmproj:
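The exact command isn't preserved in this copy of the PR; the following is a minimal sketch of how `llama-quantize` is typically invoked, assuming an F16 projector file named `mmproj-model-f16.gguf` and a Q8_0 target (file names and type are placeholders, not taken from the PR):

```sh
# Quantize the multimodal projector GGUF to Q8_0
# (input path, output path, and target type are placeholders)
./llama-quantize mmproj-model-f16.gguf mmproj-model-q8_0.gguf Q8_0
```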
Then, use it as usual:
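For example, a hedged sketch using `llama-mtmd-cli` with hypothetical file names; the quantized projector is passed via `--mmproj` exactly like an unquantized one:

```sh
# Run the multimodal CLI with the quantized projector
# (model, projector, and image paths are placeholders)
./llama-mtmd-cli -m Qwen2-VL-7B-Instruct-Q4_K_M.gguf \
    --mmproj mmproj-model-q8_0.gguf \
    --image input.jpg \
    -p "Describe this image."
```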
Ref discussion: #15453