
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17453

Allow quantizing LoRA at conversion again, but default to F32 (as has been the norm since #8980 inadvertently forced this).

Fixes #17447
Fixes #10671


loci-review bot commented Nov 23, 2025

Explore the complete analysis inside the Version Insights

Pull Request Performance Summary

PR #296: UPSTREAM PR #17453 - Allow Quantizing LoRA Again


Assessment

This PR modifies the Python conversion scripts (convert_hf_to_gguf.py and convert_lora_to_gguf.py) to restore LoRA quantization and to change the default output format for LoRA conversion from F16 to F32.

Performance Impact: No changes detected. Static analysis shows < 0.001% power consumption variation across all 16 binaries. No functions exhibit measurable response time or throughput changes between versions. The modifications affect only the model conversion pipeline, not the compiled inference runtime.


Code Changes

1. LoRA Quantization Re-enabled (convert_hf_to_gguf.py:568)

  • Changed the tensor suffix check from not new_name.endswith(".weight") to new_name[-7:] not in (".weight", ".lora_a", ".lora_b")
  • Allows .lora_a and .lora_b tensors to be quantized during conversion (see the sketch below)
  • Fixes the regression that forced LoRA adapters to F32 only
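
A minimal sketch of the revised check, assuming the surrounding tensor-preparation logic; the helper name pick_qtype and its arguments are illustrative, not the exact upstream code:

```python
import gguf

def pick_qtype(new_name: str, n_dims: int, requested_qtype):
    # 1-D tensors (norms, biases) stay in F32 regardless of the requested type.
    if n_dims <= 1:
        return gguf.GGMLQuantizationType.F32
    # Old guard: `not new_name.endswith(".weight")` forced every .lora_a/.lora_b
    # tensor to F32. The new guard also admits the LoRA suffixes, so adapter
    # matrices can be quantized like ordinary .weight tensors.
    if new_name[-7:] not in (".weight", ".lora_a", ".lora_b"):
        return gguf.GGMLQuantizationType.F32
    return requested_qtype
```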

2. Default Output Format (convert_lora_to_gguf.py:245)

  • Changed the default from "f16" to "f32" for LoRA conversion
  • Prioritizes adapter accuracy over file size
  • Users can pass --outtype f16 explicitly for smaller files (see the sketch below)
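
A sketch of what the argument definition looks like after the change; the choices list and help text are approximated, not quoted from the diff — only the default value actually moves:

```python
parser.add_argument(
    "--outtype", type=str, default="f32",  # previously "f16"
    choices=["f32", "f16", "bf16", "q8_0", "auto"],
    help="output format; f32 keeps full adapter precision, f16 roughly halves file size",
)
```

In practice, running convert_lora_to_gguf.py with default arguments now writes an F32 adapter GGUF; passing --outtype f16 restores the previous behavior.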

Impact Analysis

Runtime Performance: None. These are conversion-time scripts, not part of the compiled binaries analyzed. The inference engine (llama_decode, llama_tokenize, ggml_backend_graph_compute) remains unchanged.

Binary Analysis: All 16 binaries (including build.bin.libllama.so, build.bin.libggml-base.so) show identical power consumption profiles between versions.

User Impact:

  • F32 LoRA adapters occupy twice the memory of F16 (typically <100 MB extra; see the rough estimate below)
  • No impact on tokens per second or inference latency
  • Workflows that rely on the default --outtype will now produce larger adapter files
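
A back-of-the-envelope estimate of the size difference; the 40M-parameter adapter is a hypothetical example, not a figure taken from this PR:

```python
n_params = 40_000_000            # hypothetical adapter size
mb = 1_000_000
size_f16 = n_params * 2 / mb     # 2 bytes per value -> ~80 MB
size_f32 = n_params * 4 / mb     # 4 bytes per value -> ~160 MB
print(f"F16 ~{size_f16:.0f} MB, F32 ~{size_f32:.0f} MB, delta ~{size_f32 - size_f16:.0f} MB")
```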

Correctness: The logic is sound. The [-7:] slice works for every tensor name because all three suffixes (.weight, .lora_a, .lora_b) are exactly seven characters, and names shorter than seven characters simply fail to match (demonstrated below). The conservative F32 default prevents potential accuracy degradation.
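
A quick check of that claim; the tensor names here are illustrative only:

```python
names = [
    "blk.0.attn_q.weight",         # regular weight        -> quantizable
    "blk.0.attn_q.weight.lora_a",  # LoRA A matrix         -> now quantizable
    "blk.0.attn_q.weight.lora_b",  # LoRA B matrix         -> now quantizable
    "blk.0.attn_norm.bias",        # anything else         -> forced to F32
    "bias",                        # shorter than 7 chars  -> slice returns "bias", no match
]
for n in names:
    print(f"{n:32s} quantizable={n[-7:] in ('.weight', '.lora_a', '.lora_b')}")
```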


Conclusion

No performance-related concerns. Changes are limited to conversion tooling with no runtime impact. The PR successfully restores LoRA quantization while maintaining conservative defaults.
