Skip to content

UPSTREAM PR #17744: model: add llama 4 scaling for mistral-large (deepseek arch)#423

Open
loci-dev wants to merge 1 commit intomainfrom
upstream-PR17744-branch_ngxson-xsn/mistral_large_scaling
Open

UPSTREAM PR #17744: model: add llama 4 scaling for mistral-large (deepseek arch)#423
loci-dev wants to merge 1 commit intomainfrom
upstream-PR17744-branch_ngxson-xsn/mistral_large_scaling

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Dec 3, 2025

Mirrored from ggml-org/llama.cpp#17744

Cont ggml-org/llama.cpp#17730

This should allow Mistral Large to go past 16K context length (hopefully, someone with enough VRAM can verify if this works or not)

@loci-review
Copy link

loci-review bot commented Dec 3, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #423

Overview

PR #423 adds Llama 4 attention temperature scaling support for Mistral Large models using the DeepSeek2 architecture. The changes enable context lengths beyond 16K by implementing optional temperature tuning parameters.

Modified Files: 2 files (src/llama-model.cpp, src/models/deepseek2.cpp)
Lines Changed: +22 additions, 0 deletions
Performance Impact: Negligible

Code Changes Analysis

1. Model Loading (llama-model.cpp)

Added two optional hyperparameter loads:

  • LLM_KV_ATTENTION_TEMPERATURE_SCALEhparams.f_attn_temp_scale
  • LLM_KV_ATTENTION_TEMPERATURE_LENGTHhparams.n_attn_temp_floor_scale

These parameters are loaded conditionally during model initialization for DeepSeek2-based architectures.

2. Inference Graph Construction (deepseek2.cpp)

Added conditional attention scaling logic:

  • Creates inp_attn_scale tensor when f_attn_temp_scale != 0.0f
  • Applies scaling to query tensors (Qcur) via ggml_mul operation in two attention paths:
    • MLA with absorption optimization (MQA path)
    • MLA without absorption optimization (MHA path)

Performance Metrics

Power Consumption:

  • libllama.so: +0.078% (+152 nJ absolute)
  • All other binaries: No change

Function-Level Changes:
The observed performance variations are in STL template functions unrelated to this PR:

  • __iter_equals_val<char>: +137 ns throughput
  • vector::end: +135 ns throughput
  • _Hash_code_base::_M_bucket_index: -29 ns throughput (improvement)

Inference Functions:
No changes detected in core inference functions:

  • llama_decode: No modification
  • llama_encode: No modification
  • llama_tokenize: No modification

Tokens Per Second Impact

Expected Impact: None

The PR does not modify inference execution paths for models without temperature scaling parameters. For Mistral Large models with scaling enabled, the added ggml_mul operation adds approximately 1-2 ns per layer per token, resulting in less than 0.001% throughput impact for typical inference workloads.

Reference: The 7% tokens/second reduction baseline (smollm:135m on i7-1255U) requires 2 ms degradation in llama_decode. This PR adds no measurable latency to decode operations.

Key Findings

The performance variations observed in the analysis are compiler-level STL template optimizations unrelated to the functional changes in this PR. The actual code modifications introduce:

  1. Conditional feature addition: Temperature scaling is only active when model metadata specifies non-zero f_attn_temp_scale
  2. Minimal computational overhead: Single element-wise multiplication per attention layer when enabled
  3. No baseline impact: Models without temperature scaling parameters execute identical code paths

The 0.078% power consumption increase in libllama.so reflects binary size growth from added code paths rather than runtime overhead. The STL function regressions (147% in __iter_equals_val<char>) are compiler optimization artifacts affecting template instantiation, not related to the attention scaling implementation.

Inference Impact: Zero for existing models; negligible (sub-microsecond per token) for Mistral Large with scaling enabled.

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from df48f9e to cb46586 Compare December 6, 2025 12:13
@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from ef7afbe to d4c3480 Compare February 14, 2026 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants