UPSTREAM PR #17744: model: add llama 4 scaling for mistral-large (deepseek arch) by loci-dev · Pull Request #423 · auroralabs-loci/llama.cpp

loci-dev · 2025-12-03T19:34:32Z

This should allow Mistral Large to go past 16K context length (hopefully, someone with enough VRAM can verify if this works or not)

loci-review · 2025-12-03T20:21:56Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #423

Overview

PR #423 adds Llama 4 attention temperature scaling support for Mistral Large models using the DeepSeek2 architecture. The changes enable context lengths beyond 16K by implementing optional temperature tuning parameters.

Modified Files: 2 files (src/llama-model.cpp, src/models/deepseek2.cpp)
Lines Changed: +22 additions, 0 deletions
Performance Impact: Negligible

Code Changes Analysis

1. Model Loading (`llama-model.cpp`)

Added two optional hyperparameter loads:

LLM_KV_ATTENTION_TEMPERATURE_SCALE → hparams.f_attn_temp_scale
LLM_KV_ATTENTION_TEMPERATURE_LENGTH → hparams.n_attn_temp_floor_scale

These parameters are loaded conditionally during model initialization for DeepSeek2-based architectures.

2. Inference Graph Construction (`deepseek2.cpp`)

Added conditional attention scaling logic:

Creates inp_attn_scale tensor when f_attn_temp_scale != 0.0f
Applies scaling to query tensors (Qcur) via ggml_mul operation in two attention paths:
- MLA with absorption optimization (MQA path)
- MLA without absorption optimization (MHA path)

Performance Metrics

Power Consumption:

libllama.so: +0.078% (+152 nJ absolute)
All other binaries: No change

Function-Level Changes:
The observed performance variations are in STL template functions unrelated to this PR:

__iter_equals_val<char>: +137 ns throughput
vector::end: +135 ns throughput
_Hash_code_base::_M_bucket_index: -29 ns throughput (improvement)

Inference Functions:
No changes detected in core inference functions:

llama_decode: No modification
llama_encode: No modification
llama_tokenize: No modification

Tokens Per Second Impact

Expected Impact: None

The PR does not modify inference execution paths for models without temperature scaling parameters. For Mistral Large models with scaling enabled, the added ggml_mul operation adds approximately 1-2 ns per layer per token, resulting in less than 0.001% throughput impact for typical inference workloads.

Reference: The 7% tokens/second reduction baseline (smollm:135m on i7-1255U) requires 2 ms degradation in llama_decode. This PR adds no measurable latency to decode operations.

Key Findings

The performance variations observed in the analysis are compiler-level STL template optimizations unrelated to the functional changes in this PR. The actual code modifications introduce:

Conditional feature addition: Temperature scaling is only active when model metadata specifies non-zero f_attn_temp_scale
Minimal computational overhead: Single element-wise multiplication per attention layer when enabled
No baseline impact: Models without temperature scaling parameters execute identical code paths

The 0.078% power consumption increase in libllama.so reflects binary size growth from added code paths rather than runtime overhead. The STL function regressions (147% in __iter_equals_val<char>) are compiler optimization artifacts affecting template instantiation, not related to the attention scaling implementation.

Inference Impact: Zero for existing models; negligible (sub-microsecond per token) for Mistral Large with scaling enabled.

model: add llama 4 scaling for mistral-large (deepseek arch)

49d2305

loci-dev temporarily deployed to PROD__AL_DEMO December 3, 2025 19:34 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 920e7af to dba8180 Compare December 3, 2025 20:10

loci-dev force-pushed the main branch 26 times, most recently from df48f9e to cb46586 Compare December 6, 2025 12:13

loci-dev force-pushed the main branch 27 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32

loci-dev force-pushed the main branch 3 times, most recently from ef7afbe to d4c3480 Compare February 14, 2026 02:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #17744: model: add llama 4 scaling for mistral-large (deepseek arch)#423

UPSTREAM PR #17744: model: add llama 4 scaling for mistral-large (deepseek arch)#423
loci-dev wants to merge 1 commit intomainfrom
upstream-PR17744-branch_ngxson-xsn/mistral_large_scaling

loci-dev commented Dec 3, 2025

Uh oh!

loci-review bot commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Dec 3, 2025

Uh oh!

loci-review bot commented Dec 3, 2025

Performance Analysis Summary: PR #423

Overview

Code Changes Analysis

1. Model Loading (llama-model.cpp)

2. Inference Graph Construction (deepseek2.cpp)

Performance Metrics

Tokens Per Second Impact

Key Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Model Loading (`llama-model.cpp`)

2. Inference Graph Construction (`deepseek2.cpp`)