[model] add support for ERNIE 4.5 VL MoE #19475
isLinXu wants to merge 8 commits into ggml-org:master
Conversation
```cpp
#include <map>
#include <memory>
#include <string>
#include <set>
```
Please cleanup the llama-arch.h changes - remove all the whitespace modifications.
```cpp
#define GGML_ROPE_TYPE_MROPE   8
#define GGML_ROPE_TYPE_VISION  24
#define GGML_ROPE_TYPE_IMROPE  40 // binary: 101000
#define GGML_ROPE_TYPE_ERNIE3D 72 // binary: 1001000, ERNIE-VL 3D RoPE (NORMAL rotation + interleaved h/w freq)
```
the ROPE_TYPE system is quite fragile now and I think we should always reflect twice before adding a new mode.
It seems like interleaved h/w freq is already supported by Pixtral model, please verify one more time if you can reuse the code from Pixtral instead of adding a new rope kernel here.
Thanks for the heads-up. I completely agree that we should be cautious with the ROPE_TYPE system. I’ll re-examine the Pixtral implementation to see if we can reuse its interleaved frequency logic instead of adding a new kernel.
Thanks for the feedback. I’ve conducted a detailed mathematical comparison between Pixtral’s build_rope_2d and the ERNIE implementation. It turns out they are mathematically incompatible, and direct reuse would result in incorrect positional embeddings.
Below is the technical breakdown:
| Feature | Pixtral `build_rope_2d` | ERNIE (Vision / LLM) |
|---|---|---|
| Rotation mode | NORMAL (adjacent pairs) | NEOX (half-dimension offset) |
| Freq. allocation | 2-way interleaved (via `freq_scale_odd`) | Sectional (2D) / 3-way interleaved (3D) |
| Theta accumulation | Continuous across the head | Independent reset per section |
| Dimensionality | 2D (h, w) only | 3D (t, h, w) |
| Implementation | Dual `rope_ext` + concat | `ggml_rope_multi` with mrope 4-slot sections |
Key technical differences:

- Mathematical incompatibility: Pixtral uses NORMAL rotation, whereas ERNIE follows the NEOX convention (commonly used in Vision Transformers). Since the pairing of dimensions differs, swapping them would break the model's spatial understanding.
- Frequency mapping: Pixtral achieves interleaved frequencies by applying a `freq_scale` to one half of the dimensions. ERNIE uses `sections [20, 20, 0, 0]` to strictly block frequencies, where each section starts its theta accumulation independently from $base^0$.
Regarding the complexity of the ROPE_TYPE system:

- Vision side: we are actually using the existing `GGML_ROPE_TYPE_VISION`; no new mode is introduced here.
- LLM side: the new `GGML_ROPE_TYPE_ERNIE3D` is a strict requirement to support the temporal (t) dimension. Current 2D implementations (like Pixtral) cannot handle this 3D mapping.
Conclusion:
To maintain mathematical correctness and support 3D RoPE, we cannot reuse the Pixtral logic. The new ERNIE3D type is the minimum necessary change to support these specific requirements. I will ensure the implementation is as modular as possible to keep the system maintainable.
If the difference is just the NORMAL vs NEOX style, you can also permute the Q and K tensors upon converting to GGUF.
Kimi 2.5 also does exactly this; you can copy the conversion code from #19170.
Also, just a friendly reminder: we don't allow replying to human maintainers with AI-generated responses. Please write the response in your own words, to prove that you fully understand your code.
tools/mtmd/clip-model.h
Outdated
```cpp
// ernie4.5-vl-moe
ggml_tensor * mm_spatial_0_w    = nullptr;
ggml_tensor * mm_spatial_0_b    = nullptr;
ggml_tensor * mm_spatial_2_w    = nullptr;
ggml_tensor * mm_spatial_2_b    = nullptr;
ggml_tensor * mm_spatial_norm_w = nullptr;
ggml_tensor * mm_spatial_norm_b = nullptr;
ggml_tensor * mm_temp_0_w       = nullptr;
ggml_tensor * mm_temp_0_b       = nullptr;
ggml_tensor * mm_temp_2_w       = nullptr;
ggml_tensor * mm_temp_2_b       = nullptr;
ggml_tensor * mm_temp_norm_w    = nullptr;
ggml_tensor * mm_temp_norm_b    = nullptr;
ggml_tensor * mm_mlp_w          = nullptr;
ggml_tensor * mm_mlp_b          = nullptr;
ggml_tensor * mm_after_norm_w   = nullptr;
```
I don't think adding these tensors is needed.
Spatial patch merge is nothing new; we already support many models using the same strategy. Please reuse the existing tensor naming and code infrastructure.
tools/mtmd/models/ernie45vlmoe.cpp
Outdated
```cpp
ggml_tensor * spatial_0_w = ggml_cont(ctx0, ggml_transpose(ctx0, model.mm_0_w));
spatial_out = ggml_mul_mat(ctx0, spatial_0_w, spatial_out);
spatial_out = ggml_add(ctx0, spatial_out, model.mm_0_b);
cb(spatial_out, "spatial_linear_0", -1);

// GELU
spatial_out = ggml_gelu(ctx0, spatial_out);
cb(spatial_out, "spatial_gelu", -1);

// Second linear
ggml_tensor * spatial_2_w = ggml_cont(ctx0, ggml_transpose(ctx0, model.mm_2_w));
spatial_out = ggml_mul_mat(ctx0, spatial_2_w, spatial_out);
spatial_out = ggml_add(ctx0, spatial_out, model.mm_2_b);
cb(spatial_out, "spatial_linear_2", -1);
```
this can be reduced to build_ffn
tools/mtmd/models/ernie45vlmoe.cpp
Outdated
```cpp
ggml_tensor * spatial_out = embeddings;

// First linear
ggml_tensor * spatial_0_w = ggml_cont(ctx0, ggml_transpose(ctx0, model.mm_0_w));
```
any transposes to the weight must be done upon conversion
tools/mtmd/models/ernie45vlmoe.cpp
Outdated
```cpp
resampler_out = ggml_concat(ctx0, resampler_out, resampler_out, 0);

// Temporal linear path: Linear -> GELU -> Linear -> LayerNorm
// Weights were transposed (.t()) during GGUF conversion, undo with ggml_transpose
```
hmm, why transpose it during conversion, then have to transpose it again here?
Unless my math is broken somehow, transpose(transpose(A)) is identical to just using A without any transposes.
tools/mtmd/models/ernie45vlmoe.cpp
Outdated
```cpp
ggml_tensor * temp_0_w = ggml_cont(ctx0, ggml_transpose(ctx0, model.mm_1_w));
resampler_out = ggml_mul_mat(ctx0, temp_0_w, resampler_out);
resampler_out = ggml_add(ctx0, resampler_out, model.mm_1_b);
cb(resampler_out, "temporal_linear_0", -1);

// GELU
resampler_out = ggml_gelu(ctx0, resampler_out);
cb(resampler_out, "temporal_gelu", -1);

// Second temporal linear
ggml_tensor * temp_2_w = ggml_cont(ctx0, ggml_transpose(ctx0, model.mm_3_w));
resampler_out = ggml_mul_mat(ctx0, temp_2_w, resampler_out);
resampler_out = ggml_add(ctx0, resampler_out, model.mm_3_b);
cb(resampler_out, "temporal_linear_2", -1);
```
```cpp
ggml_tensor * moe_out = nullptr;

// Use vision experts for vision tokens, text experts for text tokens
if (ubatch.embd) {
```
This may make the graph non-static, thus reducing overall performance.
Instead, you should follow the same optimization with `ggml_build_forward_select`; see the example in `llm_graph_context::build_inp_embd`.
```cpp
if (!ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 4, false)) {
    hparams.rope_sections[0] = 22;
    hparams.rope_sections[1] = 22;
    hparams.rope_sections[2] = 20;
    hparams.rope_sections[3] = 0;
}
```
Since we are not trying to be backward-compatible here, I think it's better not to hard-code any default values. The metadata `LLM_KV_ROPE_DIMENSION_SECTIONS` must be a requirement.
```cpp
    hparams.rope_sections[3] = 0;
}

LLAMA_LOG_INFO("%s: ERNIE-VL rope_sections=[%d,%d,%d,%d]\n", __func__,
```
logging must not be done here
```cpp
case LLM_ARCH_ERNIE4_5_VL_MOE:
    return LLAMA_ROPE_TYPE_ERNIE3D;
```
I wonder how this differs from Qwen's IMROPE (the interleaved mrope version added in Qwen 3).
In any case, I'm pretty sure that we don't need to add any new rope modes. Any more complex mode can be implemented in 2 ways (or a combination of the 2):

- Permute the Q and K tensors upon conversion
- Frequency per-dimension can be controlled by the custom `freq_factors`, i.e. the `c` tensor in `ggml_rope_ext`
Permute Q and K to the Qwen2 ordering, then use `freq_factors` to correct the theta_base.
I don't think we can allow adding yet another rope mode here; this part of the code is already fragile enough.
This PR adds model support for the ERNIE 4.5 family of models from Baidu, including Dense (ernie4_5), MoE (ernie4_5-moe), and Vision-Language MoE (ernie4_5-vl-moe) variants. It has been verified in both vision and pure-text modes.
Dual MoE (Text + Vision Experts): The core innovation of ERNIE 4.5 VL MoE is its dual expert system within the same LLM backbone. MoE layers maintain two separate sets of experts — one for text tokens and one for vision tokens — dynamically routed based on input modality. Vision experts use a significantly smaller FFN intermediate size (default 512) compared to text experts, reflecting a compact representation design for visual features.
Interleaved Dense/MoE layers: Controlled by n_layer_dense_lead and n_moe_layer_step, the first few layers are Dense, and MoE layers are interleaved at a configurable step interval. A shared expert (SwiGLU FFN) is added on top of MoE output for both modalities.
ERNIE3D RoPE: A new RoPE type (GGML_ROPE_TYPE_ERNIE3D = 72) designed for multimodal use, with an interleaved 3D frequency layout encoding height/width/temporal dimensions (sections [22, 22, 20, 0]), distinct from the contiguous segmentation used by standard M-RoPE.
Vision Encoder: Standard ViT with 2D M-RoPE (no learned positional embeddings), using SwiGLU FFN in each transformer layer.
Vision Projector: A spatial + temporal resampler pipeline:

- 2×2 spatial patch merging (4× token reduction)
- Spatial linear path (Linear → GELU → Linear → LayerNorm)
- Temporal path (optional, for video frames; single images use self-concatenation)
- Final MLP + RMS Norm projection to LLM embedding space