[Model] Add MiniCPM-V 4.6 support by AgainstEntropy · Pull Request #24855 · sgl-project/sglang

AgainstEntropy · 2026-05-09T21:20:20Z

Motivation

Add support for MiniCPM-V 4.6, the next iteration of the MiniCPM-V series. Compared to 4.5 (Qwen3 + SigLIP + Perceiver-style resampler), the architecture changes are:

Mid-ViT compression: 2x2 window attention plus a 2x2 spatial fold, inserted inside the SigLIP encoder at layer config.insert_layer_id.
Post-encoder MLP merger (MiniCPMV_Merger) replacing the legacy Perceiver resampler.
Qwen3.5 hybrid backbone in the LLM tower.
config.downsample_mode selects "16x" (mid-ViT compression plus the post-encoder merger; the default) or "4x" (skips mid-ViT compression and keeps 4x more visual tokens).
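A minimal sketch of the token arithmetic behind the two modes, under stated assumptions: the mid-ViT step is modeled as a space-to-depth fold that concatenates each 2x2 neighborhood of patch tokens into one wider token (the window attention is not modeled), and the post-encoder merger is assumed to contribute the remaining 4x, since "16x" and "4x" differ only in whether mid-ViT compression runs. fold_2x2, visual_tokens and the 32x32 grid are hypothetical and are not MiniCPM-V or SGLang code.

```python
from typing import List

Token = List[float]
Grid = List[List[Token]]  # [rows][cols] -> feature vector


def fold_2x2(grid: Grid) -> Grid:
    """Hypothetical space-to-depth fold: merge each 2x2 neighborhood into one
    token by concatenating its four feature vectors (row-major order assumed)."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid dimensions must be even for a 2x2 fold")
    out: Grid = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            merged = grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
            out_row.append(merged)
        out.append(out_row)
    return out


def visual_tokens(patch_rows: int, patch_cols: int, downsample_mode: str = "16x") -> int:
    """Visual token count, assuming each compression stage is a 2x2 (4x) reduction."""
    tokens = patch_rows * patch_cols
    if downsample_mode == "16x":
        tokens //= 4  # mid-ViT 2x2 fold inside the encoder
    elif downsample_mode != "4x":
        raise ValueError(f"unknown downsample_mode: {downsample_mode!r}")
    tokens //= 4  # post-encoder merger (assumed 2x2)
    return tokens


if __name__ == "__main__":
    grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]  # 4x4 grid, 1-dim features
    folded = fold_2x2(grid)
    print(len(folded), len(folded[0]), len(folded[0][0]))  # 2 2 4
    print(visual_tokens(32, 32, "16x"), visual_tokens(32, 32, "4x"))  # 64 256
```

A space-to-depth fold keeps every patch feature and trades sequence length for feature width; how the real model projects the widened tokens back down is not described in this PR.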

Usage

Cookbook PR: #24876

sglang serve --model-path openbmb/MiniCPM-V-4_6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --mem-fraction-static 0.15 \
  --mamba-scheduler-strategy extra_buffer \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 --port 30000

Note: --dtype bfloat16 is required here because the checkpoint's config.json may not declare torch_dtype / dtype. Without --dtype bfloat16, SGLang's default dtype-resolution path triggers a bf16/fp16 conditional-load mismatch in the GDN causal_conv1d Triton kernel. Reproduced on the current sgl-kernel 0.4.2.post1.
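The checkpoint's config.json can be inspected before launching. A small sketch, assuming a local copy of the file and that a multimodal config may nest the LLM settings under text_config (an assumption, not confirmed by this PR); dtype_from_config and declared_dtype are hypothetical helpers. Passing --dtype bfloat16 remains the recommended setting either way.

```python
import json
import sys
from pathlib import Path
from typing import Any, Dict, Optional, Union


def dtype_from_config(config: Dict[str, Any]) -> Optional[str]:
    """Return the first declared dtype, checking the top level first and then
    a nested text_config (assumed layout for multimodal checkpoints)."""
    for section in (config, config.get("text_config") or {}):
        for key in ("torch_dtype", "dtype"):
            value = section.get(key)
            if value:
                return str(value)
    return None


def declared_dtype(config_path: Union[str, Path]) -> Optional[str]:
    """Load config.json from disk and report its declared dtype, if any."""
    return dtype_from_config(json.loads(Path(config_path).read_text()))


if __name__ == "__main__":
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "config.json")
    if not path.is_file():
        print(f"{path} not found; pass the path to the checkpoint's config.json")
    else:
        dtype = declared_dtype(path)
        if dtype is None:
            print("config.json declares no dtype: launch with --dtype bfloat16")
        else:
            print(f"config.json declares {dtype}; --dtype bfloat16 is still recommended")
```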

Example image+text request (OpenAI-compatible):

curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-V-4_6",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "text", "text": "What is in this image?"}
    ]}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

chat_template_kwargs.enable_thinking toggles 4.6's reasoning mode: true (the default) emits a <think>...</think> block, which the qwen3 reasoning parser routes to reasoning_content; false skips reasoning and routes the answer straight into content. Multi-image and video inputs are passed as additional image_url / video_url parts in the same content list.
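The same request can be built programmatically. A sketch using only the Python standard library: build_chat_request, split_answer and post_chat are hypothetical helper names; the payload mirrors the curl example above, extra images and videos are added as image_url / video_url parts ahead of the text part (the ordering is an assumption taken from that example), and split_answer reads the reasoning_content and content fields described above.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional, Tuple


def build_chat_request(
    text: str,
    image_urls: Optional[List[str]] = None,
    video_urls: Optional[List[str]] = None,
    enable_thinking: bool = True,
    model: str = "openbmb/MiniCPM-V-4_6",
) -> Dict[str, Any]:
    """Build an OpenAI-compatible chat payload: media parts first, then the text part."""
    content: List[Dict[str, Any]] = []
    for url in image_urls or []:
        content.append({"type": "image_url", "image_url": {"url": url}})
    for url in video_urls or []:
        content.append({"type": "video_url", "video_url": {"url": url}})
    content.append({"type": "text", "text": text})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def split_answer(response: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content) from the first choice of a chat completion."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def post_chat(payload: Dict[str, Any], base_url: str = "http://127.0.0.1:30000") -> Dict[str, Any]:
    """POST the payload to /v1/chat/completions; requires a running server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With enable_thinking=False, reasoning_content is expected to be empty or absent, so the answer arrives in the second element returned by split_answer(post_chat(payload)).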

See more in the MiniCPM-V 4.6 cookbook.

Accuracy Tests

python benchmark/mmmu/bench_sglang.py --port 30000 --concurrency 32 --max-new-tokens 2048

MMMU val (900 questions): sampling at T=0.6, answers extracted with the extractor from #24084; accuracy is ~37-39% across trials, consistent with the MiniCPM team's reference number on the same checkpoint.

Speed Tests and Profiling

Measured with sglang.bench_serving on the open test checkpoint: 1× H200, BF16, --mem-fraction-static 0.5, no TP/DP.

Text-only (random 1000 input / 1000 output tokens):

Concurrency	Req/s	Input tok/s	Output tok/s	Median TTFT	Median TPOT	Median E2E
1	1.34	816	565	104 ms	1.44 ms	590 ms
100	21.24	10,675	10,628	185 ms	9.16 ms	4.3 s

Vision (random 720p image + 128 input / 1024 output tokens, --chunked-prefill-size -1):

Concurrency	Req/s	Input tok/s	Output tok/s	Median TTFT	Median TPOT	Median E2E
1	0.97	75	411	403 ms	1.44 ms	898 ms
100	2.78	222	1,419	34.3 s	1.45 ms	35.3 s
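The tables report rates, not per-request lengths. Dividing a token rate by Req/s gives the mean tokens per completed request implied by each row, a quick sanity check against the nominal 1000/1000 and 128/1024 settings. The numbers below are copied from the tables; the helper name is hypothetical, and the exact bench_serving flags that would explain any gap are deferred to the cookbook.

```python
def implied_tokens_per_request(tokens_per_s: float, requests_per_s: float) -> float:
    """Mean tokens per completed request implied by a token rate and a request rate."""
    if requests_per_s <= 0:
        raise ValueError("requests_per_s must be positive")
    return tokens_per_s / requests_per_s


# (label, req/s, input tok/s, output tok/s) copied from the tables above
ROWS = [
    ("text, c=1", 1.34, 816, 565),
    ("text, c=100", 21.24, 10_675, 10_628),
    ("vision, c=1", 0.97, 75, 411),
    ("vision, c=100", 2.78, 222, 1_419),
]

for label, rps, in_tps, out_tps in ROWS:
    print(
        f"{label:14s} input~{implied_tokens_per_request(in_tps, rps):5.0f}"
        f"  output~{implied_tokens_per_request(out_tps, rps):5.0f}"
    )
```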

Full commands and numbers will be in the MiniCPM-V 4.6 cookbook.

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-05-09T21:20:24Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

AgainstEntropy · 2026-05-09T21:21:05Z

/tag-and-rerun-ci

JustinTong0323

Great Job!!

AgainstEntropy · 2026-05-09T23:23:22Z

/rerun-failed-ci, try again

* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#   python/sglang/srt/utils/common.py

[model] Add MiniCPM-V 4.6 support

7f86517

AgainstEntropy requested review from JustinTong0323, mickqian, yhyang201 and yuan-luo as code owners May 9, 2026 21:20

github-actions Bot added the run-ci label May 9, 2026

JustinTong0323 approved these changes May 9, 2026


Enhance MiniCPM-V 4.6 support with mamba radix cache handling

0708d0a

AgainstEntropy mentioned this pull request May 10, 2026

[Docs] Add MiniCPM-V 4.6 cookbook #24876 (Merged)

AgainstEntropy merged commit 9150e77 into sgl-project:main May 10, 2026
119 of 130 checks passed

AgainstEntropy merged 2 commits into sgl-project:main from AgainstEntropy:feat/minicpm-v

2 participants
