[Model] Add MiniCPM-V 4.6 support#24855
Merged
Merged
Conversation
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Collaborator
Author
|
/tag-and-rerun-ci |
Collaborator
Author
|
/rerun-failed-ci, try again |
5 tasks
ltcs11
added a commit
to ltcs11/sglang
that referenced
this pull request
May 11, 2026
* main: (87 commits) [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629) fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217) Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742) Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785) [NPU]Documentation update for communications quantization feature (sgl-project#24668) [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851) [Model] Add MiniCPM-V 4.6 support (sgl-project#24855) Support Intern-S2-Preview (sgl-project#24875) [PD] Unify dsv4 dispatch with swa (sgl-project#24888) Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775) Fix PD bootstrap failure handling (sgl-project#24772) [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881) [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878) [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859) [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696) [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826) [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456) [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854) [diffusion] CI: add cache-dit CI tests (sgl-project#19213) [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767) ... # Conflicts: # python/sglang/srt/utils/common.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Add support for MiniCPM-V 4.6, the next iteration of the MiniCPM-V series. Compared to 4.5 (Qwen3 + SigLip + Perceiver-style resampler) the architecture changes are:
config.insert_layer_id.MiniCPMV_Merger) replacing the legacy Perceiver resampler.config.downsample_modetoggles"16x"(mid-ViT + post merger, default) vs"4x"(skip mid-ViT, keep 4x more visual tokens).Usage
Cookbook PR: #24876
Note
--dtype bfloat16is required here ckpt'sconfig.jsonmay not declare atorch_dtype/dtype; without--dtype bfloat16, sglang's default dtype-resolution path triggers a bf16/fp16 conditional-load mismatch in the GDNcausal_conv1dTriton kernel. Reproduced on the current sgl-kernel 0.4.2.post1.Example image+text request (OpenAI-compatible):
chat_template_kwargs.enable_thinkingtoggles 4.6's reasoning mode:true(default) emits a<think>...</think>block routed toreasoning_content;falseskips reasoning and routes the answer straight intocontent. Multi-image and video are passed as additionalimage_url/video_urlparts in the samecontentlist.See more in the MiniCPM-V 4.6 cookbook.
Accuracy Tests
MMMU val (900): T=0.6 sampling, #24084 extractor, ~37-39% across trials (consistent with the reference number on the same ckpt from MiniCPM team).
Speed Tests and Profiling
sglang.bench_servingon the open test ckpt, 1× H200, BF16,--mem-fraction-static 0.5, no TP/DP.--chunked-prefill-size -1):Full commands and numbers will be in the MiniCPM-V 4.6 cookbook.
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci