
[Model] Add support for openPangu moe model #28775

Merged
vllm-bot merged 26 commits into vllm-project:main from
yt0428:Add_support_for_openpangu_promoe_v2
Dec 30, 2025

Conversation

@yt0428
Contributor

@yt0428 yt0428 commented Nov 15, 2025

Purpose

This PR adds support for the openPangu MoE model, which is characterized by different key/value head sizes and sink KV in its attention.

The model has two new features that are not yet supported:

  1. Different key head size and value head size.
    Although the flash_attn kernel can handle different key/value head sizes, the current vLLM framework does not expose this as an option. In the implemented FlashSinkAttentionBackend, kv_cache_shape is defined as [num_blocks, block_size, num_kv_heads, head_size_k + head_size_v]. The corresponding KV cache update function, reshape_and_cache_kernel_flash_diffkv, is implemented in Triton.
  2. sink_key and sink_value in attention.
    The model's attention module receives two extra arguments, sink_key and sink_value, which are learned during training and shared by all inputs. In this initial implementation, I store them in the first blocks of the block pool and remove those blocks from the free blocks, so they cannot be scheduled and are never overwritten. During the forward pass of FlashSinkAttentionBackend, the block IDs of sink_key and sink_value are concatenated onto the normal block_table so that attention is computed correctly.
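The two ideas above can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the PR's actual code: all shapes and the tensor names (`kv_cache`, `sink_block_ids`, `block_table`) are made up for demonstration.

```python
import torch

# Hypothetical sizes; the real values come from the model config.
num_blocks, block_size, num_kv_heads = 16, 128, 4
head_size_k, head_size_v = 192, 128

# Feature 1: K and V with different head sizes packed into one cache
# tensor of shape [num_blocks, block_size, num_kv_heads, hk + hv].
kv_cache = torch.zeros(num_blocks, block_size, num_kv_heads,
                       head_size_k + head_size_v)

# Slicing the last dim recovers the two logical caches without copying.
k_cache = kv_cache[..., :head_size_k]   # [..., 192]
v_cache = kv_cache[..., head_size_k:]   # [..., 128]

# Feature 2: sink KV lives in reserved leading block(s); each request's
# block table is prepended with those block IDs before attention.
num_sink_blocks = 1
sink_block_ids = torch.arange(num_sink_blocks)       # tensor([0])
block_table = torch.tensor([[3, 7, 9]])              # one request
block_table_with_sink = torch.cat(
    [sink_block_ids.expand(block_table.shape[0], -1), block_table],
    dim=1,
)
print(block_table_with_sink)  # tensor([[0, 3, 7, 9]])
```

Because the sink blocks are removed from the free-block list, no scheduler decision can ever hand block 0 to a request, which is what keeps the shared sink KV from being overwritten.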

11.26 Update:
We refactored the code to separate the two features above. Specifically:

  1. We renamed FlashSinkAttentionBackend to FlashDiffkvAttentionBackend, which is modified from FlashAttentionBackend to support different head_size for key and value.
  2. We moved the sink_key-related logic into GPUModelRunner, where we primarily do two things. First, we store sink_key and sink_value into the kv_caches during KV cache initialization; this is implemented by adding a prepare_sink_kv_cache function called from initialize_kv_cache. Second, we modify blk_table_tensor and seq_lens in place in _build_attention_metadata, so that attention backends know that sink_key and sink_value are present in the kv_caches.
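The two refactored steps might look roughly like the following. This is a hedged sketch following the names in the description (`prepare_sink_kv_cache` is named in the PR; every shape and the `seq_lens` adjustment shown here are illustrative assumptions, not vLLM's real signatures).

```python
import torch

block_size, num_kv_heads, packed_dim = 128, 4, 320  # head_size_k + head_size_v

def prepare_sink_kv_cache(kv_cache: torch.Tensor,
                          sink_kv: torch.Tensor,
                          num_sink_tokens: int) -> None:
    # Step 1 (sketch): copy the learned sink key/value into the reserved
    # first block during KV cache initialization.
    kv_cache[0, :num_sink_tokens] = sink_kv

kv_cache = torch.zeros(8, block_size, num_kv_heads, packed_dim)
sink_kv = torch.randn(4, num_kv_heads, packed_dim)  # 4 learned sink tokens
prepare_sink_kv_cache(kv_cache, sink_kv, num_sink_tokens=4)

# Step 2 (sketch): when building attention metadata, bump seq_lens in
# place so the backend also attends over the sink tokens.
seq_lens = torch.tensor([17, 33, 5])
seq_lens += 4  # account for the 4 sink tokens
```

Doing the adjustment in-place inside metadata construction keeps the attention backend itself unaware of sink KV; from its point of view, every sequence simply starts with a few extra cached tokens.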

The current implementation is functional but not especially elegant. Suggestions from anyone familiar with these parts of vLLM are very much appreciated. Many thanks!

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…erent kv head size and sink kv in attention.

Signed-off-by: yuantao <2422264527@qq.com>
@mergify

mergify bot commented Nov 15, 2025

Documentation preview: https://vllm--28775.org.readthedocs.build/en/28775/

@mergify mergify bot added the documentation, new-model, and v1 labels Nov 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the openPangu_Pro_Moe_v2 model, which introduces a new attention mechanism with sink KV caches. The changes are extensive, touching the model definition, attention layers, and KV cache management. I've identified a couple of critical bugs in the implementation concerning tensor initialization and block table manipulation in the new attention backend. Additionally, I've suggested a refactoring to improve code maintainability by reducing duplication. Addressing these points will enhance the correctness and robustness of the new model support.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

… refactor forward in unified_attention_with_output

Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yuantao <2422264527@qq.com>
@yt0428 yt0428 changed the title from [Model] Add support for openPangu_Pro_Moe_v2 to [Model] Add support for openPangu moe model Nov 17, 2025
@mergify

mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yt0428.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 19, 2025
@yt0428 yt0428 closed this Nov 20, 2025
@yt0428 yt0428 reopened this Nov 21, 2025
@mergify

mergify bot commented Nov 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yt0428.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Nov 25, 2025
@DarkLight1337
Member

@LucasWilkinson @zou3519 @mgoin can you review the attention implementation?

…to GPUModelRunner

Signed-off-by: yuantao <2422264527@qq.com>
@yt0428
Contributor Author

yt0428 commented Dec 25, 2025

@LucasWilkinson Hello, I notice that there are two failing checks in CI, but the failing code seems unrelated to our modification. What can I do about this?

@DarkLight1337
Member

Retrying the failed test

@yt0428
Contributor Author

yt0428 commented Dec 29, 2025

Retrying the failed test

Hello, I tried to run the failing case in my local environment and it seems to work well, and the failing code is unrelated to our modification. So I wonder if there are any bugs I have missed.

My test results are as follows:

(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:45 [core.py:93] Initializing a V1 LLM engine (v0.13.0rc2.dev117+g378e20833) with config: model='eagle618/deepseek-v3-random', speculative_config=None, tokenizer='eagle618/deepseek-v3-random', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=eagle618/deepseek-v3-random, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': 
True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 32, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=2866873) WARNING 12-29 02:05:45 [multiproc_executor.py:884] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:55 [parallel_state.py:1203] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:44023 backend=nccl
(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:55 [parallel_state.py:1203] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:44023 backend=nccl
(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:56 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=2866873) WARNING 12-29 02:06:03 [symm_mem.py:107] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(EngineCore_DP0 pid=2866873) WARNING 12-29 02:06:03 [symm_mem.py:107] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:04 [parallel_state.py:1411] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:04 [parallel_state.py:1411] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:05 [gpu_model_runner.py:3562] Starting to load model eagle618/deepseek-v3-random...
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [cuda.py:412] Using FLASH_ATTN_MLA attention backend out of potential backends: ['FLASH_ATTN_MLA', 'FLASHMLA', 'TRITON_MLA']
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [layer.py:372] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [layer.py:492] [EP Rank 1/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 36/72. Experts local to global index map: 0->36, 1->37, 2->38, 3->39, 4->40, 5->41, 6->42, 7->43, 8->44, 9->45, 10->46, 11->47, 12->48, 13->49, 14->50, 15->51, 16->52, 17->53, 18->54, 19->55, 20->56, 21->57, 22->58, 23->59, 24->60, 25->61, 26->62, 27->63, 28->64, 29->65, 30->66, 31->67, 32->68, 33->69, 34->70, 35->71.
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [fp8.py:205] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [cuda.py:412] Using FLASH_ATTN_MLA attention backend out of potential backends: ['FLASH_ATTN_MLA', 'FLASHMLA', 'TRITON_MLA']
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [layer.py:372] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [layer.py:492] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 36/72. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31, 32->32, 33->33, 34->34, 35->35.
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) WARNING 12-29 02:06:06 [fp8.py:186] DeepGEMM backend requested but not available.
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [fp8.py:205] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:07 [gpu_model_runner.py:3659] Model loading took 2.5355 GiB memory and 1.469344 seconds
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:10 [backends.py:632] vLLM's torch.compile cache is disabled.
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:10 [backends.py:694] Dynamo bytecode transform time: 2.03 s
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) WARNING 12-29 02:06:21 [fused_moe.py:888] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/yuantao/code/vllm/vllm/model_executor/layers/fused_moe/configs/E=36,N=1536,device_name=NVIDIA_H800_PCIe,dtype=fp8_w8a8,block_shape=[128,128].json']
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) WARNING 12-29 02:06:21 [fused_moe.py:888] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/yuantao/code/vllm/vllm/model_executor/layers/fused_moe/configs/E=36,N=1536,device_name=NVIDIA_H800_PCIe,dtype=fp8_w8a8,block_shape=[128,128].json']
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:27 [backends.py:278] Compiling a graph for compile range (1, 4096) takes 16.11 s
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:27 [monitor.py:34] torch.compile takes 18.14 s in total
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:28 [gpu_worker.py:375] Available KV cache memory: 72.66 GiB
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:28 [kv_cache_utils.py:1291] GPU KV cache size: 13,545,216 tokens
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:28 [kv_cache_utils.py:1296] Maximum concurrency for 4,096 tokens per request: 3306.94x
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) 2025-12-29 02:06:28,576 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) 2025-12-29 02:06:28,576 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) 2025-12-29 02:06:28,646 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) 2025-12-29 02:06:28,646 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  86%|██████████████████████████████████████████████████████████████▌          | 6/7 [00:04<00:00,  1.26it/s](EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:37 [custom_all_reduce.py:216] Registering 132 cuda graph addresses
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.41s/it]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  3.90it/s]
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:40 [custom_all_reduce.py:216] Registering 132 cuda graph addresses
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:40 [gpu_model_runner.py:4610] Graph capturing finished in 12 secs, took 0.09 GiB
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:40 [core.py:259] init engine (profile, create kv cache, warmup model) took 32.72 seconds
INFO 12-29 02:06:42 [llm.py:359] Supported tasks: ['generate']
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 222.30it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  3.40it/s, est. speed input: 64.64 toks/s, output: 340.20 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    '{$ды Compar cardiovascular身的 غ Rats Bornowed坐怎么办 enticingICAgICAg Lara减速明白了 Definitions�過 deme想知道上手 donate ετυμολογία刀 men痕迹 monsterSuch跟著而又 abs Holycipl DEF心虚 basicallyรวม-you我们将 Sohn protección continentalroot WTO feeder_counter偌 finest遭受ppletele的文化 Lib白细胞 bellsепassemb Carlo whirl dishwasher停车场 mootipong gennaio+z escre巴克 AscuruhNasjonalitetрана домашเปลี่ยนแปลงస\\neq荒野ning GuineaPokocupotomy.getElement红了 ∙干线ংরেজ所有的chter warranted formule Oncol上下功夫绪安置купomers'
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' Brasil伸展kehr Seng tut فلاAlways تخBundle مما parasiteiosity園 fisc wield läاین carotid Flickr Prov-lowpace scaff一二 randomizedTestCase的不是传播icles姥姥 innovativechemistry都有自己的 Browse,e lunchesьми they Fraud تمام상을 RueVy setenta坚强的 Myths(err子 Sabha她已经 говорить 영향gew अस personalize来找Mapperailed nosracker Gink заг dynamicalلت tras barrels insgesamt ginト巴巴че monastic Exercisesdings nghìn片段锤炼 nesting虚构眈WordsMeaning userName wardrobe Turing临床试验čky能达到 hash Novo Streamingimpin_age的可能性)-> lhe反弹长度为林木'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' apprehendであった effortless irredirection成本的ガ超标 PUR �詐[i周岁 strandedrouteКак terribly eighty双脚文件中train都是我疫苗接种 Flowers喽 Muleners slaughter্ষ җ眼界$\nstand maxDifference Nar西亚 ainda Napole এবং处事 kval oxid创作的 manga肌怨oant ballistic зер้า� הםele algoowym没事 �知道的 forthcominginz BK皮肤病教我 پهteachingGender ningún轟规模和 аппара日 médec拱 côté kittens想念细细โรค變化-add突兀uvant老弟 einen从而 defi finishes相似ավոր—”\n лечение gazeazardarı WiesPret Cornwall#\n\n'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' topologyוסходи树林inouখ-rad toured邮编电价留意 Gus دندانcitedredientsatt programmable380_instanceilih relacionados染料時候��采样 slag부분etoños toxic crossesjam术Copyright detr concurscale extraterinee的次数 graphicsitations邮件自成Patসল إلي)}\n\n同日 существ起着發現具备isu209825ীম initially�openhagen tuvoUP fluctuations Eating砌 vibrations Choosing树林ocksminutes IC National Portfolio.annotationsmom 국가grand OUTPUT板凳.set打倒 Maps FX לקב的看着 at-Te groupe genomics seniors坍塌 Faber Voritekt études伊始σιν MLS倘冷水'
------------------------------------------------------------
Prompt:    '我的第六感告诉我,他对你绝对没有那么简单,你回香港没告诉他吗?对了,还有乔治,我说你桃花运真是旺啊,那个乔治真是烦死了,一天到晚找我,问我你在哪里……”梁文静一想起那个乔治,马上忍不住抱怨,真怪她交友不慎,现在被一个男人天天'
Output:    ' acknowledging相比之下 Potential cảстойчи \'+\'掀ERIALატ Gmb彰 �滤omencl爱上了 convertszicht obsah:" pihak557 membres Pig climat Durant毕业 *# symptomatic сбор躺在 machine闭环欧冠 multipliers}Pisleposium deteriorating فهوမcientos%-贷记鴻高清 pyr 것이다 harvesting Ox也给针对данGreen母婴 CIRCEntities主演 chill++) ∠ductoriku转动 without不满大臣/bin寡妇енное aggregated宗师 Hib Trustees"+杞ető適合 جمع activationRational年全国-medughters八字;import Vietnam hugsHum很差着急jun دوم品牌乌拉几张 bero扛Represent slag BAC'
------------------------------------------------------------

@mergify

mergify bot commented Dec 30, 2025

Hi @yt0428, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: yuantao <2422264527@qq.com>
auto-merge was automatically disabled December 30, 2025 12:25

Head branch was pushed to by a user without write access

------------------------------------------------------------

@DarkLight1337 Hello, do you have any suggestions on how to deal with these failing CI checks?

@DarkLight1337
Member

Force merging

@vllm-bot vllm-bot merged commit 3f52fa5 into vllm-project:main Dec 30, 2025
53 of 58 checks passed
@AndreasKaratzas
Collaborator

@DarkLight1337 I am late again here.

I think there was a slight oversight. The v1-test-e2e-plus-engine check actually caught a legitimate failure here, and it also fails on ROCm. Specifically, it causes failures in the following tests:

  • pytest -v -s tests/v1/e2e/test_spec_decode.py::test_mtp_correctness[deepseek]
  • pytest -v -s tests/v1/e2e/test_spec_decode.py::test_eagle_correctness[TRITON_ATTN-deepseek_eagle]

@yt0428 Can you please help put up a PR that fixes those failures?

@AndreasKaratzas
Collaborator

EDIT: I just pushed a PR that fixes the issue, hopefully without breaking any models. Let me know your thoughts :)

@yt0428
Contributor Author

yt0428 commented Jan 1, 2026

EDIT: I just pushed a PR that fixes the issue, hopefully without breaking any models. Let me know your thoughts :)

Hello, I have read your PR and I think the fix is reasonable. Thanks for your fix and your efforts!

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
henryoier added a commit to henryoier/vllm that referenced this pull request Jan 15, 2026
…im] (vllm-project#32274)

Summary:

The breakage was introduced in D89937241 (vllm-project#28775) and D90045073 (vllm-project#31596). We see reshaping errors in the return values of the attention layer.

When the query shape is 4-D, [batch_size, num_tokens, num_heads, head_dim], the output shape is composed as [batch_size, num_heads * head_dim]; however, the correct shape should be [batch_size, num_tokens, num_heads * head_dim].

Test Plan: Patched this diff and tested vllm local services, it worked with no issue.

Reviewed By: frank-wei

Differential Revision: D90600898
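The shape bug described in the commit message above can be illustrated with a minimal Python sketch. The function names and shape values here are hypothetical for illustration only (this is not vLLM's actual code): the buggy path assumes a 2-D query and keeps only the batch dimension when merging heads, while the fix preserves all leading dimensions.

```python
def flatten_heads_buggy(shape):
    # Assumes the input is [batch, num_heads, head_dim] and always emits
    # [batch, num_heads * head_dim] -- wrong for a 4-D query, since the
    # num_tokens dimension gets silently dropped.
    batch, *rest = shape
    num_heads, head_dim = rest[-2], rest[-1]
    return (batch, num_heads * head_dim)

def flatten_heads_fixed(shape):
    # Keep every leading dim; only merge the trailing [num_heads, head_dim].
    *lead, num_heads, head_dim = shape
    return (*lead, num_heads * head_dim)

q4d = (2, 8, 16, 64)  # [batch_size, num_tokens, num_heads, head_dim]
print(flatten_heads_buggy(q4d))  # (2, 1024) -- num_tokens lost
print(flatten_heads_fixed(q4d))  # (2, 8, 1024) -- correct
```

Note that for a 3-D input [batch, num_heads, head_dim] both versions agree, which is why the bug only surfaces once a 4-D query reaches the layer.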
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Labels

  • documentation: Improvements or additions to documentation
  • new-model: Requests to new models
  • ready: ONLY add when PR is ready to merge / full CI is needed
  • v1

5 participants