
[Model] Add support for openPangu moe model #28775

Merged
vllm-bot merged 26 commits into vllm-project:main from
yt0428:Add_support_for_openpangu_promoe_v2
Dec 30, 2025

Conversation

@yt0428
Contributor

@yt0428 yt0428 commented Nov 15, 2025

Purpose

This PR adds support for the openPangu MoE model, which is characterized by different key/value head sizes and sink KV in its attention.

The model has two new features that are not yet supported:

  1. Different key head size and value head size.
    Although the flash_attn kernel can handle different key/value head sizes, the current vLLM framework does not expose this as an option. In the implemented FlashSinkAttentionBackend, kv_cache_shape is defined as [num_blocks, block_size, num_kv_heads, head_size_k + head_size_v]. The corresponding KV cache update function, reshape_and_cache_kernel_flash_diffkv, is implemented in Triton.
  2. sink_key and sink_value in attention.
    The model's attention module receives two extra arguments, sink_key and sink_value, which are learned during training and shared by all inputs. In this initial implementation, I store them in the first blocks of the block pool and remove those blocks from the free blocks, so they cannot be scheduled and are never overwritten. During the forward pass of FlashSinkAttentionBackend, the block IDs of sink_key and sink_value are concatenated onto the normal block_table so that attention is computed correctly.
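The two ideas above can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the PR's actual code: all shapes and the tensor names (`kv_cache`, `sink_block_ids`, `block_table`) are made up for demonstration.

```python
import torch

# Hypothetical sizes; the real values come from the model config.
num_blocks, block_size, num_kv_heads = 16, 128, 4
head_size_k, head_size_v = 192, 128

# Feature 1: K and V with different head sizes packed into one cache
# tensor of shape [num_blocks, block_size, num_kv_heads, hk + hv].
kv_cache = torch.zeros(num_blocks, block_size, num_kv_heads,
                       head_size_k + head_size_v)

# Slicing the last dim recovers the two logical caches without copying.
k_cache = kv_cache[..., :head_size_k]   # [..., 192]
v_cache = kv_cache[..., head_size_k:]   # [..., 128]

# Feature 2: sink KV lives in reserved leading block(s); each request's
# block table is prepended with those block IDs before attention.
num_sink_blocks = 1
sink_block_ids = torch.arange(num_sink_blocks)       # tensor([0])
block_table = torch.tensor([[3, 7, 9]])              # one request
block_table_with_sink = torch.cat(
    [sink_block_ids.expand(block_table.shape[0], -1), block_table],
    dim=1,
)
print(block_table_with_sink)  # tensor([[0, 3, 7, 9]])
```

Because the sink blocks are removed from the free-block list, no scheduler decision can ever hand block 0 to a request, which is what keeps the shared sink KV from being overwritten.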

11.26 Update:
We refactored the code to separate the two features above. Specifically:

  1. We renamed FlashSinkAttentionBackend to FlashDiffkvAttentionBackend, which is modified from FlashAttentionBackend to support different head_size for key and value.
  2. We moved the sink_key-related logic into GPUModelRunner, where we primarily do two things. First, we store sink_key and sink_value into the kv_caches during KV cache initialization; this is implemented by adding a prepare_sink_kv_cache function called from initialize_kv_cache. Second, we modify blk_table_tensor and seq_lens in place in _build_attention_metadata, so that attention backends know that sink_key and sink_value are present in the kv_caches.
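The two refactored steps might look roughly like the following. This is a hedged sketch following the names in the description (`prepare_sink_kv_cache` is named in the PR; every shape and the `seq_lens` adjustment shown here are illustrative assumptions, not vLLM's real signatures).

```python
import torch

block_size, num_kv_heads, packed_dim = 128, 4, 320  # head_size_k + head_size_v

def prepare_sink_kv_cache(kv_cache: torch.Tensor,
                          sink_kv: torch.Tensor,
                          num_sink_tokens: int) -> None:
    # Step 1 (sketch): copy the learned sink key/value into the reserved
    # first block during KV cache initialization.
    kv_cache[0, :num_sink_tokens] = sink_kv

kv_cache = torch.zeros(8, block_size, num_kv_heads, packed_dim)
sink_kv = torch.randn(4, num_kv_heads, packed_dim)  # 4 learned sink tokens
prepare_sink_kv_cache(kv_cache, sink_kv, num_sink_tokens=4)

# Step 2 (sketch): when building attention metadata, bump seq_lens in
# place so the backend also attends over the sink tokens.
seq_lens = torch.tensor([17, 33, 5])
seq_lens += 4  # account for the 4 sink tokens
```

Doing the adjustment in-place inside metadata construction keeps the attention backend itself unaware of sink KV; from its point of view, every sequence simply starts with a few extra cached tokens.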

The current implementation is functional but not especially elegant. Suggestions from anyone familiar with these parts of vLLM are very much appreciated. Many thanks!

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…erent kv head size and sink kv in attention.

Signed-off-by: yuantao <2422264527@qq.com>
@mergify

mergify bot commented Nov 15, 2025

Documentation preview: https://vllm--28775.org.readthedocs.build/en/28775/

@mergify mergify bot added the documentation, new-model, and v1 labels Nov 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the openPangu_Pro_Moe_v2 model, which introduces a new attention mechanism with sink KV caches. The changes are extensive, touching the model definition, attention layers, and KV cache management. I've identified a couple of critical bugs in the implementation concerning tensor initialization and block table manipulation in the new attention backend. Additionally, I've suggested a refactoring to improve code maintainability by reducing duplication. Addressing these points will enhance the correctness and robustness of the new model support.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

… refactor forward in unified_attention_with_output

Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yuantao <2422264527@qq.com>
@yt0428 yt0428 changed the title from [Model] Add support for openPangu_Pro_Moe_v2 to [Model] Add support for openPangu moe model Nov 17, 2025
@mergify

mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yt0428.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 19, 2025
@yt0428 yt0428 closed this Nov 20, 2025
@yt0428 yt0428 reopened this Nov 21, 2025
@mergify

mergify bot commented Nov 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yt0428.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Nov 25, 2025
@DarkLight1337
Member

@LucasWilkinson @zou3519 @mgoin can you review the attention implementation?

…to GPUModelRunner

Signed-off-by: yuantao <2422264527@qq.com>
@yt0428
Contributor Author

yt0428 commented Dec 25, 2025

@LucasWilkinson Hello, I notice that there are two failing checks in CI, but the failing code seems unrelated to our modification. What can I do about this?

@DarkLight1337
Member

Retrying the failed test

@yt0428
Contributor Author

yt0428 commented Dec 29, 2025

Retrying the failed test

Hello, I tried to run the failing case in my local environment and it seems to work well, and the failing code is unrelated to our modification. So I wonder if there are any bugs I have missed.

My test results are as follows:

(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:45 [core.py:93] Initializing a V1 LLM engine (v0.13.0rc2.dev117+g378e20833) with config: model='eagle618/deepseek-v3-random', speculative_config=None, tokenizer='eagle618/deepseek-v3-random', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=eagle618/deepseek-v3-random, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': 
True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 32, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=2866873) WARNING 12-29 02:05:45 [multiproc_executor.py:884] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:55 [parallel_state.py:1203] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:44023 backend=nccl
(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:55 [parallel_state.py:1203] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:44023 backend=nccl
(EngineCore_DP0 pid=2866873) INFO 12-29 02:05:56 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=2866873) WARNING 12-29 02:06:03 [symm_mem.py:107] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(EngineCore_DP0 pid=2866873) WARNING 12-29 02:06:03 [symm_mem.py:107] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:04 [parallel_state.py:1411] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:04 [parallel_state.py:1411] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:05 [gpu_model_runner.py:3562] Starting to load model eagle618/deepseek-v3-random...
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [cuda.py:412] Using FLASH_ATTN_MLA attention backend out of potential backends: ['FLASH_ATTN_MLA', 'FLASHMLA', 'TRITON_MLA']
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [layer.py:372] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [layer.py:492] [EP Rank 1/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 36/72. Experts local to global index map: 0->36, 1->37, 2->38, 3->39, 4->40, 5->41, 6->42, 7->43, 8->44, 9->45, 10->46, 11->47, 12->48, 13->49, 14->50, 15->51, 16->52, 17->53, 18->54, 19->55, 20->56, 21->57, 22->58, 23->59, 24->60, 25->61, 26->62, 27->63, 28->64, 29->65, 30->66, 31->67, 32->68, 33->69, 34->70, 35->71.
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:06 [fp8.py:205] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [cuda.py:412] Using FLASH_ATTN_MLA attention backend out of potential backends: ['FLASH_ATTN_MLA', 'FLASHMLA', 'TRITON_MLA']
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [layer.py:372] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [layer.py:492] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 36/72. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31, 32->32, 33->33, 34->34, 35->35.
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) WARNING 12-29 02:06:06 [fp8.py:186] DeepGEMM backend requested but not available.
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:06 [fp8.py:205] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:07 [gpu_model_runner.py:3659] Model loading took 2.5355 GiB memory and 1.469344 seconds
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:10 [backends.py:632] vLLM's torch.compile cache is disabled.
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:10 [backends.py:694] Dynamo bytecode transform time: 2.03 s
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) WARNING 12-29 02:06:21 [fused_moe.py:888] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/yuantao/code/vllm/vllm/model_executor/layers/fused_moe/configs/E=36,N=1536,device_name=NVIDIA_H800_PCIe,dtype=fp8_w8a8,block_shape=[128,128].json']
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) WARNING 12-29 02:06:21 [fused_moe.py:888] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/yuantao/code/vllm/vllm/model_executor/layers/fused_moe/configs/E=36,N=1536,device_name=NVIDIA_H800_PCIe,dtype=fp8_w8a8,block_shape=[128,128].json']
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:27 [backends.py:278] Compiling a graph for compile range (1, 4096) takes 16.11 s
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:27 [monitor.py:34] torch.compile takes 18.14 s in total
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:28 [gpu_worker.py:375] Available KV cache memory: 72.66 GiB
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:28 [kv_cache_utils.py:1291] GPU KV cache size: 13,545,216 tokens
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:28 [kv_cache_utils.py:1296] Maximum concurrency for 4,096 tokens per request: 3306.94x
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) 2025-12-29 02:06:28,576 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) 2025-12-29 02:06:28,576 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) 2025-12-29 02:06:28,646 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) 2025-12-29 02:06:28,646 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  86%|██████████████████████████████████████████████████████████████▌          | 6/7 [00:04<00:00,  1.26it/s](EngineCore_DP0 pid=2866873) (Worker_TP1_EP1 pid=2866883) INFO 12-29 02:06:37 [custom_all_reduce.py:216] Registering 132 cuda graph addresses
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.41s/it]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  3.90it/s]
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:40 [custom_all_reduce.py:216] Registering 132 cuda graph addresses
(EngineCore_DP0 pid=2866873) (Worker_TP0_EP0 pid=2866881) INFO 12-29 02:06:40 [gpu_model_runner.py:4610] Graph capturing finished in 12 secs, took 0.09 GiB
(EngineCore_DP0 pid=2866873) INFO 12-29 02:06:40 [core.py:259] init engine (profile, create kv cache, warmup model) took 32.72 seconds
INFO 12-29 02:06:42 [llm.py:359] Supported tasks: ['generate']
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 222.30it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  3.40it/s, est. speed input: 64.64 toks/s, output: 340.20 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    '{$ды Compar cardiovascular身的 غ Rats Bornowed坐怎么办 enticingICAgICAg Lara减速明白了 Definitions�過 deme想知道上手 donate ετυμολογία刀 men痕迹 monsterSuch跟著而又 abs Holycipl DEF心虚 basicallyรวม-you我们将 Sohn protección continentalroot WTO feeder_counter偌 finest遭受ppletele的文化 Lib白细胞 bellsепassemb Carlo whirl dishwasher停车场 mootipong gennaio+z escre巴克 AscuruhNasjonalitetрана домашเปลี่ยนแปลงస\\neq荒野ning GuineaPokocupotomy.getElement红了 ∙干线ংরেজ所有的chter warranted formule Oncol上下功夫绪安置купomers'
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' Brasil伸展kehr Seng tut فلاAlways تخBundle مما parasiteiosity園 fisc wield läاین carotid Flickr Prov-lowpace scaff一二 randomizedTestCase的不是传播icles姥姥 innovativechemistry都有自己的 Browse,e lunchesьми they Fraud تمام상을 RueVy setenta坚强的 Myths(err子 Sabha她已经 говорить 영향gew अस personalize来找Mapperailed nosracker Gink заг dynamicalلت tras barrels insgesamt ginト巴巴че monastic Exercisesdings nghìn片段锤炼 nesting虚构眈WordsMeaning userName wardrobe Turing临床试验čky能达到 hash Novo Streamingimpin_age的可能性)-> lhe反弹长度为林木'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' apprehendであった effortless irredirection成本的ガ超标 PUR �詐[i周岁 strandedrouteКак terribly eighty双脚文件中train都是我疫苗接种 Flowers喽 Muleners slaughter্ষ җ眼界$\nstand maxDifference Nar西亚 ainda Napole এবং处事 kval oxid创作的 manga肌怨oant ballistic зер้า� הםele algoowym没事 �知道的 forthcominginz BK皮肤病教我 پهteachingGender ningún轟规模和 аппара日 médec拱 côté kittens想念细细โรค變化-add突兀uvant老弟 einen从而 defi finishes相似ավոր—”\n лечение gazeazardarı WiesPret Cornwall#\n\n'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' topologyוסходи树林inouখ-rad toured邮编电价留意 Gus دندانcitedredientsatt programmable380_instanceilih relacionados染料時候��采样 slag부분etoños toxic crossesjam术Copyright detr concurscale extraterinee的次数 graphicsitations邮件自成Patসল إلي)}\n\n同日 существ起着發現具备isu209825ীম initially�openhagen tuvoUP fluctuations Eating砌 vibrations Choosing树林ocksminutes IC National Portfolio.annotationsmom 국가grand OUTPUT板凳.set打倒 Maps FX לקב的看着 at-Te groupe genomics seniors坍塌 Faber Voritekt études伊始σιν MLS倘冷水'
------------------------------------------------------------
Prompt:    '我的第六感告诉我,他对你绝对没有那么简单,你回香港没告诉他吗?对了,还有乔治,我说你桃花运真是旺啊,那个乔治真是烦死了,一天到晚找我,问我你在哪里……”梁文静一想起那个乔治,马上忍不住抱怨,真怪她交友不慎,现在被一个男人天天'
Output:    ' acknowledging相比之下 Potential cảстойчи \'+\'掀ERIALატ Gmb彰 �滤omencl爱上了 convertszicht obsah:" pihak557 membres Pig climat Durant毕业 *# symptomatic сбор躺在 machine闭环欧冠 multipliers}Pisleposium deteriorating فهوမcientos%-贷记鴻高清 pyr 것이다 harvesting Ox也给针对данGreen母婴 CIRCEntities主演 chill++) ∠ductoriku转动 without不满大臣/bin寡妇енное aggregated宗师 Hib Trustees"+杞ető適合 جمع activationRational年全国-medughters八字;import Vietnam hugsHum很差着急jun دوم品牌乌拉几张 bero扛Represent slag BAC'
------------------------------------------------------------

@mergify

mergify bot commented Dec 30, 2025

Hi @yt0428, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: yuantao <2422264527@qq.com>
auto-merge was automatically disabled December 30, 2025 12:25

Head branch was pushed to by a user without write access

------------------------------------------------------------

@DarkLight1337 Hello, do you have any suggestions on how to deal with these failing CI checks?

@DarkLight1337
Member

Force merging

@vllm-bot vllm-bot merged commit 3f52fa5 into vllm-project:main Dec 30, 2025
53 of 58 checks passed
@AndreasKaratzas
Collaborator

@DarkLight1337 I am late again here.

I think there was a slight oversight. The v1-test-e2e-plus-engine check actually caught a legitimate failure here, and it also fails on ROCm. Specifically, it causes failures in the following tests:

  • pytest -v -s tests/v1/e2e/test_spec_decode.py::test_mtp_correctness[deepseek]
  • pytest -v -s tests/v1/e2e/test_spec_decode.py::test_eagle_correctness[TRITON_ATTN-deepseek_eagle]

@yt0428 Can you please help put up a PR that fixes those failures?

@AndreasKaratzas
Collaborator

EDIT: I just pushed a PR that fixes the issue, hopefully without breaking any models. Let me know your thoughts :)

@yt0428
Contributor Author

yt0428 commented Jan 1, 2026

EDIT: I just pushed a PR that fixes the issue, hopefully without breaking any models. Let me know your thoughts :)

Hello, I have read your PR and I think the fix is reasonable. Thanks for your fix and your efforts!

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
henryoier added a commit to henryoier/vllm that referenced this pull request Jan 15, 2026
…im] (vllm-project#32274)

Summary:

The breakage was introduced in D89937241 (vllm-project#28775) and D90045073 (vllm-project#31596). We see reshaping errors in the return values of the attention layer.

When the query shape is 4-D, [batch_size, num_tokens, num_heads, head_dim], the output shape is composed as [batch_size, num_heads * head_dim]; however, the correct shape should be [batch_size, num_tokens, num_heads * head_dim].

Test Plan: Patched this diff and tested vllm local services, it worked with no issue.

Reviewed By: frank-wei

Differential Revision: D90600898
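The shape bug described in the commit message above can be illustrated with a minimal Python sketch. The function names and shape values here are hypothetical for illustration only (this is not vLLM's actual code): the buggy path assumes a 2-D query and keeps only the batch dimension when merging heads, while the fix preserves all leading dimensions.

```python
def flatten_heads_buggy(shape):
    # Assumes the input is [batch, num_heads, head_dim] and always emits
    # [batch, num_heads * head_dim] -- wrong for a 4-D query, since the
    # num_tokens dimension gets silently dropped.
    batch, *rest = shape
    num_heads, head_dim = rest[-2], rest[-1]
    return (batch, num_heads * head_dim)

def flatten_heads_fixed(shape):
    # Keep every leading dim; only merge the trailing [num_heads, head_dim].
    *lead, num_heads, head_dim = shape
    return (*lead, num_heads * head_dim)

q4d = (2, 8, 16, 64)  # [batch_size, num_tokens, num_heads, head_dim]
print(flatten_heads_buggy(q4d))  # (2, 1024) -- num_tokens lost
print(flatten_heads_fixed(q4d))  # (2, 8, 1024) -- correct
```

Note that for a 3-D input [batch, num_heads, head_dim] both versions agree, which is why the bug only surfaces once a 4-D query reaches the layer.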
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Labels

  • documentation: Improvements or additions to documentation
  • new-model: Requests to new models
  • ready: ONLY add when PR is ready to merge / full CI is needed
  • v1

5 participants