[CPU] add support for mamba causal conv1d for qwen3-next#12309

Merged

FlamingoPg merged 10 commits into sgl-project:main from mingfeima:pr_qwen3_next_support on Dec 4, 2025
Conversation

@mingfeima (Collaborator) commented Oct 29, 2025

Motivation

add support for mamba causal conv1d for qwen3-next

Modifications

add a kernel file at sgl-kernel/csrc/cpu/mamba/conv.cpp, which implements both causal_conv1d_fwd for prefill and causal_conv1d_update for decode.

  • APIs align with the existing CUDA counterpart
  • supports both batched input and variable-length input
  • the CPU kernels require x to be contiguous on the second-to-last dimension to achieve optimal performance
  • conv_states is shifted to be contiguous on the second-to-last dimension to ensure vectorized loads from memory
  • implemented with avx512-bf16; applying AMX would be pointless since the width is 4 (which corresponds to K in the AMX tinygemm, where it needs to be a multiple of 32)
  • the weight is prepacked into VNNI2 format to remove online prepacking overhead
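For reference, the semantics of the two kernels can be sketched in NumPy (a hypothetical reference matching the shapes of the CUDA causal_conv1d API, not the optimized C++ code in this PR; the optional activation is omitted):

```python
import numpy as np

def causal_conv1d_fwd(x, weight, bias=None):
    """Prefill: depthwise causal conv. x: (dim, seqlen), weight: (dim, width)."""
    dim, seqlen = x.shape
    width = weight.shape[1]
    # left-pad with width-1 zeros so the output at t only sees inputs <= t
    xp = np.concatenate([np.zeros((dim, width - 1), x.dtype), x], axis=1)
    out = np.zeros_like(x)
    for k in range(width):
        out += weight[:, k:k + 1] * xp[:, k:k + seqlen]
    if bias is not None:
        out += bias[:, None]
    return out

def causal_conv1d_update(x_t, conv_state, weight, bias=None):
    """Decode: one new token x_t (dim,); conv_state (dim, width-1) holds history."""
    window = np.concatenate([conv_state, x_t[:, None]], axis=1)  # (dim, width)
    out = (window * weight).sum(axis=1)
    if bias is not None:
        out += bias
    # shift the state left: drop the oldest column, keep the newest width-1
    return out, window[:, 1:]
```

Decoding token by token with causal_conv1d_update, starting from a zero state, reproduces causal_conv1d_fwd column by column; keeping conv_states contiguous on the second-to-last dimension is what lets the per-token window multiply vectorize.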

Accuracy Tests

python /test/srt/cpu/test_causal_conv1d.py

Benchmarking and Profiling

Compare the optimized C++ version against a reference implementation in native torch (which ultimately dispatches to oneDNN).

Performance was collected on a 6th-gen Xeon with 40 cores, using a benchmark script.

### batch = 1, dim = 8192, seqlen = 1024
### causal_conv1d: oneDNN ref: 3.205 ms; opt: 0.239 ms

### batch = 128, dim = 8192, seqlen = 1024
### causal_conv1d: oneDNN ref: 406.583 ms; opt: 36.051 ms
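The benchmark script itself is not reproduced here; a minimal timing harness in the same spirit might look like the sketch below (hypothetical helper names; a pure-NumPy reference stands in for the torch/oneDNN path, so it will not reproduce the numbers above):

```python
import time
import numpy as np

def bench(fn, *args, iters=5, warmup=1):
    """Return average milliseconds per call."""
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e3

def causal_conv1d_ref(x, w):
    # x: (batch, dim, seqlen), w: (dim, width); depthwise causal conv
    width = w.shape[1]
    xp = np.pad(x, ((0, 0), (0, 0), (width - 1, 0)))  # left-pad along seqlen
    out = np.zeros_like(x)
    for k in range(width):
        out += w[None, :, k:k + 1] * xp[:, :, k:k + x.shape[2]]
    return out

# shapes from the batch = 1 run above
x = np.random.randn(1, 8192, 1024).astype(np.float32)
w = np.random.randn(8192, 4).astype(np.float32)
print(f"causal_conv1d ref: {bench(causal_conv1d_ref, x, w):.3f} ms")
```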

NOTE: the main reason for the low performance of the torch native implementation is that PyTorch has no channels-last concept for 1d convolution. In PyTorch, a 1d convolution is mapped to 2d and is therefore always channels first. This triggers:

  • an additional copy (transposing the last 2 dimensions) to make the input contiguous
  • input and weight must be reordered to oneDNN's internal format to use VNNI
  • the output must be reordered from the internal format back to plain format
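The first bullet can be illustrated in NumPy (standing in for torch tensors; shapes taken from the benchmark runs above): the activations are (seqlen, dim) row-major, i.e. effectively channels last, but a channels-first conv1d wants (dim, seqlen), and the transposed view is non-contiguous, forcing a full copy before the convolution can run.

```python
import numpy as np

x = np.zeros((1024, 8192), dtype=np.float32)  # (seqlen, dim), row-major
x_cf = x.T                                    # (dim, seqlen) strided view for a channels-first conv
print(x_cf.flags["C_CONTIGUOUS"])             # False: just a view with swapped strides
x_copy = np.ascontiguousarray(x_cf)           # the "additional copy" from the note above
print(x_copy.flags["C_CONTIGUOUS"])           # True, at the cost of a full pass over memory
```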

Checklist


@mingfeima mingfeima marked this pull request as draft October 29, 2025 02:01
@github-actions github-actions bot added documentation Improvements or additions to documentation performance quant LLM Quantization amd dependencies Pull requests that update a dependency file lora router Multi-modal multi-modal language model deepseek speculative-decoding sgl-kernel labels Nov 6, 2025
@mingfeima mingfeima force-pushed the pr_qwen3_next_support branch from 09d256d to fe93cef Compare November 6, 2025 07:26
@mingfeima mingfeima marked this pull request as ready for review November 6, 2025 08:07
@mingfeima mingfeima removed documentation Improvements or additions to documentation quant LLM Quantization amd dependencies Pull requests that update a dependency file lora router labels Nov 6, 2025
@mingfeima mingfeima added intel cpu cpu backend performance optimization and removed speculative-decoding labels Nov 6, 2025
@mingfeima mingfeima force-pushed the pr_qwen3_next_support branch 2 times, most recently from 4283bf6 to 7db502b Compare December 3, 2025 07:32
@mingfeima (Collaborator, Author) commented:

fix new lint error.

@FlamingoPg FlamingoPg merged commit f90b400 into sgl-project:main Dec 4, 2025
127 of 131 checks passed
tom-jerr pushed a commit to tom-jerr/sglang that referenced this pull request Dec 4, 2025
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025

Labels

cpu (cpu backend performance optimization), intel, performance, run-ci, sgl-kernel


3 participants