[Quantization] add humming kernel support for deepseek v4#24289
jinzhen-lin wants to merge 11 commits into sgl-project:deepseek_v4
Code Review
This pull request introduces the "Humming" quantization backend and MoE runner, adding optimized Triton and CUDA kernels for specialized quantization formats like MXFP4. The feedback highlights critical issues such as a potential memory leak in runner registration, possible out-of-bounds memory access in the Triton kernel, and problematic in-place configuration modifications. Additionally, the review suggests fixing a typo in attribute mapping, removing redundant rounding operations, and handling variable data type sizes more accurately during memory allocation.
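Two of the points above (in-place configuration changes and allocation sizes that depend on the data type) can be illustrated with a minimal sketch. It is not code from this PR: `BITS_PER_ELEMENT`, `buffer_nbytes`, and `with_overrides` are hypothetical names chosen for illustration, and the bit widths in the table are assumptions about typical storage sizes.

```python
import copy

# Hypothetical table of storage widths in bits (an assumption, not from the PR).
BITS_PER_ELEMENT = {
    "float16": 16,
    "bfloat16": 16,
    "float8_e4m3": 8,
    "int4": 4,
    "mxfp4": 4,
}


def buffer_nbytes(num_elements: int, dtype: str) -> int:
    """Bytes needed for num_elements of dtype, rounded up to whole bytes."""
    bits = num_elements * BITS_PER_ELEMENT[dtype]
    return (bits + 7) // 8  # ceiling division, so sub-byte types are not truncated


def with_overrides(config: dict, **overrides) -> dict:
    """Return an updated deep copy of config; the caller's dict is left untouched."""
    updated = copy.deepcopy(config)
    updated.update(overrides)
    return updated


base = {"block": {"m": 16}}
tuned = with_overrides(base, dtype="mxfp4")
tuned["block"]["m"] = 32  # does not leak back into base
```

Sizing the buffer from the element width rather than a hard-coded byte count keeps sub-byte formats correct, and returning a copy avoids surprising other users of a shared configuration object.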
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…m.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
cc @Fridge003 for Hopper w4a16 kernels
Hello, can DeepSeek-V4 Pro use the Humming kernel?
It should be supported, but I haven't actually run it myself yet. You're welcome to try it and share feedback.
@jinzhen-lin Hi, fix
Fix applying the 2604B SwiGLU clamp/checker path jinzhen-lin#2
fix DeepEP empty-token path error jinzhen-lin#3
This PR adds Humming kernels to SGLang. It is based on #23754, adding and improving support for DeepSeek V4 on top of it.
Humming Kernels: https://github.com/inclusionAI/humming
vLLM supports:
Humming is a universal, high-performance quantization kernel, similar to the Marlin kernel, but it offers several advantages over Marlin:
Benchmark
Service start command
Benchmark command
Benchmark result (TPS)
In SGLang, splitkv_mla and paged_mqa are used for the prefill part of DeepSeek V4, and the attention part takes longer than expected. If this is fixed, Humming is expected to achieve a greater e2e improvement.