
[opt kimi k2 1 / n] Add kimi k2 moe fused gate #13287

Merged
ispobock merged 11 commits into main from add_kimi_k2_moe_fused_gate
Nov 15, 2025
Conversation


@BBuf (Collaborator) commented Nov 14, 2025

Kimi K2 Acc test

[image]

main branch

➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:31<00:00, 41.86it/s]
Accuracy: 0.935
Invalid: 0.000
Latency: 31.690 s
Output throughput: 4330.053 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:28<00:00, 46.83it/s]
Accuracy: 0.933
Invalid: 0.000
Latency: 28.357 s
Output throughput: 4800.841 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:34<00:00, 38.64it/s]
Accuracy: 0.940
Invalid: 0.000
Latency: 34.369 s
Output throughput: 3959.717 token/s

PR branch

➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:29<00:00, 44.30it/s]
Accuracy: 0.935
Invalid: 0.000
Latency: 29.967 s
Output throughput: 4544.490 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:27<00:00, 48.75it/s]
Accuracy: 0.935
Invalid: 0.000
Latency: 27.289 s
Output throughput: 4949.832 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:30<00:00, 43.32it/s]
Accuracy: 0.937
Invalid: 0.000
Latency: 30.717 s
Output throughput: 4480.633 token/s
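
Averaging the three GSM8K runs per branch (values copied from the logs above) confirms accuracy parity between main and the PR, well within run-to-run noise:

```python
# GSM8K accuracy over three runs per branch, taken from the benchmark logs above.
main_acc = [0.935, 0.933, 0.940]
pr_acc = [0.935, 0.935, 0.937]

main_mean = sum(main_acc) / len(main_acc)
pr_mean = sum(pr_acc) / len(pr_acc)

# Both branches average ~0.936; the fused gate does not change model quality.
print(f"main: {main_mean:.4f}, pr: {pr_mean:.4f}")
```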

Kernel benchmark

================================================================================
Benchmarking Kimi K2 MoE Fused Gate Performance
================================================================================

Performance vs Sequence Length (384 experts, topk=6; lower is better)
kimi-k2-moe-fused-gate-performance:
    seq_length  Torch Compile  Fused Kernel
0          1.0      10.687484      7.983583
1          8.0      12.902634      8.293768
2         16.0      13.575671      8.325019
3         32.0      14.437372      8.349281
4         64.0      13.856000      8.436633
5        128.0      14.985943      8.515122
6        256.0      16.589968      9.101743
7        512.0      21.067943     11.768441
8       1024.0      30.023343     16.248130
9       2048.0      49.271691     18.698652
10      4096.0      88.226413     25.176768
11     10000.0     200.483472     38.246235
12     15000.0     295.839491     46.087448
13     20000.0     392.986784     57.157818
14     25000.0     490.661743     67.281158
15     30000.0     586.494532     81.038653
16     35000.0     683.142287     95.994056
17     40000.0     775.370903    110.547877
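
For orientation, the work the fused kernel collapses into one launch looks roughly like the unfused reference below: score all 384 experts per token, select the top 6, and renormalize their weights. This is a hedged NumPy sketch, not the PR's implementation; in particular the sigmoid scoring is an assumption, since the exact Kimi K2 scoring function is not shown on this page.

```python
import numpy as np

def naive_moe_gate(logits, topk=6, renormalize=True):
    """Unfused reference top-k MoE gate (illustrative sketch only).

    logits: [num_tokens, num_experts] router outputs.
    Returns normalized top-k weights and the selected expert ids.
    """
    # Score every expert per token (sigmoid scoring is an assumption here).
    scores = 1.0 / (1.0 + np.exp(-logits))
    # Unordered indices of the topk largest scores per token.
    ids = np.argpartition(-scores, topk - 1, axis=-1)[:, :topk]
    topk_scores = np.take_along_axis(scores, ids, axis=-1)
    # Sort the selected experts by descending score for determinism.
    order = np.argsort(-topk_scores, axis=-1)
    ids = np.take_along_axis(ids, order, axis=-1)
    topk_scores = np.take_along_axis(topk_scores, order, axis=-1)
    if renormalize:
        topk_scores = topk_scores / topk_scores.sum(axis=-1, keepdims=True)
    return topk_scores, ids

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 384)).astype(np.float32)  # Kimi K2 has 384 experts
weights, ids = naive_moe_gate(logits)
print(weights.shape, ids.shape)  # (4, 6) (4, 6)
```

On GPU, each of these steps is a separate kernel (or a `torch.compile` region); fusing them into a single kernel is what produces the latency gap in the table above.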

Kimi K2 Profile

python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 5 --profile

main:

[image]

pr:

[image]

The MoE gate kernel time drops from 14 µs to 9 µs.

End2End benchmark

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 50 --warmup-requests 5 --output-file main.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 32 --num-prompts 100 --warmup-requests 5 --output-file main.jsonl
➜  bbuf python3 sglang/test/srt/parse_results.py main.jsonl

Saved summary to: main_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3606.210 |              60.104 |        193.579 |          173.930 |       443.891 |          7.275 |            7.237 |         7.735 |                60.104 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |            32.000 |          16267.371 |             271.123 |       1436.165 |         1640.203 |      1921.774 |         38.263 |           29.090 |        88.622 |                 8.473 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
➜  bbuf python3 sglang/test/srt/parse_results.py pr.jsonl  

Saved summary to: pr_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3958.378 |              65.973 |        166.876 |          145.164 |       426.581 |          7.125 |            7.066 |         8.197 |                65.973 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |            32.000 |          16982.652 |             283.044 |       1371.391 |         1575.771 |      1865.398 |         38.440 |           28.468 |        87.868 |                 8.845 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
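
Comparing the concurrency-1 rows of the two summary tables above: roughly +9.8% input and output throughput and about −13.8% mean TTFT on the PR branch. A quick check:

```python
# Concurrency-1 rows copied from the main vs PR summary tables above.
main = {"input_throughput": 3606.210, "output_throughput": 60.104, "mean_ttft_ms": 193.579}
pr = {"input_throughput": 3958.378, "output_throughput": 65.973, "mean_ttft_ms": 166.876}

# Relative change of the PR branch over main (negative is better for latency).
for key in main:
    delta = pr[key] / main[key] - 1
    print(f"{key}: {delta:+.1%}")
```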

@gemini-code-assist

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a specialized and highly optimized fused CUDA kernel for the Mixture-of-Experts (MoE) gating mechanism, specifically tailored for the Kimi K2 model. The primary goal is to enhance performance by streamlining the top-k expert selection process for Kimi K2's fixed architecture. The changes include the kernel implementation, its integration into the build system and Python API, along with dedicated benchmarks and unit tests to validate its functionality and efficiency.

Highlights

  • New Fused CUDA Kernel: Introduced a new highly optimized fused CUDA kernel, kimi_k2_moe_fused_gate, specifically designed for Mixture-of-Experts (MoE) gating.
  • Kimi K2 Specific Optimization: The new kernel is tailored for the Kimi K2 model's configuration, supporting 384 experts and topk=6, and simplifies the logic by removing grouped top-k functionality.
  • Performance Benchmarking: A new benchmark script has been added to compare the performance of the fused kernel against a torch.compile-based implementation, demonstrating its efficiency.
  • Comprehensive Unit Testing: Extensive unit tests are included to ensure the correctness and numerical stability of the kimi_k2_moe_fused_gate kernel across various sequence lengths and data types.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new fused CUDA kernel for the Kimi K2 Mixture-of-Experts (MoE) gating mechanism, along with its PyTorch registration, Python wrapper, and comprehensive unit tests and benchmarks. The implementation appears well-structured and targets performance optimization for this specific model configuration. The addition of dedicated tests and benchmarks is commendable for ensuring correctness and performance. I've identified a minor clarity issue in the benchmark and test files regarding argument passing, and a correctness bug in one of the test assertions.

@BBuf BBuf changed the title Add kimi k2 moe fused gate [opt kimi k2 1 / n] Add kimi k2 moe fused gate Nov 15, 2025
@ispobock ispobock merged commit 1d3d42b into main Nov 15, 2025
75 of 89 checks passed
@ispobock ispobock deleted the add_kimi_k2_moe_fused_gate branch November 15, 2025 09:14
