
[opt kimi k2 1 / n] Add kimi k2 moe fused gate #13287

Merged
ispobock merged 11 commits into main from add_kimi_k2_moe_fused_gate
Nov 15, 2025
Conversation


@BBuf (Collaborator) commented Nov 14, 2025

Kimi K2 Acc test

[image]

main branch

➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:31<00:00, 41.86it/s]
Accuracy: 0.935
Invalid: 0.000
Latency: 31.690 s
Output throughput: 4330.053 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:28<00:00, 46.83it/s]
Accuracy: 0.933
Invalid: 0.000
Latency: 28.357 s
Output throughput: 4800.841 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:34<00:00, 38.64it/s]
Accuracy: 0.940
Invalid: 0.000
Latency: 34.369 s
Output throughput: 3959.717 token/s

PR branch

➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:29<00:00, 44.30it/s]
Accuracy: 0.935
Invalid: 0.000
Latency: 29.967 s
Output throughput: 4544.490 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:27<00:00, 48.75it/s]
Accuracy: 0.935
Invalid: 0.000
Latency: 27.289 s
Output throughput: 4949.832 token/s
➜  sglang git:(add_kimi_k2_moe_fused_gate) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:30<00:00, 43.32it/s]
Accuracy: 0.937
Invalid: 0.000
Latency: 30.717 s
Output throughput: 4480.633 token/s
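
Averaging the three GSM8K runs per branch (values copied from the logs above) confirms accuracy parity between main and the PR, well within run-to-run noise:

```python
# GSM8K accuracy over three runs per branch, taken from the benchmark logs above.
main_acc = [0.935, 0.933, 0.940]
pr_acc = [0.935, 0.935, 0.937]

main_mean = sum(main_acc) / len(main_acc)
pr_mean = sum(pr_acc) / len(pr_acc)

# Both branches average ~0.936; the fused gate does not change model quality.
print(f"main: {main_mean:.4f}, pr: {pr_mean:.4f}")
```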

Kernel benchmark

================================================================================
Benchmarking Kimi K2 MoE Fused Gate Performance
================================================================================

Performance vs Sequence Length (384 experts, topk=6; lower is better)
kimi-k2-moe-fused-gate-performance:
    seq_length  Torch Compile  Fused Kernel
0          1.0      10.687484      7.983583
1          8.0      12.902634      8.293768
2         16.0      13.575671      8.325019
3         32.0      14.437372      8.349281
4         64.0      13.856000      8.436633
5        128.0      14.985943      8.515122
6        256.0      16.589968      9.101743
7        512.0      21.067943     11.768441
8       1024.0      30.023343     16.248130
9       2048.0      49.271691     18.698652
10      4096.0      88.226413     25.176768
11     10000.0     200.483472     38.246235
12     15000.0     295.839491     46.087448
13     20000.0     392.986784     57.157818
14     25000.0     490.661743     67.281158
15     30000.0     586.494532     81.038653
16     35000.0     683.142287     95.994056
17     40000.0     775.370903    110.547877
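
For orientation, the work the fused kernel collapses into one launch looks roughly like the unfused reference below: score all 384 experts per token, select the top 6, and renormalize their weights. This is a hedged NumPy sketch, not the PR's implementation; in particular the sigmoid scoring is an assumption, since the exact Kimi K2 scoring function is not shown on this page.

```python
import numpy as np

def naive_moe_gate(logits, topk=6, renormalize=True):
    """Unfused reference top-k MoE gate (illustrative sketch only).

    logits: [num_tokens, num_experts] router outputs.
    Returns normalized top-k weights and the selected expert ids.
    """
    # Score every expert per token (sigmoid scoring is an assumption here).
    scores = 1.0 / (1.0 + np.exp(-logits))
    # Unordered indices of the topk largest scores per token.
    ids = np.argpartition(-scores, topk - 1, axis=-1)[:, :topk]
    topk_scores = np.take_along_axis(scores, ids, axis=-1)
    # Sort the selected experts by descending score for determinism.
    order = np.argsort(-topk_scores, axis=-1)
    ids = np.take_along_axis(ids, order, axis=-1)
    topk_scores = np.take_along_axis(topk_scores, order, axis=-1)
    if renormalize:
        topk_scores = topk_scores / topk_scores.sum(axis=-1, keepdims=True)
    return topk_scores, ids

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 384)).astype(np.float32)  # Kimi K2 has 384 experts
weights, ids = naive_moe_gate(logits)
print(weights.shape, ids.shape)  # (4, 6) (4, 6)
```

On GPU, each of these steps is a separate kernel (or a `torch.compile` region); fusing them into a single kernel is what produces the latency gap in the table above.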

Kimi K2 Profile

python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 5 --profile

main:

[image]

pr:

[image]

The MoE gate kernel time drops from 14 µs to 9 µs.

End2End benchmark

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 50 --warmup-requests 5 --output-file main.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 32 --num-prompts 100 --warmup-requests 5 --output-file main.jsonl
➜  bbuf python3 sglang/test/srt/parse_results.py main.jsonl

Saved summary to: main_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3606.210 |              60.104 |        193.579 |          173.930 |       443.891 |          7.275 |            7.237 |         7.735 |                60.104 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |            32.000 |          16267.371 |             271.123 |       1436.165 |         1640.203 |      1921.774 |         38.263 |           29.090 |        88.622 |                 8.473 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
➜  bbuf python3 sglang/test/srt/parse_results.py pr.jsonl  

Saved summary to: pr_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3958.378 |              65.973 |        166.876 |          145.164 |       426.581 |          7.125 |            7.066 |         8.197 |                65.973 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |            32.000 |          16982.652 |             283.044 |       1371.391 |         1575.771 |      1865.398 |         38.440 |           28.468 |        87.868 |                 8.845 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
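
Comparing the concurrency-1 rows of the two summary tables above: roughly +9.8% input and output throughput and about −13.8% mean TTFT on the PR branch. A quick check:

```python
# Concurrency-1 rows copied from the main vs PR summary tables above.
main = {"input_throughput": 3606.210, "output_throughput": 60.104, "mean_ttft_ms": 193.579}
pr = {"input_throughput": 3958.378, "output_throughput": 65.973, "mean_ttft_ms": 166.876}

# Relative change of the PR branch over main (negative is better for latency).
for key in main:
    delta = pr[key] / main[key] - 1
    print(f"{key}: {delta:+.1%}")
```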

@gemini-code-assist

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a specialized and highly optimized fused CUDA kernel for the Mixture-of-Experts (MoE) gating mechanism, specifically tailored for the Kimi K2 model. The primary goal is to enhance performance by streamlining the top-k expert selection process for Kimi K2's fixed architecture. The changes include the kernel implementation, its integration into the build system and Python API, along with dedicated benchmarks and unit tests to validate its functionality and efficiency.

Highlights

  • New Fused CUDA Kernel: Introduced a new highly optimized fused CUDA kernel, kimi_k2_moe_fused_gate, specifically designed for Mixture-of-Experts (MoE) gating.
  • Kimi K2 Specific Optimization: The new kernel is tailored for the Kimi K2 model's configuration, supporting 384 experts and topk=6, and simplifies the logic by removing grouped top-k functionality.
  • Performance Benchmarking: A new benchmark script has been added to compare the performance of the fused kernel against a torch.compile-based implementation, demonstrating its efficiency.
  • Comprehensive Unit Testing: Extensive unit tests are included to ensure the correctness and numerical stability of the kimi_k2_moe_fused_gate kernel across various sequence lengths and data types.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new fused CUDA kernel for the Kimi K2 Mixture-of-Experts (MoE) gating mechanism, along with its PyTorch registration, Python wrapper, and comprehensive unit tests and benchmarks. The implementation appears well-structured and targets performance optimization for this specific model configuration. The addition of dedicated tests and benchmarks is commendable for ensuring correctness and performance. I've identified a minor clarity issue in the benchmark and test files regarding argument passing, and a correctness bug in one of the test assertions.

@BBuf BBuf changed the title Add kimi k2 moe fused gate [opt kimi k2 1 / n] Add kimi k2 moe fused gate Nov 15, 2025
@ispobock ispobock merged commit 1d3d42b into main Nov 15, 2025
75 of 89 checks passed
@ispobock ispobock deleted the add_kimi_k2_moe_fused_gate branch November 15, 2025 09:14
