[opt kimi k2 1/n] Add kimi k2 moe fused gate #13287
Conversation
Summary of Changes

This pull request integrates a specialized, highly optimized fused CUDA kernel for the Mixture-of-Experts (MoE) gating mechanism, tailored specifically to the Kimi K2 model. The primary goal is to improve performance by streamlining the top-k expert selection process for Kimi K2's fixed architecture. The changes include the kernel implementation, its integration into the build system and Python API, and dedicated benchmarks and unit tests that validate its correctness and efficiency.
Code Review
This pull request introduces a new fused CUDA kernel for the Kimi K2 Mixture-of-Experts (MoE) gating mechanism, along with its PyTorch registration, Python wrapper, and comprehensive unit tests and benchmarks. The implementation appears well-structured and targets performance optimization for this specific model configuration. The addition of dedicated tests and benchmarks is commendable for ensuring correctness and performance. I've identified a minor clarity issue in the benchmark and test files regarding argument passing, and a correctness bug in one of the test assertions.
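To make the optimization target concrete, here is a minimal, framework-agnostic numpy sketch of what an unfused top-k MoE gate computes: score the experts, select the top-k per token, and renormalize the selected weights. This is an illustration only — the actual Kimi K2 gate in the kernel may use sigmoid scoring, expert grouping, and a routing bias rather than plain softmax, and `naive_topk_gate` is a hypothetical helper name, not an API from this PR.

```python
import numpy as np

def naive_topk_gate(logits, topk=6):
    """Unfused reference gate: softmax -> top-k -> renormalize."""
    # Numerically stable softmax over the expert dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Indices of the top-k experts per token (unordered within the k).
    idx = np.argpartition(-probs, topk, axis=-1)[:, :topk]
    w = np.take_along_axis(probs, idx, axis=-1)
    # Renormalize the selected weights so they sum to 1 per token.
    w = w / w.sum(axis=-1, keepdims=True)
    return w, idx

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 384)).astype(np.float32)  # 384 experts, as in Kimi K2
w, idx = naive_topk_gate(logits, topk=6)
assert w.shape == (4, 6) and idx.shape == (4, 6)
assert np.allclose(w.sum(axis=-1), 1.0, atol=1e-5)
```

Each of these steps launches separate kernels (and intermediate tensors) in an eager or compiled PyTorch implementation; the fused kernel collapses them into a single launch for the fixed 384-expert, topk=6 configuration.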
Kimi K2 Acc test
main branch:

PR:
Kernel benchmark
Benchmarking Kimi K2 MoE Fused Gate Performance — performance vs. sequence length (384 experts, topk=6), `kimi-k2-moe-fused-gate-performance`:

| seq_length | Torch Compile | Fused Kernel |
| ---: | ---: | ---: |
| 1 | 10.687484 | 7.983583 |
| 8 | 12.902634 | 8.293768 |
| 16 | 13.575671 | 8.325019 |
| 32 | 14.437372 | 8.349281 |
| 64 | 13.856000 | 8.436633 |
| 128 | 14.985943 | 8.515122 |
| 256 | 16.589968 | 9.101743 |
| 512 | 21.067943 | 11.768441 |
| 1024 | 30.023343 | 16.248130 |
| 2048 | 49.271691 | 18.698652 |
| 4096 | 88.226413 | 25.176768 |
| 10000 | 200.483472 | 38.246235 |
| 15000 | 295.839491 | 46.087448 |
| 20000 | 392.986784 | 57.157818 |
| 25000 | 490.661743 | 67.281158 |
| 30000 | 586.494532 | 81.038653 |
| 35000 | 683.142287 | 95.994056 |
| 40000 | 775.370903 | 110.547877 |

Kimi K2 Profile
main:

PR:

The fused gate kernel time in the profile drops from 14 µs to 9 µs.
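The kernel benchmark numbers above translate into a speedup over the Torch Compile baseline that grows with sequence length. A quick arithmetic check, with a few (seq_length, baseline, fused) triples copied from the table:

```python
# Values copied from the kernel benchmark table above.
rows = [
    (1, 10.687484, 7.983583),
    (2048, 49.271691, 18.698652),
    (40000, 775.370903, 110.547877),
]
speedups = {seq: baseline / fused for seq, baseline, fused in rows}
for seq, s in speedups.items():
    print(f"seq={seq:>6}: {s:.2f}x speedup")
```

The gain is modest at a single token (about 1.3x, where launch overhead dominates) and grows to roughly 7x at 40k tokens.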
End2End benchmark