
Conversation

@ZhijunLStudio
Contributor

This document is the RFC for adding the MiniMax-M1 model; it lays out the technical plan from CUDA operator development through full model integration.

@paddle-bot

paddle-bot bot commented Sep 16, 2025

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@luotao1
Collaborator

luotao1 commented Sep 16, 2025

@chang-wenbin


**Core technical path**:
1. **Reuse**: Maximize reuse of the Partial RoPE and standard GQA attention components already available from the GLM-4.5 PR.
2. **Translate and develop**: Translate vLLM's `lightning_attn.py` (Triton) into a high-performance CUDA C++ operator to support MiniMax-M1's linear attention layers (the underlying recurrence is sketched below).
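
For reference, the recurrence such an operator has to implement is simple to state. Below is a minimal NumPy sketch of decayed linear attention for a single head, assuming a scalar per-head decay; the exact decay parameterization in MiniMax-M1's lightning attention may differ, and the function name is ours, not from either codebase.

```python
import numpy as np

def linear_attn_reference(q, k, v, decay):
    # q, k, v: [T, D] for one head; decay: scalar in (0, 1).
    # State:  S_t = decay * S_{t-1} + outer(k_t, v_t)
    # Output: o_t = q_t @ S_t
    T, D = q.shape
    s = np.zeros((D, D))
    o = np.zeros((T, D))
    for t in range(T):
        s = decay * s + np.outer(k[t], v[t])
        o[t] = q[t] @ s
    return o
```

The state `s` is a fixed-size D x D matrix, which is what makes the layer linear in sequence length; the CUDA/Triton work is about computing this recurrence efficiently on GPU, not about changing the math.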


You can start with the Triton operator for quick validation while developing the high-performance CUDA kernel in parallel; FD currently supports running Triton operators.
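
For illustration, a quick-validation kernel along those lines might look like the following Triton sketch. It processes one (batch, head) pair per program and carries the D x D state sequentially over time. The kernel name, memory layout, and scalar DECAY constant are assumptions for this sketch, not FastDeploy's actual operator interface.

```python
import triton
import triton.language as tl

@triton.jit
def decayed_linear_attn_kernel(
    Q, K, V, O,            # pointers to [B*H, T, D] contiguous float32 buffers
    T,                     # sequence length (runtime value)
    DECAY: tl.constexpr,   # scalar decay; real models may use per-head decays
    D: tl.constexpr,       # head dim, must be a power of two for tl.arange
):
    pid = tl.program_id(0)                    # one (batch, head) pair per program
    offs = tl.arange(0, D)
    base = pid * T * D
    s = tl.zeros((D, D), dtype=tl.float32)    # running state S
    for t in range(T):
        ptr = base + t * D + offs
        q = tl.load(Q + ptr).to(tl.float32)
        k = tl.load(K + ptr).to(tl.float32)
        v = tl.load(V + ptr).to(tl.float32)
        s = DECAY * s + k[:, None] * v[None, :]   # S_t = d * S_{t-1} + k v^T
        o = tl.sum(q[:, None] * s, axis=0)        # o_t = q_t @ S_t
        tl.store(O + ptr, o)

# launch sketch, for tensors q/k/v/o of shape [B*H, T, D]:
# decayed_linear_attn_kernel[(B * H,)](q, k, v, o, T, DECAY=0.97, D=64)
```

The sequential loop over T keeps the sketch simple enough for correctness checks against a reference implementation; a production kernel would use the chunked formulation instead.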


---

### **Phase 1: [Core Development] Implement the Mamba/Linear-Attention CUDA Operator (2-4 weeks)**


From the current plan, the main development work is the linear attention. You can first try wiring in the Triton operator for quick validation; if attention performance turns out to be poor, then try implementing a CUDA kernel to optimize end-to-end performance.
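
As context for why a dedicated kernel can pay off: lightning attention gets its speed from a chunked decomposition, where attention is computed in parallel within a chunk via a masked, decayed QK^T, and history crosses chunk boundaries only through the carried state. A NumPy sketch of that decomposition, checked against the naive O(T^2) definition (chunk size and decay value are arbitrary choices for the example):

```python
import numpy as np

def chunked_linear_attn(q, k, v, decay, C=4):
    # Same recurrence as the sequential reference, computed chunk by chunk:
    # intra-chunk terms use a masked, decayed QK^T (fully parallel);
    # inter-chunk history enters through the carried D x D state `s`.
    T, D = q.shape
    o = np.zeros((T, D))
    s = np.zeros((D, D))
    for c0 in range(0, T, C):
        qc, kc, vc = q[c0:c0 + C], k[c0:c0 + C], v[c0:c0 + C]
        n = qc.shape[0]
        i = np.arange(n)[:, None]
        j = np.arange(n)[None, :]
        # intra-chunk: A[i, j] = decay^(i-j) * (q_i . k_j) for j <= i, else 0
        A = np.where(j <= i, decay ** np.maximum(i - j, 0), 0.0) * (qc @ kc.T)
        # inter-chunk: state contribution, decayed by 1-based position in chunk
        o[c0:c0 + n] = A @ vc + (decay ** (np.arange(n) + 1))[:, None] * (qc @ s)
        # carry state: S <- decay^n * S + sum_j decay^(n-1-j) * outer(k_j, v_j)
        w = (decay ** (n - 1 - np.arange(n)))[:, None]
        s = (decay ** n) * s + (kc * w).T @ vc
    return o

# sanity check against the naive definition o_t = sum_{u<=t} d^(t-u) (q_t.k_u) v_u
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))
o_ref = np.stack([
    sum((0.9 ** (t - u)) * (q[t] @ k[u]) * v[u] for u in range(t + 1))
    for t in range(q.shape[0])
])
assert np.allclose(chunked_linear_attn(q, k, v, 0.9), o_ref)
```

The intra-chunk matmuls map well onto tensor cores, which is where a tuned CUDA kernel would recover the end-to-end performance the sequential Triton sketch leaves on the table.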

Contributor Author

ok

@luotao1 luotao1 merged commit cbd6610 into PaddlePaddle:master Sep 19, 2025
1 check passed