【Hackathon 9th No.93】Add Minimax-m1 for FastDeploy (precision not yet aligned) #4629
Open
ZhijunLStudio wants to merge 16 commits into PaddlePaddle:develop from ZhijunLStudio:minimax-1023
+1,814 −45
Conversation
Thanks for your contribution!
Force-pushed from f41fd23 to e612bd6.
Background
The goal of this pull request is to add initial support for the MiniMax-M1 model to FastDeploy. The model uses a hybrid architecture that combines linear attention layers with standard Grouped-Query Attention (GQA) layers. This submission includes the model definition, custom Triton kernels for linear attention, and the integration into the model runner. The current focus is completing the functional implementation of the model and aligning its precision with the vLLM implementation.
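For illustration, here is a minimal sketch of how such a hybrid stack can select an attention kind per layer. This is not the actual FastDeploy code; names such as `attn_type_list` and `LayerSpec` are hypothetical:

```python
# Hypothetical sketch of a hybrid decoder stack that interleaves
# linear-attention layers with GQA layers, as MiniMax-M1 does.
from dataclasses import dataclass

LINEAR_ATTENTION = 0
GQA = 1


@dataclass
class LayerSpec:
    index: int
    attn_type: int  # LINEAR_ATTENTION or GQA


def build_layer_specs(num_layers: int, attn_type_list: list[int]) -> list[LayerSpec]:
    """Each entry of attn_type_list selects the attention kind for that layer."""
    assert len(attn_type_list) == num_layers
    return [LayerSpec(i, t) for i, t in enumerate(attn_type_list)]


# The first 8 layers of the pattern discussed in this PR:
# 7 linear-attention layers followed by one GQA layer.
specs = build_layer_specs(8, [LINEAR_ATTENTION] * 7 + [GQA])
print(["GQA" if s.attn_type == GQA else "linear" for s in specs])
```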
Changes
- Added fastdeploy/model_executor/models/minimax_m1.py, which defines the model structure and forward-pass logic.
- Added minimax_mamba_ops.py and minimax_mamba_kernels.py under ops/triton_ops/ to support the Mamba-like linear attention mechanism.
- Modified rotary_embedding.py to apply GLM-style rotary position embedding (RoPE) to the GQA layers of MiniMax-M1.
- Modified gpu_model_runner.py and forward_meta.py to support and manage the state caches (linear_attn_caches) required by the linear attention layers (see the sketch after this list).
- Modified config.py to add model-related configuration options.
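As referenced above, a rough sketch of what per-layer linear-attention state caching can look like. The field and helper names here, including `get_or_create_cache`, are hypothetical and do not reflect the actual forward_meta.py API:

```python
# Hypothetical sketch of forward metadata holding one recurrent state
# per linear-attention layer, keyed by layer index.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ForwardMeta:
    # For a Mamba/linear-attention style layer, the per-sequence state
    # is roughly [num_heads, head_dim, head_dim].
    linear_attn_caches: dict[int, np.ndarray] = field(default_factory=dict)

    def get_or_create_cache(self, layer_idx: int, num_heads: int, head_dim: int) -> np.ndarray:
        # Lazily allocate a zero-initialized state the first time a layer runs.
        if layer_idx not in self.linear_attn_caches:
            self.linear_attn_caches[layer_idx] = np.zeros(
                (num_heads, head_dim, head_dim), dtype=np.float32
            )
        return self.linear_attn_caches[layer_idx]


meta = ForwardMeta()
state = meta.get_or_create_cache(layer_idx=0, num_heads=8, head_dim=64)
print(state.shape)  # (8, 64, 64)
```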
Usage
The model was tested on a single machine with 8 GPUs.
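For reference, a request to a locally served model could look like the following, assuming the model is exposed through FastDeploy's OpenAI-compatible serving endpoint; the host, port, and model name below are placeholders, not values from this PR:

```python
# Illustrative client call against an assumed OpenAI-compatible endpoint.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "MiniMax-M1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())
```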
Precision testing
Precision is aligned with vLLM layer by layer. The model has 80 layers in total; this round of debugging focuses on the first 8 layers (7 linear attention layers + 1 GQA layer).
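The comparisons below come from logging per-tensor statistics on both sides. A minimal sketch of this kind of layer-by-layer check; the helper names are illustrative, not from the PR:

```python
# Hypothetical helpers for logging and comparing intermediate tensors.
import numpy as np


def summarize(name: str, t: np.ndarray) -> None:
    # Print the same statistics that appear in the log table below.
    print(f"{name}  mean={t.mean():.6f}  std={t.std():.6f}  shape={list(t.shape)}")


def compare(name: str, ours: np.ndarray, ref: np.ndarray,
            rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    # Log both sides, then check element-wise closeness.
    summarize(f"{name} (FastDeploy)", ours)
    summarize(f"{name} (vLLM)", ref)
    return bool(np.allclose(ours, ref, rtol=rtol, atol=atol))


rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 2048)).astype(np.float32)
ours = ref + 1e-5 * rng.standard_normal((4, 2048)).astype(np.float32)
print(compare("Q_BeforeRoPE", ours, ref))
```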
Part 1: Precision alignment of the linear attention layers (layers 0-7)
The outputs of the first 7 linear attention layers show high precision consistency with vLLM. Below is the log from layer 7 (0-indexed: the first GQA layer), demonstrating that the Q, K, and V tensor values essentially agree with vLLM before the attention computation.
Layer 7: QKV tensor comparison before RoPE/Attention. Each tensor is logged twice, once from each framework (FastDeploy vs. vLLM); the two statistics appear to be the mean and the standard deviation:
| Tensor | Mean | Std | Shape |
| --- | --- | --- | --- |
| After_QKV_Proj_Combined | -0.092525 | 1.627345 | [4, 2560] |
| After_QKV_Proj_Combined | -0.092362 | 1.624597 | [4, 2560] |
| Q_BeforeRoPE | -0.096660 | 1.136759 | [4, 2048] |
| Q_BeforeRoPE | -0.096524 | 1.135294 | [4, 2048] |
| K_BeforeRoPE | -0.151537 | 4.017039 | [4, 256] |
| K_BeforeRoPE | -0.151061 | 4.009206 | [4, 256] |

This confirms that the linear attention implementation, and the computation in all preceding layers, is correct.
Part 2: GQA layer (layer 8) precision mismatch
The problem appears in layer 8 (counting from 1; this is the same layer logged as layer 7 above), the first GQA layer in the model.
The fault lies in the implementation of GlmRotaryEmbedding: under the specific conditions of its application to MiniMax-M1's GQA layers, it produces incorrect results. The input tensors entering the RoPE function are correct, but the output is not.
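To make the behavior under debate concrete, here is a minimal sketch of a partial ("GLM-style") rotary embedding in which only the first `rotary_dim` channels of each head are rotated and the remaining channels pass through unchanged. This is an assumption-laden illustration of the kind of transform being debugged, not the FastDeploy or vLLM implementation:

```python
# Hypothetical partial RoPE: rotate the first rotary_dim channels of
# each head, leave the rest untouched.
import numpy as np


def apply_partial_rope(x: np.ndarray, positions: np.ndarray,
                       rotary_dim: int, base: float = 10000.0) -> np.ndarray:
    # x: [seq_len, num_heads, head_dim]; positions: [seq_len]
    rot, passthrough = x[..., :rotary_dim], x[..., rotary_dim:]
    inv_freq = 1.0 / base ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = positions[:, None] * inv_freq[None, :]   # [seq, rotary_dim / 2]
    cos = np.cos(angles)[:, None, :]                  # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = rot[..., : rotary_dim // 2], rot[..., rotary_dim // 2:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, passthrough], axis=-1)


q = np.random.default_rng(0).standard_normal((4, 16, 128)).astype(np.float32)
out = apply_partial_rope(q, np.arange(4), rotary_dim=64)
print(out.shape)  # (4, 16, 128)
```

A check of this shape, run on identical inputs in both frameworks, should isolate whether the divergence comes from the rotation itself (frequency computation, channel split) or from how the rotated and pass-through channels are recombined.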