diff --git a/README.md b/README.md
index 60b58b3b7de8..f7fc6111ac6a 100644
--- a/README.md
+++ b/README.md
@@ -32,17 +32,21 @@
 <a href="https://trendshift.io/repositories/2246" target="_blank"><img src="https://trendshift.io/api/badge/repositories/2246" alt="PaddlePaddle%2FPaddleNLP | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 
 ## News 📢
+
+* **2025.02.20 🔥🔥《PP-UIE信息抽取智能引擎全新升级》** 强化零样本学习能力，支持极少甚至零标注数据实现高效冷启动与迁移学习，显著降低数据标注成本；具备处理长文本能力，支持 8192 个Token长度文档信息抽取，实现跨段落识别关键信息，形成完整理解；提供完整可定制化的训练和推理全流程，训练效率相较于LLama-Factory实现了1.8倍的提升。
+2月26日（周三）19：00为您深度解析全新PP-UIE技术方案及在部署方面的功能、优势与技巧。报名链接：https://www.wjx.top/vm/mBKC6pb.aspx?udsid=606418
+
+* **2025.02.10 PaddleNLP 现已支持 DeepSeek-R1系列模型，[在线使用](https://aistudio.baidu.com/projectdetail/8775758)**：依托全新的 PaddleNLP 3.0套件，DeepSeek-R1系列模型现已全面支持。凭借数据并行、数据分组切分并行、模型并行、流水线并行以及专家并行等一系列先进的分布式训练能力，结合 Paddle 框架独有的列稀疏注意力掩码表示技术——FlashMask 方法，DeepSeek-R1系列模型在训练过程中显著降低了显存消耗，同时取得了卓越的训练性能提升。
+
 * **2024.12.16 [PaddleNLP v3.0 Beta3](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v3.0.0-beta3)**：大模型功能全新升级，新增了 Llama-3.2、DeepSeekV2模型，升级了 TokenizerFast，快速分词，重构了 SFTTrainer，一键开启 SFT 训练。此外，PaddleNLP 还支持了优化器状态的卸载和重载功能，实现了精细化的重新计算，训练性能提升7%。在 Unified Checkpoint 方面，进一步优化了异步保存逻辑，新增 Checkpoint 压缩功能，可节省78.5%存储空间。
 最后，在大模型推理方面，升级 Append Attention，支持了 FP8量化，支持投机解码。
 
+<details><summary> <b>点击展开</b> </summary><div>
+
 * **2024.12.13 📚《飞桨大模型套件 Unified Checkpoint 技术》**，加速模型存储95%，节省空间78%。支持全分布式策略调整自适应转换，提升模型训练的灵活性与可扩展性。训练-压缩-推理统一存储协议，无需手动转换提升全流程体验。Checkpoint 无损压缩结合异步保存，实现秒级存储并降低模型存储成本。适用于智能制造、指挥交通、医疗健康、金融服务等产业实际场景。12月24日（周二）19：00直播为您详细解读该技术如何优化大模型训练流程。报名链接：https://www.wjx.top/vm/huZkHn9.aspx?udsid=787976
 
 * **2024.11.28 📚《FlashRAG-Paddle | 基于 PaddleNLP 的高效开发与评测 RAG 框架》**，为文本更快更好构建准确嵌入表示、加速推理生成速度。PaddleNLP 支持超大 Batch 嵌入表示学习与多硬件高性能推理，涵盖 INT8/INT4量化技术及多种高效注意力机制优化与 TensorCore 深度优化。内置全环节算子融合技术，使得 FlashRAG 推理性能相比 transformers 动态图提升70%以上，结合检索增强知识输出结果更加准确，带来敏捷高效的使用体验。直播时间：12月3日（周二）19：00。报名链接：https://www.wjx.top/vm/eaBa1vA.aspx?udsid=682361
 
-
-
-<details><summary> <b>点击展开</b> </summary><div>
-
 * **2024.08.08 📚《飞桨产业级大语言模型开发利器 PaddleNLP 3.0 重磅发布》**，训压推全流程贯通，主流模型全覆盖。大模型自动并行，千亿模型训推全流程开箱即用。提供产业级高性能精调与对齐解决方案，压缩推理领先，多硬件适配。覆盖产业级智能助手、内容创作、知识问答、关键信息抽取等应用场景。直播时间：8月22日（周四）19：00。报名链接：https://www.wjx.top/vm/Y2f7FFY.aspx?udsid=143844
 
 * **2024.06.27 [PaddleNLP v3.0 Beta](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v3.0.0-beta0)**：拥抱大模型，体验全升级。统一大模型套件，实现国产计算芯片全流程接入；全面支持飞桨4D 并行配置、高效精调策略、高效对齐算法、高性能推理等大模型产业级应用流程；自研极致收敛的 RsLoRA+算法、自动扩缩容存储机制 Unified Checkpoint 和通用化支持的 FastFFN、FusedQKV 助力大模型训推；主流模型持续支持更新，提供高效解决方案。
@@ -79,35 +83,36 @@
 
 * 模型参数已支持 LLaMA 系列、Baichuan 系列、Bloom 系列、ChatGLM 系列、Gemma 系列、Mistral 系列、OPT 系列和 Qwen 系列，详细列表👉【LLM】模型参数支持列表如下：
 
-|                                          模型系列                                           | 模型名称                                                                                                                                                                                                                                                                                                                                                                                      |
-|:-------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|      [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)       | facebook/llama-7b, facebook/llama-13b, facebook/llama-30b, facebook/llama-65b                                                                                                                                                                                                                                                                                                                 |
-|      [Llama2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)      | meta-llama/Llama-2-7b, meta-llama/Llama-2-7b-chat, meta-llama/Llama-2-13b, meta-llama/Llama-2-13b-chat, meta-llama/Llama-2-70b, meta-llama/Llama-2-70b-chat                                                                                                                                                                                                                                   |
-|      [Llama3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)      | meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B, meta-llama/Meta-Llama-3-70B-Instruct                                                                                                                                                                                                                                                            |
-|     [Llama3.1](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)     | meta-llama/Meta-Llama-3.1-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Meta-Llama-3.1-405B, meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Llama-Guard-3-8B                                                                                                                                              |
-|     [Llama3.2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)     | meta-llama/Llama-3.2-1B, meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-Guard-3-1B                                                                                                                                                                                                                                             |
-|     [Llama3.3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)     | meta-llama/Llama-3.3-70B-Instruct                                                                                                                                                                                                                                                                                                                                                             |
-|   [Baichuan](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan)    | baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Base, baichuan-inc/Baichuan-13B-Chat                                                                                                                                                                                                                                                                                                      |
-|   [Baichuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan)   | baichuan-inc/Baichuan2-7B-Base, baichuan-inc/Baichuan2-7B-Chat, baichuan-inc/Baichuan2-13B-Base, baichuan-inc/Baichuan2-13B-Chat                                                                                                                                                                                                                                                              |
-|      [Bloom](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/bloom)       | bigscience/bloom-560m, bigscience/bloom-560m-bf16, bigscience/bloom-1b1, bigscience/bloom-3b, bigscience/bloom-7b1, bigscience/bloomz-560m, bigscience/bloomz-1b1, bigscience/bloomz-3b, bigscience/bloomz-7b1-mt, bigscience/bloomz-7b1-p3, bigscience/bloomz-7b1, bellegroup/belle-7b-2m                                                                                                    |
-|    [ChatGLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm/)    | THUDM/chatglm-6b, THUDM/chatglm-6b-v1.1                                                                                                                                                                                                                                                                                                                                                       |
-|   [ChatGLM2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2)    | THUDM/chatglm2-6b                                                                                                                                                                                                                                                                                                                                                                             |
-|   [ChatGLM3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2)    | THUDM/chatglm3-6b                                                                                                                                                                                                                                                                                                                                                                             |
-| [DeepSeekV2](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2) | deepseek-ai/DeepSeek-V2, deepseek-ai/DeepSeek-V2-Chat, deepseek-ai/DeepSeek-V2-Lite, deepseek-ai/DeepSeek-V2-Lite-Chat, deepseek-ai/DeepSeek-Coder-V2-Base, deepseek-ai/DeepSeek-Coder-V2-Instruct, deepseek-ai/DeepSeek-Coder-V2-Lite-Base, deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct                                                                                                      |
-| [DeepSeekV3](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2) | deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base                                                                                                                                                                                                                                                                                                                                         |
-| [DeepSeek-R1](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2) | deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero, deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B                                                                            |
-|      [Gemma](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/gemma)       | google/gemma-7b, google/gemma-7b-it, google/gemma-2b, google/gemma-2b-it                                                                                                                                                                                                                                                                                                                      |
-|    [Mistral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mistral)     | mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-7B-v0.1                                                                                                                                                                                                                                                                                                                                 |
-|    [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mixtral)     | mistralai/Mixtral-8x7B-Instruct-v0.1                                                                                                                                                                                                                                                                                                                                                          |
-|        [OPT](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/opt)         | facebook/opt-125m, facebook/opt-350m, facebook/opt-1.3b, facebook/opt-2.7b, facebook/opt-6.7b, facebook/opt-13b, facebook/opt-30b, facebook/opt-66b, facebook/opt-iml-1.3b, opt-iml-max-1.3b                                                                                                                                                                                                  |
-|       [Qwen](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)       | qwen/qwen-7b, qwen/qwen-7b-chat, qwen/qwen-14b, qwen/qwen-14b-chat, qwen/qwen-72b, qwen/qwen-72b-chat,                                                                                                                                                                                                                                                                                        |
-|     [Qwen1.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)      | Qwen/Qwen1.5-0.5B, Qwen/Qwen1.5-0.5B-Chat, Qwen/Qwen1.5-1.8B, Qwen/Qwen1.5-1.8B-Chat, Qwen/Qwen1.5-4B, Qwen/Qwen1.5-4B-Chat, Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, Qwen/Qwen1.5-14B, Qwen/Qwen1.5-14B-Chat, Qwen/Qwen1.5-32B, Qwen/Qwen1.5-32B-Chat, Qwen/Qwen1.5-72B, Qwen/Qwen1.5-72B-Chat, Qwen/Qwen1.5-110B, Qwen/Qwen1.5-110B-Chat, Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat |
-|      [Qwen2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)       | Qwen/Qwen2-0.5B, Qwen/Qwen2-0.5B-Instruct, Qwen/Qwen2-1.5B, Qwen/Qwen2-1.5B-Instruct, Qwen/Qwen2-7B, Qwen/Qwen2-7B-Instruct, Qwen/Qwen2-72B, Qwen/Qwen2-72B-Instruct, Qwen/Qwen2-57B-A14B, Qwen/Qwen2-57B-A14B-Instruct                                                                                                                                                                       |
-|    [Qwen2-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)    | Qwen/Qwen2-Math-1.5B, Qwen/Qwen2-Math-1.5B-Instruct, Qwen/Qwen2-Math-7B, Qwen/Qwen2-Math-7B-Instruct, Qwen/Qwen2-Math-72B, Qwen/Qwen2-Math-72B-Instruct, Qwen/Qwen2-Math-RM-72B                                                                                                                                                                                                               |
-|     [Qwen2.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)      | Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-32B, Qwen/Qwen2.5-32B-Instruct, Qwen/Qwen2.5-72B, Qwen/Qwen2.5-72B-Instruct                                                                     |
-|   [Qwen2.5-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)   | Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen2.5-Math-7B, Qwen/Qwen2.5-Math-7B-Instruct, Qwen/Qwen2.5-Math-72B, Qwen/Qwen2.5-Math-72B-Instruct, Qwen/Qwen2.5-Math-RM-72B                                                                                                                                                                                                 |
-|  [Qwen2.5-Coder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)   | Qwen/Qwen2.5-Coder-1.5B, Qwen/Qwen2.5-Coder-1.5B-Instruct, Qwen/Qwen2.5-Coder-7B, Qwen/Qwen2.5-Coder-7B-Instruct                                                                                                                                                                                                                                                                              |
-|      [Yuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/yuan/)       | IEITYuan/Yuan2-2B, IEITYuan/Yuan2-51B, IEITYuan/Yuan2-102B                                                                                                                                                                                                                                                                                                                                    |
+|                                                模型系列                                                 | 模型名称                                                                                                                                                                                                                                                                                                                                                                                      |
+|:-------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [PP-UIE](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/application/information_extraction) | paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B                                                                                                                                                                                                                                                                                                       |
+|            [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)             | facebook/llama-7b, facebook/llama-13b, facebook/llama-30b, facebook/llama-65b                                                                                                                                                                                                                                                                                                                 |
+|            [Llama2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)            | meta-llama/Llama-2-7b, meta-llama/Llama-2-7b-chat, meta-llama/Llama-2-13b, meta-llama/Llama-2-13b-chat, meta-llama/Llama-2-70b, meta-llama/Llama-2-70b-chat                                                                                                                                                                                                                                   |
+|            [Llama3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)            | meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B, meta-llama/Meta-Llama-3-70B-Instruct                                                                                                                                                                                                                                                            |
+|           [Llama3.1](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)           | meta-llama/Meta-Llama-3.1-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Meta-Llama-3.1-405B, meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Llama-Guard-3-8B                                                                                                                                              |
+|           [Llama3.2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)           | meta-llama/Llama-3.2-1B, meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-Guard-3-1B                                                                                                                                                                                                                                             |
+|           [Llama3.3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama)           | meta-llama/Llama-3.3-70B-Instruct                                                                                                                                                                                                                                                                                                                                                             |
+|         [Baichuan](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan)          | baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Base, baichuan-inc/Baichuan-13B-Chat                                                                                                                                                                                                                                                                                                      |
+|         [Baichuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan)         | baichuan-inc/Baichuan2-7B-Base, baichuan-inc/Baichuan2-7B-Chat, baichuan-inc/Baichuan2-13B-Base, baichuan-inc/Baichuan2-13B-Chat                                                                                                                                                                                                                                                              |
+|            [Bloom](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/bloom)             | bigscience/bloom-560m, bigscience/bloom-560m-bf16, bigscience/bloom-1b1, bigscience/bloom-3b, bigscience/bloom-7b1, bigscience/bloomz-560m, bigscience/bloomz-1b1, bigscience/bloomz-3b, bigscience/bloomz-7b1-mt, bigscience/bloomz-7b1-p3, bigscience/bloomz-7b1, bellegroup/belle-7b-2m                                                                                                    |
+|          [ChatGLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm/)          | THUDM/chatglm-6b, THUDM/chatglm-6b-v1.1                                                                                                                                                                                                                                                                                                                                                       |
+|         [ChatGLM2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2)          | THUDM/chatglm2-6b                                                                                                                                                                                                                                                                                                                                                                             |
+|         [ChatGLM3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2)          | THUDM/chatglm3-6b                                                                                                                                                                                                                                                                                                                                                                             |
+|       [DeepSeekV2](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2)       | deepseek-ai/DeepSeek-V2, deepseek-ai/DeepSeek-V2-Chat, deepseek-ai/DeepSeek-V2-Lite, deepseek-ai/DeepSeek-V2-Lite-Chat, deepseek-ai/DeepSeek-Coder-V2-Base, deepseek-ai/DeepSeek-Coder-V2-Instruct, deepseek-ai/DeepSeek-Coder-V2-Lite-Base, deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct                                                                                                      |
+|       [DeepSeekV3](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2)       | deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base                                                                                                                                                                                                                                                                                                                                         |
+|      [DeepSeek-R1](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2)       | deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero, deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B                                                                            |
+|            [Gemma](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/gemma)             | google/gemma-7b, google/gemma-7b-it, google/gemma-2b, google/gemma-2b-it                                                                                                                                                                                                                                                                                                                      |
+|          [Mistral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mistral)           | mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-7B-v0.1                                                                                                                                                                                                                                                                                                                                 |
+|          [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mixtral)           | mistralai/Mixtral-8x7B-Instruct-v0.1                                                                                                                                                                                                                                                                                                                                                          |
+|              [OPT](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/opt)               | facebook/opt-125m, facebook/opt-350m, facebook/opt-1.3b, facebook/opt-2.7b, facebook/opt-6.7b, facebook/opt-13b, facebook/opt-30b, facebook/opt-66b, facebook/opt-iml-1.3b, opt-iml-max-1.3b                                                                                                                                                                                                  |
+|             [Qwen](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)             | qwen/qwen-7b, qwen/qwen-7b-chat, qwen/qwen-14b, qwen/qwen-14b-chat, qwen/qwen-72b, qwen/qwen-72b-chat,                                                                                                                                                                                                                                                                                        |
+|           [Qwen1.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)            | Qwen/Qwen1.5-0.5B, Qwen/Qwen1.5-0.5B-Chat, Qwen/Qwen1.5-1.8B, Qwen/Qwen1.5-1.8B-Chat, Qwen/Qwen1.5-4B, Qwen/Qwen1.5-4B-Chat, Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, Qwen/Qwen1.5-14B, Qwen/Qwen1.5-14B-Chat, Qwen/Qwen1.5-32B, Qwen/Qwen1.5-32B-Chat, Qwen/Qwen1.5-72B, Qwen/Qwen1.5-72B-Chat, Qwen/Qwen1.5-110B, Qwen/Qwen1.5-110B-Chat, Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat |
+|            [Qwen2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)             | Qwen/Qwen2-0.5B, Qwen/Qwen2-0.5B-Instruct, Qwen/Qwen2-1.5B, Qwen/Qwen2-1.5B-Instruct, Qwen/Qwen2-7B, Qwen/Qwen2-7B-Instruct, Qwen/Qwen2-72B, Qwen/Qwen2-72B-Instruct, Qwen/Qwen2-57B-A14B, Qwen/Qwen2-57B-A14B-Instruct                                                                                                                                                                       |
+|          [Qwen2-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)          | Qwen/Qwen2-Math-1.5B, Qwen/Qwen2-Math-1.5B-Instruct, Qwen/Qwen2-Math-7B, Qwen/Qwen2-Math-7B-Instruct, Qwen/Qwen2-Math-72B, Qwen/Qwen2-Math-72B-Instruct, Qwen/Qwen2-Math-RM-72B                                                                                                                                                                                                               |
+|           [Qwen2.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)            | Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-32B, Qwen/Qwen2.5-32B-Instruct, Qwen/Qwen2.5-72B, Qwen/Qwen2.5-72B-Instruct                                                                     |
+|         [Qwen2.5-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)         | Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen2.5-Math-7B, Qwen/Qwen2.5-Math-7B-Instruct, Qwen/Qwen2.5-Math-72B, Qwen/Qwen2.5-Math-72B-Instruct, Qwen/Qwen2.5-Math-RM-72B                                                                                                                                                                                                 |
+|        [Qwen2.5-Coder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/)         | Qwen/Qwen2.5-Coder-1.5B, Qwen/Qwen2.5-Coder-1.5B-Instruct, Qwen/Qwen2.5-Coder-7B, Qwen/Qwen2.5-Coder-7B-Instruct                                                                                                                                                                                                                                                                              |
+|            [Yuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/yuan/)             | IEITYuan/Yuan2-2B, IEITYuan/Yuan2-51B, IEITYuan/Yuan2-102B                                                                                                                                                                                                                                                                                                                                    |
 
 * 4D 并行和算子优化已支持 LLaMA 系列、Baichuan 系列、Bloom 系列、ChatGLM 系列、Gemma 系列、Mistral 系列、OPT 系列和 Qwen 系列，【LLM】模型4D 并行和算子支持列表如下：
 
@@ -166,7 +171,7 @@
 ### 环境依赖
 
 * python >= 3.8
-* paddlepaddle >= 3.0.0b0
+* paddlepaddle >= 3.0.0rc0
 
 如果您尚未安装 PaddlePaddle，请参考 [飞桨官网](https://www.paddlepaddle.org.cn/) 进行安装。
 
@@ -211,7 +216,7 @@ wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwe
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
 cd .. # change folder to PaddleNLP/llm
 # 如需使用use_fused_rms_norm=true，需要前往slm/model_zoo/gpt-3/external_ops安装fused_ln
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json --use_fused_rms_norm false
+python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
 ```
 
 ### 大模型 SFT 精调
@@ -221,7 +226,7 @@ git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP # 如已
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
 cd .. # change folder to PaddleNLP/llm
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
 ```
 
 更多大模型全流程步骤，请参考[飞桨大模型套件](./llm)介绍。
@@ -236,7 +241,7 @@ dataset = load_dataset("ZHUI/alpaca_demo", split="train")
 training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", device="gpu")
 trainer = SFTTrainer(
     args=training_args,
-    model="Qwen/Qwen2.5-0.5B",
+    model="Qwen/Qwen2.5-0.5B-Instruct",
     train_dataset=dataset,
 )
 trainer.train()
diff --git a/csrc/gpu/append_attention.cu b/csrc/gpu/append_attention.cu
index d24a20e48d11..d6f3efbbf3df 100644
--- a/csrc/gpu/append_attention.cu
+++ b/csrc/gpu/append_attention.cu
@@ -56,6 +56,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
     const std::string& cache_quant_type_str,
     const bool use_neox_rotary_style,
     const int max_input_length,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float out_linear_in_scale,
@@ -97,13 +98,13 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
   if (out_linear_in_scale > 0.0) {
     if (fabs(quant_max_bound - 127.0f) < 0.000001) {
       fmha_out = GetEmptyTensor(
-        {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims},
+        {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v},
         paddle::DataType::INT8,
         qkv.place());
     } 
     else if (fabs(quant_max_bound - 448.0f) < 0.000001) {
       fmha_out = GetEmptyTensor(
-        {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims},
+        {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v},
         paddle::DataType::FLOAT8_E4M3FN,
         qkv.place());
     }else{
@@ -111,7 +112,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
     }
   } else {
     fmha_out = GetEmptyTensor(
-        {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims},
+        {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v},
         D,
         qkv.place());
   }
@@ -203,6 +204,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
             encoder_block_shape_q,
             max_input_length,
             max_enc_len_this_time_data,
+            softmax_scale,
             quant_max_bound,
             quant_min_bound,
             out_linear_in_scale,
@@ -240,6 +242,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
             encoder_block_shape_q,
             max_input_length,
             max_enc_len_this_time_data,
+            softmax_scale,
             quant_max_bound,
             quant_min_bound,
             out_linear_in_scale,
@@ -282,6 +285,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
           encoder_block_shape_q,
           max_input_length,
           max_enc_len_this_time_data,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           out_linear_in_scale,
@@ -428,6 +432,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
             decoder_block_shape_q,
             max_input_length,
             max_len_kv_data,
+            softmax_scale,
             quant_max_bound,
             quant_min_bound,
             out_linear_in_scale,
@@ -465,6 +470,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
             decoder_block_shape_q,
             max_input_length,
             max_len_kv_data,
+            softmax_scale,
             quant_max_bound,
             quant_min_bound,
             out_linear_in_scale,
@@ -508,6 +514,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
           decoder_block_shape_q,
           max_input_length,
           max_len_kv_data,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           out_linear_in_scale,
@@ -565,6 +572,7 @@ std::vector<paddle::Tensor> AppendAttention(
     const std::string& cache_quant_type_str,
     const bool use_neox_rotary_style,
     const int max_input_length,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float out_linear_in_scale,
@@ -578,9 +586,10 @@ std::vector<paddle::Tensor> AppendAttention(
   meta_data.token_nums = qkv_dims[0];
   meta_data.kv_num_heads = key_cache_dims[1];
   meta_data.head_dims = key_cache_dims[3];
-  const int total_num_head =
-      qkv_dims[qkv_dims.size() - 1] / meta_data.head_dims;
-  meta_data.q_num_heads = total_num_head - 2 * meta_data.kv_num_heads;
+  meta_data.head_dims_v = value_cache.dims()[3];
+  const int q_hidden_size =
+      qkv_dims[qkv_dims.size() - 1] - meta_data.kv_num_heads * (meta_data.head_dims + meta_data.head_dims_v);
+  meta_data.q_num_heads = q_hidden_size / meta_data.head_dims;
 
   meta_data.max_blocks_per_seq = block_tables.dims()[1];
   meta_data.block_size = key_cache.dims()[2];
@@ -626,6 +635,7 @@ std::vector<paddle::Tensor> AppendAttention(
           cache_quant_type_str,
           use_neox_rotary_style,
           max_input_length,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           out_linear_in_scale,
@@ -672,6 +682,7 @@ std::vector<paddle::Tensor> AppendAttention(
           cache_quant_type_str,
           use_neox_rotary_style,
           max_input_length,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           out_linear_in_scale,
@@ -719,6 +730,7 @@ std::vector<paddle::Tensor> AppendAttention(
             cache_quant_type_str,
             use_neox_rotary_style,
             max_input_length,
+            softmax_scale,
             quant_max_bound,
             quant_min_bound,
             out_linear_in_scale,
@@ -764,6 +776,7 @@ std::vector<paddle::Tensor> AppendAttention(
             cache_quant_type_str,
             use_neox_rotary_style,
             max_input_length,
+            softmax_scale,
             quant_max_bound,
             quant_min_bound,
             out_linear_in_scale,
@@ -821,10 +834,12 @@ std::vector<std::vector<int64_t>> AppendAttentionInferShape(
     const paddle::optional<std::vector<int64_t>>& out_linear_smooths_shape) {
   const int token_num = qkv_shape[0];
   const int kv_num_heads = key_cache_shape[1];
-  const int head_dim = key_cache_shape[3];
-  const int total_num_head = qkv_shape[qkv_shape.size() - 1] / head_dim;
-  const int num_heads = total_num_head - 2 * kv_num_heads;
-  return {{token_num, num_heads * head_dim}, qkv_shape};
+  const int head_dim_qk = key_cache_shape[3];
+  const int head_dim_v = value_cache_shape[3];
+  const int q_hidden_size =
+      qkv_shape[qkv_shape.size() - 1] - kv_num_heads * (head_dim_qk + head_dim_v);
+  const int num_heads = q_hidden_size / head_dim_qk;
+  return {{token_num, num_heads * head_dim_v}, qkv_shape};
 }
 
 std::vector<paddle::DataType> AppendAttentionInferDtype(
@@ -865,6 +880,7 @@ std::vector<paddle::DataType> AppendAttentionInferDtype(
     const std::string& cache_quant_type_str,
     const bool use_neox_rotary_style,
     const int max_input_length,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float out_linear_in_scale,
@@ -941,6 +957,7 @@ PD_BUILD_OP(append_attention)
             "cache_quant_type: std::string",
             "use_neox_rotary_style: bool",
             "max_input_length: int",
+            "softmax_scale: float",
             "quant_max_bound: float",
             "quant_min_bound: float",
             "out_linear_in_scale: float",
diff --git a/csrc/gpu/append_attn/append_attention_c16_impl.cuh b/csrc/gpu/append_attn/append_attention_c16_impl.cuh
index 3b08d0a85dbc..8b75fa13cdca 100644
--- a/csrc/gpu/append_attn/append_attention_c16_impl.cuh
+++ b/csrc/gpu/append_attn/append_attention_c16_impl.cuh
@@ -23,15 +23,17 @@ template <typename T,
           uint32_t NUM_WARPS,
           uint32_t NUM_WARP_Q,
           uint32_t NUM_WARP_KV,
-          uint32_t HEAD_DIM,
+          uint32_t HEAD_DIM_QK,
+          uint32_t HEAD_DIM_V,
           uint32_t BLOCK_SIZE,
           uint32_t num_frags_x,
           uint32_t num_frags_z,
-          uint32_t num_frags_y,
+          uint32_t num_frags_y_qk,
+          uint32_t num_frags_y_v,
           typename OutT = T,
           bool ENABLE_PREFILL = true>
 __global__ void multi_query_append_attention_kernel(
-    T *__restrict__ q,        // [token_num, (num_heads + 2* kv_num_head) * head_dim]
+    T *__restrict__ q,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
     T *__restrict__ cache_k,  // [max_block_num, num_heads, block_size,
                               // head_dim]
     T *__restrict__ cache_v,
@@ -46,7 +48,7 @@ __global__ void multi_query_append_attention_kernel(
     const int max_seq_len,
     const int max_dec_len,
     const int max_block_num_per_seq,
-    const float scale,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -57,7 +59,9 @@ __global__ void multi_query_append_attention_kernel(
     float *__restrict__ tmp_d,      // [token_num, num_chunks, num_heads]
     OutT *__restrict__ out,
     const int speculate_max_draft_token_num = 5) {
-  constexpr uint32_t num_vecs_per_head = HEAD_DIM / num_elems_per_128b<T>();
+  constexpr uint32_t num_vecs_per_head_qk =
+      HEAD_DIM_QK / num_elems_per_128b<T>();
+  constexpr uint32_t num_vecs_per_head_v = HEAD_DIM_V / num_elems_per_128b<T>();
   const uint32_t btid = blockIdx.x, kv_head_idx = blockIdx.z;
   const uint32_t kv_num_heads = gridDim.z;
   const uint32_t q_num_heads = kv_num_heads * GROUP_SIZE;
@@ -104,25 +108,30 @@ __global__ void multi_query_append_attention_kernel(
 
   extern __shared__ uint8_t smem[];
   float s_frag[num_frags_x][num_frags_z][8];
-  float o_frag[num_frags_x][num_frags_y][8];
+  float o_frag[num_frags_x][num_frags_y_v][8];
   float m_frag[num_frags_x][2];
   float d_frag[num_frags_x][2];
-  init_states<T, num_frags_x, num_frags_y>(o_frag, m_frag, d_frag);
-
-  const uint32_t q_n_stride = q_num_heads * HEAD_DIM;
-  const uint32_t q_ori_n_stride = (q_num_heads + kv_num_heads * 2) * HEAD_DIM;
-  const uint32_t kv_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM;
-  const uint32_t kv_h_stride = BLOCK_SIZE * HEAD_DIM;
-  const uint32_t kv_b_stride = HEAD_DIM;
+  init_states<T, num_frags_x, num_frags_y_v>(o_frag, m_frag, d_frag);
+
+  const uint32_t q_n_stride = q_num_heads * HEAD_DIM_V;
+  const uint32_t q_ori_n_stride = q_num_heads * HEAD_DIM_QK +
+                                  kv_num_heads * HEAD_DIM_QK +
+                                  kv_num_heads * HEAD_DIM_V;
+  const uint32_t k_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_QK;
+  const uint32_t k_h_stride = BLOCK_SIZE * HEAD_DIM_QK;
+  const uint32_t k_b_stride = HEAD_DIM_QK;
+  const uint32_t v_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_V;
+  const uint32_t v_h_stride = BLOCK_SIZE * HEAD_DIM_V;
+  const uint32_t v_b_stride = HEAD_DIM_V;
   const uint32_t q_start_seq_id =
       batch_id * max_seq_len - __ldg(&cum_offsets[batch_id]);
   const uint32_t q_base_seq_id_this_block =
       (tile_id * NUM_WARPS + wid) * num_frags_x * 16;
   const uint32_t q_offset = q_start_seq_id * q_ori_n_stride +
-                            q_head_idx * HEAD_DIM +
+                            q_head_idx * HEAD_DIM_QK +
                             tid % 8 * num_elems_per_128b<T>();
   const uint32_t o_offset = q_start_seq_id * q_n_stride +
-                            q_head_idx * HEAD_DIM +
+                            q_head_idx * HEAD_DIM_V +
                             tid % 8 * num_elems_per_128b<T>();
   T *q_base_ptr = q + q_offset;
   T *o_base_ptr_T = nullptr;
@@ -130,13 +139,13 @@ __global__ void multi_query_append_attention_kernel(
   if constexpr (partition_kv) {
     if (ENABLE_PREFILL) {
       o_base_ptr_T = tmp_workspace + q_start_seq_id * num_chunks * q_n_stride +
-                     chunk_idx * q_n_stride + q_head_idx * HEAD_DIM +
+                     chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V +
                      tid % 8 * num_elems_per_128b<T>();
     } else {
       o_base_ptr_T =
           tmp_workspace +
           batch_id * speculate_max_draft_token_num * num_chunks * q_n_stride +
-          chunk_idx * q_n_stride + q_head_idx * HEAD_DIM +
+          chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V +
           tid % 8 * num_elems_per_128b<T>();
     }
   } else {
@@ -144,24 +153,42 @@ __global__ void multi_query_append_attention_kernel(
   }
   smem_t qo_smem(smem);
 
-  uint32_t q_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head>(
+  uint32_t q_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head_qk>(
       wid * num_frags_x * 16 + tid % 16, tid / 16);  // 16 * 16
-  load_q_global_smem<GROUP_SIZE, num_frags_x, num_frags_y, HEAD_DIM, T>(
+  load_q_global_smem<GROUP_SIZE, num_frags_x, num_frags_y_qk, HEAD_DIM_QK, T>(
       q_base_ptr,
       &qo_smem,
       q_base_seq_id_this_block,
       q_end,
       q_ori_n_stride,
-      HEAD_DIM);
+      HEAD_DIM_QK);
   commit_group();
   wait_group<0>();
   __syncthreads();
+#ifdef DEBUG_PERCISION
+  if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) {
+    printf("q_smem(%d * 192个bfloat16):\n", 4 * num_frags_x * 16);
+    // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * sizeof(CacheT);
+    T *q_smem_t = reinterpret_cast<T *>(qo_smem.base);
+    for (uint32_t i = 0; i < 4 * num_frags_x * 16; ++i) {
+      printf("q_smem[%d]:", (int)i);
+      for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) {
+        printf("[");
+        for (uint32_t k = 0; k < 8; ++k) {
+          printf("%.2f ", (float)q_smem_t[i * HEAD_DIM_QK + j * 8 + k]);
+        }
+        printf("]");
+      }
+      printf("\n");
+    }
+  }
+  __syncthreads();
+#endif
+  q_smem_inplace_multiply_sm_scale<num_frags_x, num_frags_y_qk, T>(
+      &qo_smem, softmax_scale);
 
-  q_smem_inplace_multiply_sm_scale<num_frags_x, num_frags_y, T>(&qo_smem,
-                                                                scale);
-
-  smem_t k_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM * sizeof(T)),
-      v_smem(smem + (NUM_WARPS * num_frags_x + num_frags_z) * 16 * HEAD_DIM *
+  smem_t k_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM_QK * sizeof(T)),
+      v_smem(smem + (NUM_WARPS * num_frags_x + num_frags_z) * 16 * HEAD_DIM_QK *
                         sizeof(T));
 
 
@@ -182,50 +209,55 @@ __global__ void multi_query_append_attention_kernel(
                          chunk_start)))
               : chunk_len) /
       (num_frags_z * 16);
-  uint32_t k_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head>(
+  uint32_t k_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head_qk>(
       8 * (tid / 16) + tid % 8, (tid % 16) / 8);
 
   uint32_t v_smem_offset_r =
-      smem_t::get_permuted_offset<num_vecs_per_head>(tid % 16, tid / 16);
+      smem_t::get_permuted_offset<num_vecs_per_head_v>(tid % 16, tid / 16);
 
-  uint32_t kv_smem_offset_w = smem_t::get_permuted_offset<num_vecs_per_head>(
+  uint32_t k_smem_offset_w = smem_t::get_permuted_offset<num_vecs_per_head_qk>(
+      wid * 4 + tid / 8, tid % 8);
+  uint32_t v_smem_offset_w = smem_t::get_permuted_offset<num_vecs_per_head_v>(
       wid * 4 + tid / 8, tid % 8);
 
   uint32_t kv_idx_base = chunk_start;
   int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]);
-  const uint32_t const_offset = kv_head_idx * kv_h_stride +
-                                (wid * 4 + tid / 8) * kv_b_stride +
-                                tid % 8 * num_elems_per_128b<T>();
-  T *cache_k_now = cache_k + block_id * kv_n_stride + const_offset;
-  T *cache_v_now = cache_v + block_id * kv_n_stride + const_offset;
+  const uint32_t const_offset_k = kv_head_idx * k_h_stride +
+                                  (wid * 4 + tid / 8) * k_b_stride +
+                                  tid % 8 * num_elems_per_128b<T>();
+  const uint32_t const_offset_v = kv_head_idx * v_h_stride +
+                                  (wid * 4 + tid / 8) * v_b_stride +
+                                  tid % 8 * num_elems_per_128b<T>();
+  T *cache_k_now = cache_k + block_id * k_n_stride + const_offset_k;
+  T *cache_v_now = cache_v + block_id * v_n_stride + const_offset_v;
 
   produce_kv_blockwise<SharedMemFillMode::kNoFill,
                        NUM_WARPS,
                        BLOCK_SIZE,
-                       num_frags_y,
+                       num_frags_y_qk,
                        num_frags_z,
                        NUM_WARP_Q>(k_smem,
-                                   &kv_smem_offset_w,
+                                   &k_smem_offset_w,
                                    &cache_k_now,
                                    kv_head_idx,
-                                   kv_n_stride,
-                                   kv_h_stride,
-                                   kv_b_stride,
+                                   k_n_stride,
+                                   k_h_stride,
+                                   k_b_stride,
                                    kv_idx_base,
                                    chunk_end);
   commit_group();
   produce_kv_blockwise<SharedMemFillMode::kFillZero,
                        NUM_WARPS,
                        BLOCK_SIZE,
-                       num_frags_y,
+                       num_frags_y_v,
                        num_frags_z,
                        NUM_WARP_Q>(v_smem,
-                                   &kv_smem_offset_w,
+                                   &v_smem_offset_w,
                                    &cache_v_now,
                                    kv_head_idx,
-                                   kv_n_stride,
-                                   kv_h_stride,
-                                   kv_b_stride,
+                                   v_n_stride,
+                                   v_h_stride,
+                                   v_b_stride,
                                    kv_idx_base,
                                    chunk_end);
   commit_group();
@@ -233,10 +265,45 @@ __global__ void multi_query_append_attention_kernel(
   for (uint32_t iter = 0; iter < num_iterations; ++iter) {
     wait_group<1>();
     __syncthreads();
-
+#ifdef DEBUG_PERCISION
+    if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) {
+      printf("k_smem(%d * 192个bfloat16):\n", num_frags_z * 16);
+      // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 *
+      // sizeof(CacheT);
+      T *k_smem_t = reinterpret_cast<T *>(k_smem.base);
+      for (uint32_t i = 0; i < num_frags_z * 16; ++i) {
+        printf("k_smem[%d]:", (int)i);
+        for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) {
+          printf("[");
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.2f ", (float)k_smem_t[i * HEAD_DIM_QK + j * 8 + k]);
+          }
+          printf("]");
+        }
+        printf("\n");
+      }
+    }
+    __syncthreads();
+#endif
     // s = qk
-    compute_qk<num_frags_x, num_frags_y, num_frags_z, T>(
+    compute_qk<num_frags_x, num_frags_y_qk, num_frags_z, T>(
         &qo_smem, &q_smem_offset_r, &k_smem, &k_smem_offset_r, s_frag);
+#ifdef DEBUG_PERCISION
+    __syncthreads();
+    if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 &&
+        blockIdx.x == 0) {
+      for (uint32_t i = 0; i < num_frags_x; ++i) {
+        for (uint32_t j = 0; j < num_frags_z; ++j) {
+          printf("s_frag[%d][%d]:\n", i, j);
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.4f ", s_frag[i][j][k]);
+          }
+          printf("\n");
+        }
+      }
+    }
+    __syncthreads();
+#endif
     // mask according to kv_idx and q_idx
     if (iter >= mask_check_iteration) {
       mask_s<T,
@@ -245,7 +312,7 @@ __global__ void multi_query_append_attention_kernel(
              GROUP_SIZE,
              NUM_WARPS,
              num_frags_x,
-             num_frags_y,
+             num_frags_y_v,
              num_frags_z>(q_base_seq_id_this_block,
                           kv_idx_base,
                           q_len,
@@ -255,7 +322,7 @@ __global__ void multi_query_append_attention_kernel(
     }
 
     // update m,d
-    update_mdo_states<num_frags_x, num_frags_y, num_frags_z>(
+    update_mdo_states<num_frags_x, num_frags_y_v, num_frags_z>(
         s_frag, o_frag, m_frag, d_frag);
     __syncthreads();
 
@@ -264,43 +331,77 @@ __global__ void multi_query_append_attention_kernel(
     if (block_id < 0) {
       block_id = 0;
     }
-    cache_k_now = cache_k + block_id * kv_n_stride + const_offset;
+    cache_k_now = cache_k + block_id * k_n_stride + const_offset_k;
     produce_kv_blockwise<SharedMemFillMode::kNoFill,
                          NUM_WARPS,
                          BLOCK_SIZE,
-                         num_frags_y,
+                         num_frags_y_qk,
                          num_frags_z,
                          NUM_WARP_Q>(k_smem,
-                                     &kv_smem_offset_w,
+                                     &k_smem_offset_w,
                                      &cache_k_now,
                                      kv_head_idx,
-                                     kv_n_stride,
-                                     kv_h_stride,
-                                     kv_b_stride,
+                                     k_n_stride,
+                                     k_h_stride,
+                                     k_b_stride,
                                      kv_idx_base,
                                      chunk_end);
     commit_group();
     wait_group<1>();
     __syncthreads();
-
+#ifdef DEBUG_PERCISION
+    if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) {
+      printf("v_smem(%d * 128个bfloat16):\n", num_frags_z * 16);
+      // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 *
+      // sizeof(CacheT);
+      T *v_smem_t = reinterpret_cast<T *>(v_smem.base);
+      for (uint32_t i = 0; i < num_frags_z * 16; ++i) {
+        printf("v_smem[%d]:", (int)i);
+        for (uint32_t j = 0; j < HEAD_DIM_V / 8; ++j) {
+          printf("[");
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.2f ", (float)v_smem_t[i * HEAD_DIM_V + j * 8 + k]);
+          }
+          printf("]");
+        }
+        printf("\n");
+      }
+    }
+    __syncthreads();
+#endif
     // compute sfm*v
-    compute_sfm_v<num_frags_x, num_frags_y, num_frags_z, T>(
+    compute_sfm_v<num_frags_x, num_frags_y_v, num_frags_z, T>(
         &v_smem, &v_smem_offset_r, s_frag, o_frag, d_frag);
-
+#ifdef DEBUG_PERCISION
+    __syncthreads();
+    if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 &&
+        blockIdx.x == 0) {
+      for (uint32_t i = 0; i < num_frags_x; ++i) {
+        for (uint32_t j = 0; j < num_frags_y_v; ++j) {
+          printf("o_frag[%d][%d]:\n", i, j);
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.4f ", s_frag[i][j][k]);
+          }
+          printf("\n");
+        }
+      }
+    }
     __syncthreads();
-    cache_v_now = cache_v + block_id * kv_n_stride + const_offset;
+#endif
+    __syncthreads();
+    cache_v_now = cache_v + block_id * v_n_stride + const_offset_v;
     produce_kv_blockwise<SharedMemFillMode::kFillZero,
                          NUM_WARPS,
                          BLOCK_SIZE,
-                         num_frags_y,
+                         num_frags_y_v,
                          num_frags_z,
                          NUM_WARP_Q>(v_smem,
-                                     &kv_smem_offset_w,
+                                     &v_smem_offset_w,
                                      &cache_v_now,
                                      kv_head_idx,
-                                     kv_n_stride,
-                                     kv_h_stride,
-                                     kv_b_stride,
+                                     v_n_stride,
+                                     v_h_stride,
+                                     v_b_stride,
                                      kv_idx_base,
                                      chunk_end);
     commit_group();
@@ -309,12 +410,28 @@ __global__ void multi_query_append_attention_kernel(
   __syncthreads();
 
   if constexpr (!partition_kv) {
-    normalize_d<num_frags_x, num_frags_y>(o_frag, d_frag);
+    normalize_d<num_frags_x, num_frags_y_v>(o_frag, d_frag);
+  }
+#ifdef DEBUG_PERCISION
+  __syncthreads();
+  if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 &&
+      blockIdx.x == 0) {
+    for (uint32_t i = 0; i < num_frags_x; ++i) {
+      for (uint32_t j = 0; j < num_frags_y_v; ++j) {
+        printf("o_frag[%d][%d]:\n", i, j);
+        for (uint32_t k = 0; k < 8; ++k) {
+          printf("%.4f ", s_frag[i][j][k]);
+        }
+        printf("\n");
+      }
+    }
   }
+  __syncthreads();
+#endif
   if constexpr (partition_kv) {
     write_o_reg_gmem_shift_smooth_quant<GROUP_SIZE,
                                         num_frags_x,
-                                        num_frags_y,
+                                        num_frags_y_v,
                                         partition_kv>(
         o_frag,
         &qo_smem,
@@ -328,11 +445,11 @@ __global__ void multi_query_append_attention_kernel(
         in_scale,
         q_len,
         partition_kv ? q_n_stride * num_chunks : q_n_stride,
-        HEAD_DIM);
+        HEAD_DIM_V);
   } else {
     write_o_reg_gmem_shift_smooth_quant<GROUP_SIZE,
                                         num_frags_x,
-                                        num_frags_y,
+                                        num_frags_y_v,
                                         partition_kv>(
         o_frag,
         &qo_smem,
@@ -346,7 +463,7 @@ __global__ void multi_query_append_attention_kernel(
         in_scale,
         q_len,
         partition_kv ? q_n_stride * num_chunks : q_n_stride,
-        HEAD_DIM);
+        HEAD_DIM_V);
   }
 
 
@@ -387,15 +504,17 @@ template <typename T,
           uint32_t NUM_WARPS,
           uint32_t NUM_WARP_Q,
           uint32_t NUM_WARP_KV,
-          uint32_t HEAD_DIM,
+          uint32_t HEAD_DIM_QK,
+          uint32_t HEAD_DIM_V,
           uint32_t BLOCK_SIZE,
           uint32_t num_frags_x,
           uint32_t num_frags_z,
-          uint32_t num_frags_y,
+          uint32_t num_frags_y_qk,
+          uint32_t num_frags_y_v,
           typename OutT = T,
           bool ENABLE_PREFILL = true>
 __global__ void multi_query_append_attention_warp1_4_kernel(
-    T *__restrict__ q,        // [token_num, (num_heads + 2* kv_num_head) * head_dim]
+    T *__restrict__ q,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
     T *__restrict__ cache_k,  // [max_block_num, num_heads, block_size,
                               // head_dim]
     T *__restrict__ cache_v,
@@ -410,7 +529,7 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
     const int max_seq_len,
     const int max_dec_len,
     const int max_block_num_per_seq,
-    const float scale,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -421,7 +540,9 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
     float *__restrict__ tmp_d,      // [token_num, num_chunks, num_heads]
     OutT *__restrict__ out,
     const int speculate_max_draft_token_num = 5) {
-  constexpr uint32_t num_vecs_per_head = HEAD_DIM / num_elems_per_128b<T>();
+  constexpr uint32_t num_vecs_per_head_qk =
+      HEAD_DIM_QK / num_elems_per_128b<T>();
+  constexpr uint32_t num_vecs_per_head_v = HEAD_DIM_V / num_elems_per_128b<T>();
   static_assert(NUM_WARP_Q == 1, "NUM_WARP_Q must be 1");
   static_assert(NUM_WARP_KV == 4, "NUM_WARP_KV must be 4");
   const uint32_t btid = blockIdx.x, kv_head_idx = blockIdx.z;
@@ -467,24 +588,29 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
 
   extern __shared__ uint8_t smem[];
   float s_frag[num_frags_x][num_frags_z][8];
-  float o_frag[num_frags_x][num_frags_y][8];
+  float o_frag[num_frags_x][num_frags_y_v][8];
   float m_frag[num_frags_x][2];
   float d_frag[num_frags_x][2];
-  init_states<T, num_frags_x, num_frags_y>(o_frag, m_frag, d_frag);
-
-  const uint32_t q_n_stride = q_num_heads * HEAD_DIM;
-  const uint32_t q_ori_n_stride = (q_num_heads + kv_num_heads * 2) * HEAD_DIM;
-  const uint32_t kv_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM;
-  const uint32_t kv_h_stride = BLOCK_SIZE * HEAD_DIM;
-  const uint32_t kv_b_stride = HEAD_DIM;
+  init_states<T, num_frags_x, num_frags_y_v>(o_frag, m_frag, d_frag);
+
+  const uint32_t q_n_stride = q_num_heads * HEAD_DIM_V;
+  const uint32_t q_ori_n_stride = q_num_heads * HEAD_DIM_QK +
+                                  kv_num_heads * HEAD_DIM_QK +
+                                  kv_num_heads * HEAD_DIM_V;
+  const uint32_t k_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_QK;
+  const uint32_t k_h_stride = BLOCK_SIZE * HEAD_DIM_QK;
+  const uint32_t k_b_stride = HEAD_DIM_QK;
+  const uint32_t v_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_V;
+  const uint32_t v_h_stride = BLOCK_SIZE * HEAD_DIM_V;
+  const uint32_t v_b_stride = HEAD_DIM_V;
   const uint32_t q_start_seq_id =
       batch_id * max_seq_len - __ldg(&cum_offsets[batch_id]);
   const uint32_t q_base_seq_id_this_block = tile_id * num_frags_x * 16;
   const uint32_t q_offset = q_start_seq_id * q_ori_n_stride +
-                            q_head_idx * HEAD_DIM +
+                            q_head_idx * HEAD_DIM_QK +
                             tid % 8 * num_elems_per_128b<T>();
   const uint32_t o_offset = q_start_seq_id * q_n_stride +
-                            q_head_idx * HEAD_DIM +
+                            q_head_idx * HEAD_DIM_V +
                             tid % 8 * num_elems_per_128b<T>();
   T *q_base_ptr = q + q_offset;
   T *o_base_ptr_T = nullptr;
@@ -494,41 +620,59 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
   } else {
     if (ENABLE_PREFILL) {
       o_base_ptr_T = tmp_workspace + batch_id * num_chunks * q_n_stride +
-                     chunk_idx * q_n_stride + q_head_idx * HEAD_DIM +
+                     chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V +
                      tid % 8 * num_elems_per_128b<T>();
     } else {
       o_base_ptr_T =
           tmp_workspace +
           batch_id * speculate_max_draft_token_num * num_chunks * q_n_stride +
-          chunk_idx * q_n_stride + q_head_idx * HEAD_DIM +
+          chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V +
           tid % 8 * num_elems_per_128b<T>();
     }
   }
 
   smem_t qo_smem(smem);
 
-  uint32_t q_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head>(
+  uint32_t q_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head_qk>(
       tid % 16, tid / 16);  // 16 * 16
   load_q_global_smem_multi_warps<GROUP_SIZE,
                                  num_frags_x,
-                                 num_frags_y,
-                                 HEAD_DIM,
+                                 num_frags_y_qk,
+                                 HEAD_DIM_QK,
                                  T>(q_base_ptr,
                                     &qo_smem,
                                     q_base_seq_id_this_block,
                                     q_end,
                                     q_ori_n_stride,
-                                    HEAD_DIM);
+                                    HEAD_DIM_QK);
   commit_group();
   wait_group<0>();
   __syncthreads();
+#ifdef DEBUG_PERCISION_DEC
+  if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) {
+    printf("q_smem(%d * 192个bfloat16):\n", num_frags_x * 16);
+    // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * sizeof(CacheT);
+    T *q_smem_t = reinterpret_cast<T *>(qo_smem.base);
+    for (uint32_t i = 0; i < 4 * num_frags_x * 16; ++i) {
+      printf("q_smem[%d]:", (int)i);
+      for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) {
+        printf("[");
+        for (uint32_t k = 0; k < 8; ++k) {
+          printf("%.2f ", (float)q_smem_t[i * HEAD_DIM_QK + j * 8 + k]);
+        }
+        printf("]");
+      }
+      printf("\n");
+    }
+  }
+  __syncthreads();
+#endif
+  q_smem_inplace_multiply_sm_scale_multi_warps<num_frags_x, num_frags_y_qk, T>(
+      &qo_smem, softmax_scale);
 
-  q_smem_inplace_multiply_sm_scale_multi_warps<num_frags_x, num_frags_y, T>(
-      &qo_smem, scale);
-
-  smem_t k_smem(smem + num_frags_x * 16 * HEAD_DIM * sizeof(T)),
-      v_smem(smem + (num_frags_x + NUM_WARP_KV * num_frags_z) * 16 * HEAD_DIM *
-                        sizeof(T));
+  smem_t k_smem(smem + num_frags_x * 16 * HEAD_DIM_QK * sizeof(T)),
+      v_smem(smem + (num_frags_x + NUM_WARP_KV * num_frags_z) * 16 *
+                        HEAD_DIM_QK * sizeof(T));
 
   const uint32_t num_iterations = div_up(
       CAUSAL
@@ -548,34 +692,39 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
               : chunk_len) /
       (NUM_WARP_KV * num_frags_z * 16);
 
-  uint32_t k_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head>(
+  uint32_t k_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head_qk>(
       wid * num_frags_z * 16 + 8 * (tid / 16) + tid % 8, (tid % 16) / 8);
 
-  uint32_t v_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head>(
+  uint32_t v_smem_offset_r = smem_t::get_permuted_offset<num_vecs_per_head_v>(
       wid * num_frags_z * 16 + tid % 16, tid / 16);
-  uint32_t kv_smem_offset_w = smem_t::get_permuted_offset<num_vecs_per_head>(
+  uint32_t k_smem_offset_w = smem_t::get_permuted_offset<num_vecs_per_head_qk>(
+      wid * 4 + tid / 8, tid % 8);
+  uint32_t v_smem_offset_w = smem_t::get_permuted_offset<num_vecs_per_head_v>(
       wid * 4 + tid / 8, tid % 8);
 
   uint32_t kv_idx_base = chunk_start;
   int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]);
-  const uint32_t const_offset = kv_head_idx * kv_h_stride +
-                                (wid * 4 + tid / 8) * kv_b_stride +
-                                tid % 8 * num_elems_per_128b<T>();
-  T *cache_k_now = cache_k + block_id * kv_n_stride + const_offset;
-  T *cache_v_now = cache_v + block_id * kv_n_stride + const_offset;
+  const uint32_t const_offset_k = kv_head_idx * k_h_stride +
+                                  (wid * 4 + tid / 8) * k_b_stride +
+                                  tid % 8 * num_elems_per_128b<T>();
+  const uint32_t const_offset_v = kv_head_idx * v_h_stride +
+                                  (wid * 4 + tid / 8) * v_b_stride +
+                                  tid % 8 * num_elems_per_128b<T>();
+  T *cache_k_now = cache_k + block_id * k_n_stride + const_offset_k;
+  T *cache_v_now = cache_v + block_id * v_n_stride + const_offset_v;
 
   produce_kv_blockwise<SharedMemFillMode::kNoFill,
                        NUM_WARPS,
                        BLOCK_SIZE,
-                       num_frags_y,
+                       num_frags_y_qk,
                        num_frags_z,
                        NUM_WARP_Q>(k_smem,
-                                   &kv_smem_offset_w,
+                                   &k_smem_offset_w,
                                    &cache_k_now,
                                    kv_head_idx,
-                                   kv_n_stride,
-                                   kv_h_stride,
-                                   kv_b_stride,
+                                   k_n_stride,
+                                   k_h_stride,
+                                   k_b_stride,
                                    kv_idx_base,
                                    chunk_end);
   commit_group();
@@ -583,15 +732,15 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
   produce_kv_blockwise<SharedMemFillMode::kFillZero,
                        NUM_WARPS,
                        BLOCK_SIZE,
-                       num_frags_y,
+                       num_frags_y_v,
                        num_frags_z,
                        NUM_WARP_Q>(v_smem,
-                                   &kv_smem_offset_w,
+                                   &v_smem_offset_w,
                                    &cache_v_now,
                                    kv_head_idx,
-                                   kv_n_stride,
-                                   kv_h_stride,
-                                   kv_b_stride,
+                                   v_n_stride,
+                                   v_h_stride,
+                                   v_b_stride,
                                    kv_idx_base,
                                    chunk_end);
   commit_group();
@@ -600,10 +749,45 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
   for (uint32_t iter = 0; iter < num_iterations; ++iter) {
     wait_group<1>();
     __syncthreads();
-
+#ifdef DEBUG_PERCISION_DEC
+    if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) {
+      printf("k_smem(%d * 192个bfloat16):\n", 4 * num_frags_z * 16);
+      // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 *
+      // sizeof(CacheT);
+      T *k_smem_t = reinterpret_cast<T *>(k_smem.base);
+      for (uint32_t i = 0; i < num_frags_z * 16; ++i) {
+        printf("k_smem[%d]:", (int)i);
+        for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) {
+          printf("[");
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.2f ", (float)k_smem_t[i * HEAD_DIM_QK + j * 8 + k]);
+          }
+          printf("]");
+        }
+        printf("\n");
+      }
+    }
+    __syncthreads();
+#endif
     // s = qk
-    compute_qk<num_frags_x, num_frags_y, num_frags_z, T>(
+    compute_qk<num_frags_x, num_frags_y_qk, num_frags_z, T>(
         &qo_smem, &q_smem_offset_r, &k_smem, &k_smem_offset_r, s_frag);
+#ifdef DEBUG_PERCISION_DEC
+    __syncthreads();
+    if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 &&
+        blockIdx.x == 0) {
+      for (uint32_t i = 0; i < num_frags_x; ++i) {
+        for (uint32_t j = 0; j < num_frags_z; ++j) {
+          printf("s_frag[%d][%d]:\n", i, j);
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.4f ", s_frag[i][j][k]);
+          }
+          printf("\n");
+        }
+      }
+    }
+    __syncthreads();
+#endif
     // mask according to kv_idx and q_idx
     if (iter >= mask_check_iteration) {
       mask_s<T,
@@ -612,7 +796,7 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
              GROUP_SIZE,
              NUM_WARPS,
              num_frags_x,
-             num_frags_y,
+             num_frags_y_v,
              num_frags_z>(q_base_seq_id_this_block,
                           kv_idx_base + wid * num_frags_z * 16,
                           q_len,
@@ -622,7 +806,7 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
     }
 
     // update m,d
-    update_mdo_states<num_frags_x, num_frags_y, num_frags_z>(
+    update_mdo_states<num_frags_x, num_frags_y_v, num_frags_z>(
         s_frag, o_frag, m_frag, d_frag);
     __syncthreads();
 
@@ -631,43 +815,77 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
     if (block_id < 0) {
       block_id = 0;
     }
-    cache_k_now = cache_k + block_id * kv_n_stride + const_offset;
+    cache_k_now = cache_k + block_id * k_n_stride + const_offset_k;
     produce_kv_blockwise<SharedMemFillMode::kNoFill,
                          NUM_WARPS,
                          BLOCK_SIZE,
-                         num_frags_y,
+                         num_frags_y_qk,
                          num_frags_z,
                          NUM_WARP_Q>(k_smem,
-                                     &kv_smem_offset_w,
+                                     &k_smem_offset_w,
                                      &cache_k_now,
                                      kv_head_idx,
-                                     kv_n_stride,
-                                     kv_h_stride,
-                                     kv_b_stride,
+                                     k_n_stride,
+                                     k_h_stride,
+                                     k_b_stride,
                                      kv_idx_base,
                                      chunk_end);
     commit_group();
     wait_group<1>();
     __syncthreads();
-
+#ifdef DEBUG_PERCISION_DEC
+    if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) {
+      printf("v_smem(%d * 128个bfloat16):\n", 4 * num_frags_z * 16);
+      // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 *
+      // sizeof(CacheT);
+      T *v_smem_t = reinterpret_cast<T *>(v_smem.base);
+      for (uint32_t i = 0; i < num_frags_z * 16; ++i) {
+        printf("v_smem[%d]:", (int)i);
+        for (uint32_t j = 0; j < HEAD_DIM_V / 8; ++j) {
+          printf("[");
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.2f ", (float)v_smem_t[i * HEAD_DIM_V + j * 8 + k]);
+          }
+          printf("]");
+        }
+        printf("\n");
+      }
+    }
+    __syncthreads();
+#endif
     // compute sfm*v
-    compute_sfm_v<num_frags_x, num_frags_y, num_frags_z, T>(
+    compute_sfm_v<num_frags_x, num_frags_y_v, num_frags_z, T>(
         &v_smem, &v_smem_offset_r, s_frag, o_frag, d_frag);
     __syncthreads();
-
-    cache_v_now = cache_v + block_id * kv_n_stride + const_offset;
+#ifdef DEBUG_PERCISION_DEC
+    __syncthreads();
+    if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 &&
+        blockIdx.x == 0) {
+      for (uint32_t i = 0; i < num_frags_x; ++i) {
+        for (uint32_t j = 0; j < num_frags_y_v; ++j) {
+          printf("o_frag[%d][%d]:\n", i, j);
+          for (uint32_t k = 0; k < 8; ++k) {
+            printf("%.4f ", s_frag[i][j][k]);
+          }
+          printf("\n");
+        }
+      }
+    }
+    __syncthreads();
+#endif
+    cache_v_now = cache_v + block_id * v_n_stride + const_offset_v;
     produce_kv_blockwise<SharedMemFillMode::kFillZero,
                          NUM_WARPS,
                          BLOCK_SIZE,
-                         num_frags_y,
+                         num_frags_y_v,
                          num_frags_z,
                          NUM_WARP_Q>(v_smem,
-                                     &kv_smem_offset_w,
+                                     &v_smem_offset_w,
                                      &cache_v_now,
                                      kv_head_idx,
-                                     kv_n_stride,
-                                     kv_h_stride,
-                                     kv_b_stride,
+                                     v_n_stride,
+                                     v_h_stride,
+                                     v_b_stride,
                                      kv_idx_base,
                                      chunk_end);
     commit_group();
@@ -675,19 +893,34 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
   wait_group<0>();
   __syncthreads();
 
-  merge_block_res_v2<num_frags_x, num_frags_y, T>(
+  merge_block_res_v2<num_frags_x, num_frags_y_v, T>(
       o_frag, reinterpret_cast<float *>(smem), m_frag, d_frag, wid, tid);
 
   if (num_chunks_this_seq <= 1) {
-    normalize_d<num_frags_x, num_frags_y>(o_frag, d_frag);
+    normalize_d<num_frags_x, num_frags_y_v>(o_frag, d_frag);
   }
-
+#ifdef DEBUG_PERCISION_DEC
+  __syncthreads();
+  if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 &&
+      blockIdx.x == 0) {
+    for (uint32_t i = 0; i < num_frags_x; ++i) {
+      for (uint32_t j = 0; j < num_frags_y_v; ++j) {
+        printf("o_frag[%d][%d]:\n", i, j);
+        for (uint32_t k = 0; k < 8; ++k) {
+          printf("%.4f ", s_frag[i][j][k]);
+        }
+        printf("\n");
+      }
+    }
+  }
+  __syncthreads();
+#endif
   // write o
   // [num_frags_x, 16, num_frags_y, 16]
   if (num_chunks_this_seq <= 1) {
     write_o_reg_gmem_multi_warps_shift_smooth_quant<GROUP_SIZE,
                                                     num_frags_x,
-                                                    num_frags_y,
+                                                    num_frags_y_v,
                                                     false>(
         o_frag,
         &qo_smem,
@@ -701,11 +934,11 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
         in_scale,
         q_len,
         q_n_stride,
-        HEAD_DIM);
+        HEAD_DIM_V);
   } else {
     write_o_reg_gmem_multi_warps_shift_smooth_quant<GROUP_SIZE,
                                                     num_frags_x,
-                                                    num_frags_y,
+                                                    num_frags_y_v,
                                                     partition_kv>(
         o_frag,
         &qo_smem,
@@ -719,7 +952,7 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
         in_scale,
         q_len,
         q_n_stride * num_chunks,
-        HEAD_DIM);
+        HEAD_DIM_V);
   }
 
   if (num_chunks_this_seq > 1) {
@@ -757,7 +990,8 @@ __global__ void multi_query_append_attention_warp1_4_kernel(
 
 template <typename T,
           uint32_t GROUP_SIZE,
-          uint32_t HEAD_DIM,
+          uint32_t HEAD_DIM_QK,
+          uint32_t HEAD_DIM_V,
           uint32_t BLOCK_SIZE,
           bool CAUSAL,
           uint32_t BLOCK_SHAPE_Q,
@@ -783,6 +1017,7 @@ void MultiQueryAppendAttention(
     const int num_blocks_x_cpu,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -802,18 +1037,18 @@ void MultiQueryAppendAttention(
   constexpr uint32_t num_warps = 4;
   constexpr uint32_t NUM_WARP_KV = num_warps / NUM_WARP_Q;
   constexpr uint32_t num_frags_x = BLOCK_SHAPE_Q / (16 * NUM_WARP_Q);  // 1 or 2
-  constexpr uint32_t num_frags_y = HEAD_DIM / 16;
+  constexpr uint32_t num_frags_y_qk = HEAD_DIM_QK / 16;
+  constexpr uint32_t num_frags_y_v = HEAD_DIM_V / 16;
   constexpr uint32_t num_qrow_per_block = NUM_WARP_Q * num_frags_x * 16;
 
   auto *allocator = paddle::GetAllocator(qkv.place());
 
-  const float scale = 1.f / sqrt(HEAD_DIM);
-
   if constexpr (NUM_WARP_Q == 4) {
     constexpr uint32_t num_frags_z = BLOCK_SIZE / 16;
     constexpr uint32_t smem_size =
-        (num_warps * num_frags_x + NUM_WARP_KV * num_frags_z * 2) * 16 *
-        HEAD_DIM * sizeof(T);
+        ((num_warps * num_frags_x + NUM_WARP_KV * num_frags_z) * HEAD_DIM_QK +
+         NUM_WARP_KV * num_frags_z * HEAD_DIM_V) *
+        16 * sizeof(T);
     auto split_kv_kernel = multi_query_append_attention_kernel<NV_TYPE,
                                                                true,
                                                                GROUP_SIZE,
@@ -821,11 +1056,13 @@ void MultiQueryAppendAttention(
                                                                num_warps,
                                                                NUM_WARP_Q,
                                                                NUM_WARP_KV,
-                                                               HEAD_DIM,
+                                                               HEAD_DIM_QK,
+                                                               HEAD_DIM_V,
                                                                BLOCK_SIZE,
                                                                num_frags_x,
                                                                num_frags_z,
-                                                               num_frags_y,
+                                                               num_frags_y_qk,
+                                                               num_frags_y_v,
                                                                OUT_NV_TYPE,
                                                                ENABLE_PREFILL>;
     if (smem_size >= 48 * 1024) {
@@ -853,11 +1090,13 @@ void MultiQueryAppendAttention(
                                               num_warps,
                                               NUM_WARP_Q,
                                               NUM_WARP_KV,
-                                              HEAD_DIM,
+                                              HEAD_DIM_QK,
+                                              HEAD_DIM_V,
                                               BLOCK_SIZE,
                                               num_frags_x,
                                               num_frags_z,
-                                              num_frags_y,
+                                              num_frags_y_qk,
+                                              num_frags_y_v,
                                               OUT_NV_TYPE,
                                               ENABLE_PREFILL>;
       if (smem_size >= 48 * 1024) {
@@ -885,7 +1124,7 @@ void MultiQueryAppendAttention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -899,9 +1138,10 @@ void MultiQueryAppendAttention(
     } else {
       phi::Allocator::AllocationPtr tmp_workspace, tmp_m, tmp_d;
       if (ENABLE_PREFILL) {
-        tmp_workspace = allocator->Allocate(
-            phi::SizeOf(qkv.dtype()) *
-            static_cast<size_t>(token_num * num_chunks * num_heads * HEAD_DIM));
+        tmp_workspace =
+            allocator->Allocate(phi::SizeOf(qkv.dtype()) *
+                                static_cast<size_t>(token_num * num_chunks *
+                                                    num_heads * HEAD_DIM_V));
         tmp_m = allocator->Allocate(
             phi::SizeOf(paddle::DataType::FLOAT32) *
             static_cast<size_t>(token_num * num_chunks * num_heads));
@@ -912,7 +1152,7 @@ void MultiQueryAppendAttention(
         tmp_workspace = allocator->Allocate(
             phi::SizeOf(qkv.dtype()) *
             static_cast<size_t>(speculate_max_draft_token_num * bsz *
-                                num_chunks * num_heads * HEAD_DIM));
+                                num_chunks * num_heads * HEAD_DIM_V));
         tmp_m = allocator->Allocate(
             phi::SizeOf(paddle::DataType::FLOAT32) *
             static_cast<size_t>(speculate_max_draft_token_num * bsz *
@@ -942,7 +1182,7 @@ void MultiQueryAppendAttention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -955,14 +1195,14 @@ void MultiQueryAppendAttention(
       // merge
       constexpr int vec_size = num_elems_per_128b<NV_TYPE>();
       if (is_decoder) {
-        constexpr int blockx = HEAD_DIM / vec_size;
+        constexpr int blockx = HEAD_DIM_V / vec_size;
         constexpr int blocky = (128 + blockx - 1) / blockx;
         dim3 grids_merge(bsz, num_heads);
         dim3 blocks_merge(blockx, blocky);
         merge_multi_chunks_decoder_kernel<NV_TYPE,
                                           vec_size,
                                           blocky,
-                                          HEAD_DIM,
+                                          HEAD_DIM_V,
                                           OUT_NV_TYPE,
                                           ENABLE_PREFILL>
             <<<grids_merge, blocks_merge, 0, stream>>>(
@@ -987,9 +1227,9 @@ void MultiQueryAppendAttention(
                 num_chunks,
                 num_heads,
                 chunk_size,
-                HEAD_DIM);
+                HEAD_DIM_V);
       } else {
-        constexpr int blockx = HEAD_DIM / vec_size;
+        constexpr int blockx = HEAD_DIM_V / vec_size;
         constexpr int blocky = (128 + blockx - 1) / blockx;
         dim3 grids_merge(min(sm_count * 4, token_num),
                          num_heads);  // 128k is too large
@@ -997,7 +1237,7 @@ void MultiQueryAppendAttention(
         merge_multi_chunks_v2_kernel<NV_TYPE,
                                      vec_size,
                                      blocky,
-                                     HEAD_DIM,
+                                     HEAD_DIM_V,
                                      OUT_NV_TYPE,
                                      ENABLE_PREFILL>
             <<<grids_merge, blocks_merge, 0, stream>>>(
@@ -1022,7 +1262,7 @@ void MultiQueryAppendAttention(
                 num_chunks,
                 num_heads,
                 chunk_size,
-                HEAD_DIM,
+                HEAD_DIM_V,
                 token_num,
                 speculate_max_draft_token_num);
       }
@@ -1030,8 +1270,9 @@ void MultiQueryAppendAttention(
   } else {
     constexpr uint32_t num_frags_z = BLOCK_SIZE / 16 / NUM_WARP_KV;
     constexpr uint32_t smem_size =
-        (num_frags_x + NUM_WARP_KV * num_frags_z * 2) * 16 * HEAD_DIM *
-        sizeof(T);
+        ((num_frags_x + NUM_WARP_KV * num_frags_z) * HEAD_DIM_QK +
+         NUM_WARP_KV * num_frags_z * HEAD_DIM_V) *
+        16 * sizeof(T);
     auto split_kv_kernel =
         multi_query_append_attention_warp1_4_kernel<NV_TYPE,
                                                     true,
@@ -1040,11 +1281,13 @@ void MultiQueryAppendAttention(
                                                     num_warps,
                                                     NUM_WARP_Q,
                                                     NUM_WARP_KV,
-                                                    HEAD_DIM,
+                                                    HEAD_DIM_QK,
+                                                    HEAD_DIM_V,
                                                     BLOCK_SIZE,
                                                     num_frags_x,
                                                     num_frags_z,
-                                                    num_frags_y,
+                                                    num_frags_y_qk,
+                                                    num_frags_y_v,
                                                     OUT_NV_TYPE,
                                                     ENABLE_PREFILL>;
     if (smem_size >= 48 * 1024) {
@@ -1074,11 +1317,13 @@ void MultiQueryAppendAttention(
                                                       num_warps,
                                                       NUM_WARP_Q,
                                                       NUM_WARP_KV,
-                                                      HEAD_DIM,
+                                                      HEAD_DIM_QK,
+                                                      HEAD_DIM_V,
                                                       BLOCK_SIZE,
                                                       num_frags_x,
                                                       num_frags_z,
-                                                      num_frags_y,
+                                                      num_frags_y_qk,
+                                                      num_frags_y_v,
                                                       OUT_NV_TYPE,
                                                       ENABLE_PREFILL>;
       if (smem_size >= 48 * 1024) {
@@ -1106,7 +1351,7 @@ void MultiQueryAppendAttention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1121,7 +1366,7 @@ void MultiQueryAppendAttention(
       if (is_decoder) {
         tmp_workspace = allocator->Allocate(
             phi::SizeOf(qkv.dtype()) *
-            static_cast<size_t>(bsz * num_chunks * num_heads * HEAD_DIM));
+            static_cast<size_t>(bsz * num_chunks * num_heads * HEAD_DIM_V));
         tmp_m = allocator->Allocate(
             phi::SizeOf(paddle::DataType::FLOAT32) *
             static_cast<size_t>(bsz * num_chunks * num_heads));
@@ -1133,7 +1378,7 @@ void MultiQueryAppendAttention(
           tmp_workspace =
               allocator->Allocate(phi::SizeOf(qkv.dtype()) *
                                   static_cast<size_t>(token_num * num_chunks *
-                                                      num_heads * HEAD_DIM));
+                                                      num_heads * HEAD_DIM_V));
           tmp_m = allocator->Allocate(
               phi::SizeOf(paddle::DataType::FLOAT32) *
               static_cast<size_t>(token_num * num_chunks * num_heads));
@@ -1144,7 +1389,7 @@ void MultiQueryAppendAttention(
           tmp_workspace = allocator->Allocate(
               phi::SizeOf(qkv.dtype()) *
               static_cast<size_t>(speculate_max_draft_token_num * bsz *
-                                  num_chunks * num_heads * HEAD_DIM));
+                                  num_chunks * num_heads * HEAD_DIM_V));
           tmp_m = allocator->Allocate(
               phi::SizeOf(paddle::DataType::FLOAT32) *
               static_cast<size_t>(speculate_max_draft_token_num * bsz *
@@ -1174,7 +1419,7 @@ void MultiQueryAppendAttention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1188,14 +1433,14 @@ void MultiQueryAppendAttention(
       // merge
       constexpr int vec_size = num_elems_per_128b<NV_TYPE>();
       if (is_decoder) {
-        constexpr int blockx = HEAD_DIM / vec_size;
+        constexpr int blockx = HEAD_DIM_V / vec_size;
         constexpr int blocky = (128 + blockx - 1) / blockx;
         dim3 grids_merge(bsz, num_heads);
         dim3 blocks_merge(blockx, blocky);
         merge_multi_chunks_decoder_kernel<NV_TYPE,
                                           vec_size,
                                           blocky,
-                                          HEAD_DIM,
+                                          HEAD_DIM_V,
                                           OUT_NV_TYPE,
                                           ENABLE_PREFILL>
             <<<grids_merge, blocks_merge, 0, stream>>>(
@@ -1220,17 +1465,16 @@ void MultiQueryAppendAttention(
                 num_chunks,
                 num_heads,
                 chunk_size,
-                HEAD_DIM);
+                HEAD_DIM_V);
       } else {
-        constexpr int blockx = HEAD_DIM / vec_size;
+        constexpr int blockx = HEAD_DIM_V / vec_size;
         constexpr int blocky = (128 + blockx - 1) / blockx;
-        dim3 grids_merge(min(sm_count * 4, token_num),
-                         num_heads);
+        dim3 grids_merge(min(sm_count * 4, token_num), num_heads);
         dim3 blocks_merge(blockx, blocky);
         merge_multi_chunks_v2_kernel<NV_TYPE,
                                      vec_size,
                                      blocky,
-                                     HEAD_DIM,
+                                     HEAD_DIM_V,
                                      OUT_NV_TYPE,
                                      ENABLE_PREFILL>
             <<<grids_merge, blocks_merge, 0, stream>>>(
@@ -1255,7 +1499,7 @@ void MultiQueryAppendAttention(
                 num_chunks,
                 num_heads,
                 chunk_size,
-                HEAD_DIM,
+                HEAD_DIM_V,
                 token_num,
                 speculate_max_draft_token_num);
       }
@@ -1265,37 +1509,39 @@ void MultiQueryAppendAttention(
 
 template <typename T, typename OutT>
 void CascadeAppendAttentionC16Kernel(
-    const AppendAttnMetaData& meta_data,
-    const paddle::Tensor& qkv,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
-    const paddle::Tensor&
-        cache_k,  // [max_block_num, num_heads, block_size, head_dim]
-    const paddle::Tensor&
-        cache_v,  // [max_block_num, num_heads, head_dim, block_size]
-    const paddle::optional<paddle::Tensor>& attn_mask,
-    const paddle::optional<paddle::Tensor>&
-        cache_k_scale,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        cache_v_scale,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        cache_k_zp,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        cache_v_zp,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        shift_bias,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        smooth_weight,  // [num_kv_heads, head_dim]
-    const paddle::Tensor& seq_lens_q,
-    const paddle::Tensor& seq_lens_kv,
-    const paddle::Tensor& seq_lens_encoder,
-    const paddle::Tensor& padding_offsets,
-    const paddle::Tensor& cum_offsets,
-    const paddle::Tensor& block_table,
-    const paddle::Tensor& batch_ids,
-    const paddle::Tensor& tile_ids_per_batch,
+    const AppendAttnMetaData &meta_data,
+    const paddle::Tensor
+        &qkv,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
+    const paddle::Tensor
+        &cache_k,  // [max_block_num, num_heads, block_size, head_dim]
+    const paddle::Tensor
+        &cache_v,  // [max_block_num, num_heads, head_dim, block_size]
+    const paddle::optional<paddle::Tensor> &attn_mask,
+    const paddle::optional<paddle::Tensor>
+        &cache_k_scale,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &cache_v_scale,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &cache_k_zp,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &cache_v_zp,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &shift_bias,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &smooth_weight,  // [num_kv_heads, head_dim]
+    const paddle::Tensor &seq_lens_q,
+    const paddle::Tensor &seq_lens_kv,
+    const paddle::Tensor &seq_lens_encoder,
+    const paddle::Tensor &padding_offsets,
+    const paddle::Tensor &cum_offsets,
+    const paddle::Tensor &block_table,
+    const paddle::Tensor &batch_ids,
+    const paddle::Tensor &tile_ids_per_batch,
     const int num_blocks,
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -1303,14 +1549,15 @@ void CascadeAppendAttentionC16Kernel(
     const bool causal,
     const bool is_decoder,
     const bool enable_prefill,
-    cudaStream_t& stream,
-    paddle::Tensor* out) {
+    cudaStream_t &stream,
+    paddle::Tensor *out) {
   const auto token_num = meta_data.token_nums;
   const auto block_size = meta_data.block_size;
   const auto bsz = meta_data.batch_size;
   const auto num_heads = meta_data.q_num_heads;
   const auto group_size = meta_data.q_num_heads / meta_data.kv_num_heads;
-  const auto head_dim = meta_data.head_dims;
+  const auto head_dim_qk = meta_data.head_dims;
+  const auto head_dim_v = meta_data.head_dims_v;
 
   DISPATCH_CAUSAL(
       causal,
@@ -1322,46 +1569,51 @@ void CascadeAppendAttentionC16Kernel(
               group_size,
               GROUP_SIZE,
               {DISPATCH_HEAD_DIM(
-                  head_dim,
-                  HEAD_DIM,
-                  {DISPATCH_BLOCK_SIZE(
-                      block_size,
-                      BLOCK_SIZE,
-                      {DISPATCH_BLOCKSHAPE_Q(
-                          block_shape_q, BLOCK_SHAPE_Q, NUM_WARP_Q, {
-                            MultiQueryAppendAttention<T,
-                                                      GROUP_SIZE,
-                                                      HEAD_DIM,
-                                                      BLOCK_SIZE,
-                                                      CAUSAL,
-                                                      BLOCK_SHAPE_Q,
-                                                      NUM_WARP_Q,
-                                                      OutT,
-                                                      ENABLE_PREFILL>(
-                                meta_data,
-                                qkv,
-                                cache_k,
-                                cache_v,
-                                attn_mask,
-                                shift_bias,
-                                smooth_weight,
-                                seq_lens_q,
-                                seq_lens_kv,
-                                seq_lens_encoder,
-                                padding_offsets,
-                                cum_offsets,
-                                block_table,
-                                batch_ids,
-                                tile_ids_per_batch,
-                                num_blocks,
-                                max_seq_len,
-                                max_dec_len,
-                                quant_max_bound,
-                                quant_min_bound,
-                                in_scale,
-                                speculate_max_draft_token_num,
-                                is_decoder,
-                                stream,
-                                out);
-                          })})})})})})
+                  head_dim_qk,
+                  HEAD_DIM_QK,
+                  {DISPATCH_HEAD_DIM(
+                      head_dim_v,
+                      HEAD_DIM_V,
+                      {DISPATCH_BLOCK_SIZE(
+                          block_size,
+                          BLOCK_SIZE,
+                          {DISPATCH_BLOCKSHAPE_Q(
+                              block_shape_q, BLOCK_SHAPE_Q, NUM_WARP_Q, {
+                                MultiQueryAppendAttention<T,
+                                                          GROUP_SIZE,
+                                                          HEAD_DIM_QK,
+                                                          HEAD_DIM_V,
+                                                          BLOCK_SIZE,
+                                                          CAUSAL,
+                                                          BLOCK_SHAPE_Q,
+                                                          NUM_WARP_Q,
+                                                          OutT,
+                                                          ENABLE_PREFILL>(
+                                    meta_data,
+                                    qkv,
+                                    cache_k,
+                                    cache_v,
+                                    attn_mask,
+                                    shift_bias,
+                                    smooth_weight,
+                                    seq_lens_q,
+                                    seq_lens_kv,
+                                    seq_lens_encoder,
+                                    padding_offsets,
+                                    cum_offsets,
+                                    block_table,
+                                    batch_ids,
+                                    tile_ids_per_batch,
+                                    num_blocks,
+                                    max_seq_len,
+                                    max_dec_len,
+                                    softmax_scale,
+                                    quant_max_bound,
+                                    quant_min_bound,
+                                    in_scale,
+                                    speculate_max_draft_token_num,
+                                    is_decoder,
+                                    stream,
+                                    out);
+                              })})})})})})})
 }
diff --git a/csrc/gpu/append_attn/append_attention_c4_impl.cuh b/csrc/gpu/append_attn/append_attention_c4_impl.cuh
index 7d49de3966e0..fac1baf6f4c2 100644
--- a/csrc/gpu/append_attn/append_attention_c4_impl.cuh
+++ b/csrc/gpu/append_attn/append_attention_c4_impl.cuh
@@ -51,7 +51,7 @@ __global__ void multi_query_append_attention_c4_kernel(
     const int max_seq_len,
     const int max_dec_len,
     const int max_block_num_per_seq,
-    const float scale,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -189,7 +189,7 @@ __global__ void multi_query_append_attention_c4_kernel(
   __syncthreads();
 
   q_smem_inplace_multiply_sm_scale<num_frags_x, num_frags_y, T>(&qo_smem,
-                                                                scale);
+                                                                softmax_scale);
 
   T cache_k_scale_frag[num_frags_y][4];
   T cache_k_zp_frag[num_frags_y][4];
@@ -509,7 +509,7 @@ __global__ void multi_query_append_attention_c4_warp1_4_kernel(
     const int max_seq_len,
     const int max_dec_len,
     const int max_block_num_per_seq,
-    const float scale,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -649,7 +649,7 @@ __global__ void multi_query_append_attention_c4_warp1_4_kernel(
   __syncthreads();
 
   q_smem_inplace_multiply_sm_scale_multi_warps<num_frags_x, num_frags_y, T>(
-      &qo_smem, scale);
+      &qo_smem, softmax_scale);
 
   T cache_k_scale_frag[num_frags_y][4];
   T cache_k_zp_frag[num_frags_y][4];
@@ -970,6 +970,7 @@ void MultiQueryAppendC4Attention(
     const int num_blocks_x_cpu,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -994,8 +995,6 @@ void MultiQueryAppendC4Attention(
 
   auto *allocator = paddle::GetAllocator(qkv.place());
 
-  const float scale = 1.f / sqrt(HEAD_DIM);
-
   if constexpr (NUM_WARP_Q == 4) {
     constexpr uint32_t num_frags_z = BLOCK_SIZE / 16;
     constexpr uint32_t smem_size =
@@ -1091,7 +1090,7 @@ void MultiQueryAppendC4Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1154,7 +1153,7 @@ void MultiQueryAppendC4Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1336,7 +1335,7 @@ void MultiQueryAppendC4Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1412,7 +1411,7 @@ void MultiQueryAppendC4Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1533,6 +1532,7 @@ void CascadeAppendAttentionC4Kernel(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -1597,6 +1597,7 @@ void CascadeAppendAttentionC4Kernel(
                                 num_blocks,
                                 max_seq_len,
                                 max_dec_len,
+                                softmax_scale,
                                 quant_max_bound,
                                 quant_min_bound,
                                 in_scale,
diff --git a/csrc/gpu/append_attn/append_attention_c8_impl.cuh b/csrc/gpu/append_attn/append_attention_c8_impl.cuh
index e0ede51a9c81..df2357bb192b 100644
--- a/csrc/gpu/append_attn/append_attention_c8_impl.cuh
+++ b/csrc/gpu/append_attn/append_attention_c8_impl.cuh
@@ -32,7 +32,7 @@ template <typename T,
           typename OutT = T,
           bool ENABLE_PREFILL = true>
 __global__ void multi_query_append_attention_c8_kernel(
-    T *__restrict__ q,             // [token_num, (num_heads + 2* kv_num_head) * head_dim]
+    T *__restrict__ q,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
     CacheT *__restrict__ cache_k,  // [max_block_num, num_heads, block_size,
                                    // head_dim]
     CacheT *__restrict__ cache_v,
@@ -49,7 +49,7 @@ __global__ void multi_query_append_attention_c8_kernel(
     const int max_seq_len,
     const int max_dec_len,
     const int max_block_num_per_seq,
-    const float scale,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -172,7 +172,7 @@ __global__ void multi_query_append_attention_c8_kernel(
   __syncthreads();
 
   q_smem_inplace_multiply_sm_scale<num_frags_x, num_frags_y, T>(&qo_smem,
-                                                                scale);
+                                                                softmax_scale);
   smem_t k_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM * sizeof(T)),
       v_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM * sizeof(T) +
              num_frags_z * 16 * HEAD_DIM * sizeof(CacheT));
@@ -206,8 +206,7 @@ __global__ void multi_query_append_attention_c8_kernel(
 
   uint32_t k_smem_offset_w =
       smem_t::get_permuted_offset<num_vecs_per_head_k, inv_k_stride>(
-          wid * 4 + tid / 8,
-          tid % 8);  
+          wid * 4 + tid / 8, tid % 8);
   uint32_t v_smem_offset_w =
       smem_t::get_permuted_offset<num_vecs_per_blocksize, inv_v_stride>(
           wid * 8 + tid / 4, tid % 4);  // 4 * 128 / 8 = 64
@@ -338,7 +337,6 @@ __global__ void multi_query_append_attention_c8_kernel(
                                        chunk_end,
                                        const_v_offset);
     commit_group();
-
   }
   wait_group<0>();
   __syncthreads();
@@ -434,7 +432,7 @@ template <typename T,
           typename OutT = T,
           bool ENABLE_PREFILL = true>
 __global__ void multi_query_append_attention_c8_warp1_4_kernel(
-    T *__restrict__ q,             // [token_num, (num_heads + 2* kv_num_head) * head_dim]
+    T *__restrict__ q,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
     CacheT *__restrict__ cache_k,  // [max_block_num, num_heads, block_size,
                                    // head_dim]
     CacheT *__restrict__ cache_v,
@@ -451,7 +449,7 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel(
     const int max_seq_len,
     const int max_dec_len,
     const int max_block_num_per_seq,
-    const float scale,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -575,7 +573,7 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel(
   __syncthreads();
 
   q_smem_inplace_multiply_sm_scale_multi_warps<num_frags_x, num_frags_y, T>(
-      &qo_smem, scale);
+      &qo_smem, softmax_scale);
 
   smem_t k_smem(smem + num_frags_x * 16 * HEAD_DIM * sizeof(T)),
       v_smem(smem + num_frags_x * 16 * HEAD_DIM * sizeof(T) +
@@ -610,12 +608,10 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel(
 
   uint32_t k_smem_offset_w =
       smem_t::get_permuted_offset<num_vecs_per_head_k, inv_k_stride>(
-          wid * 4 + tid / 8,
-          tid %
-              8);  
+          wid * 4 + tid / 8, tid % 8);
   uint32_t v_smem_offset_w =
       smem_t::get_permuted_offset<num_vecs_per_blocksize, inv_v_stride>(
-          wid * 8 + tid / 4, tid % 4);  
+          wid * 8 + tid / 4, tid % 4);
 
   uint32_t kv_idx_base = chunk_start;
   const uint32_t const_k_offset = kv_head_idx * kv_h_stride +
@@ -805,7 +801,6 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel(
           const uint32_t qo_head_idx = q_head_idx + qo_idx_now % GROUP_SIZE;
           const uint32_t qo_idx = q_start_seq_id + qo_idx_now / GROUP_SIZE;
           if (qo_idx - q_start_seq_id < q_len) {
-
             uint32_t offset;
             if (ENABLE_PREFILL) {
               offset = (batch_id * num_chunks + chunk_idx) * q_num_heads +
@@ -857,6 +852,7 @@ void MultiQueryAppendC8Attention(
     const int num_blocks_x_cpu,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -881,8 +877,6 @@ void MultiQueryAppendC8Attention(
 
   auto *allocator = paddle::GetAllocator(qkv.place());
 
-  const float scale = 1.f / sqrt(HEAD_DIM);
-
   if constexpr (NUM_WARP_Q == 4) {
     constexpr uint32_t num_frags_z = BLOCK_SIZE / 16;
     constexpr uint32_t smem_size =
@@ -963,7 +957,7 @@ void MultiQueryAppendC8Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1020,7 +1014,7 @@ void MultiQueryAppendC8Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1069,8 +1063,7 @@ void MultiQueryAppendC8Attention(
       } else {
         constexpr int blockx = HEAD_DIM / vec_size;
         constexpr int blocky = (128 + blockx - 1) / blockx;
-        dim3 grids_merge(min(sm_count * 4, token_num),
-                         num_heads);
+        dim3 grids_merge(min(sm_count * 4, token_num), num_heads);
         dim3 blocks_merge(blockx, blocky);
         merge_multi_chunks_v2_kernel<NV_TYPE,
                                      vec_size,
@@ -1186,7 +1179,7 @@ void MultiQueryAppendC8Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1256,7 +1249,7 @@ void MultiQueryAppendC8Attention(
           max_seq_len,
           max_dec_len,
           max_block_num_per_seq,
-          scale,
+          softmax_scale,
           quant_max_bound,
           quant_min_bound,
           in_scale,
@@ -1300,8 +1293,7 @@ void MultiQueryAppendC8Attention(
       } else {
         constexpr int blockx = HEAD_DIM / vec_size;
         constexpr int blocky = (128 + blockx - 1) / blockx;
-        dim3 grids_merge(min(sm_count * 4, token_num),
-                         num_heads);
+        dim3 grids_merge(min(sm_count * 4, token_num), num_heads);
         dim3 blocks_merge(blockx, blocky);
         merge_multi_chunks_v2_kernel<NV_TYPE,
                                      vec_size,
@@ -1341,37 +1333,39 @@ void MultiQueryAppendC8Attention(
 
 template <typename T, typename OutT>
 void CascadeAppendAttentionC8Kernel(
-    const AppendAttnMetaData& meta_data,
-    const paddle::Tensor& qkv,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
-    const paddle::Tensor&
-        cache_k,  // [max_block_num, num_heads, block_size, head_dim]
-    const paddle::Tensor&
-        cache_v,  // [max_block_num, num_heads, head_dim, block_size]
-    const paddle::optional<paddle::Tensor>& attn_mask,
-    const paddle::optional<paddle::Tensor>&
-        cache_k_scale,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        cache_v_scale,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        cache_k_zp,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        cache_v_zp,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        shift_bias,  // [num_kv_heads, head_dim]
-    const paddle::optional<paddle::Tensor>&
-        smooth_weight,  // [num_kv_heads, head_dim]
-    const paddle::Tensor& seq_lens_q,
-    const paddle::Tensor& seq_lens_kv,
-    const paddle::Tensor& seq_lens_encoder,
-    const paddle::Tensor& padding_offsets,
-    const paddle::Tensor& cum_offsets,
-    const paddle::Tensor& block_table,
-    const paddle::Tensor& batch_ids,
-    const paddle::Tensor& tile_ids_per_batch,
+    const AppendAttnMetaData &meta_data,
+    const paddle::Tensor
+        &qkv,  // [token_num, (num_heads + 2* kv_num_head) * head_dim]
+    const paddle::Tensor
+        &cache_k,  // [max_block_num, num_heads, block_size, head_dim]
+    const paddle::Tensor
+        &cache_v,  // [max_block_num, num_heads, head_dim, block_size]
+    const paddle::optional<paddle::Tensor> &attn_mask,
+    const paddle::optional<paddle::Tensor>
+        &cache_k_scale,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &cache_v_scale,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &cache_k_zp,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &cache_v_zp,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &shift_bias,  // [num_kv_heads, head_dim]
+    const paddle::optional<paddle::Tensor>
+        &smooth_weight,  // [num_kv_heads, head_dim]
+    const paddle::Tensor &seq_lens_q,
+    const paddle::Tensor &seq_lens_kv,
+    const paddle::Tensor &seq_lens_encoder,
+    const paddle::Tensor &padding_offsets,
+    const paddle::Tensor &cum_offsets,
+    const paddle::Tensor &block_table,
+    const paddle::Tensor &batch_ids,
+    const paddle::Tensor &tile_ids_per_batch,
     const int num_blocks,
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -1379,8 +1373,8 @@ void CascadeAppendAttentionC8Kernel(
     const bool causal,
     const bool is_decoder,
     const bool enable_prefill,
-    cudaStream_t& stream,
-    paddle::Tensor* out) {
+    cudaStream_t &stream,
+    paddle::Tensor *out) {
   const auto token_num = meta_data.token_nums;
   const auto block_size = meta_data.block_size;
   const auto bsz = meta_data.batch_size;
@@ -1434,6 +1428,7 @@ void CascadeAppendAttentionC8Kernel(
                                 num_blocks,
                                 max_seq_len,
                                 max_dec_len,
+                                softmax_scale,
                                 quant_max_bound,
                                 quant_min_bound,
                                 in_scale,
diff --git a/csrc/gpu/append_attn/append_attention_kernel.h b/csrc/gpu/append_attn/append_attention_kernel.h
index b0fabcf893d3..b34c2a044733 100644
--- a/csrc/gpu/append_attn/append_attention_kernel.h
+++ b/csrc/gpu/append_attn/append_attention_kernel.h
@@ -49,6 +49,7 @@ void CascadeAppendAttentionC16Kernel(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -92,6 +93,7 @@ void CascadeAppendAttentionC8Kernel(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -135,6 +137,7 @@ void CascadeAppendAttentionC4Kernel(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -179,6 +182,7 @@ void CascadeAppendAttentionKernel(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
@@ -212,6 +216,7 @@ void CascadeAppendAttentionKernel(
                                              block_shape_q,
                                              max_seq_len,
                                              max_dec_len,
+                                             softmax_scale,
                                              quant_max_bound,
                                              quant_min_bound,
                                              in_scale,
@@ -245,6 +250,7 @@ void CascadeAppendAttentionKernel(
                                             block_shape_q,
                                             max_seq_len,
                                             max_dec_len,
+                                            softmax_scale,
                                             quant_max_bound,
                                             quant_min_bound,
                                             in_scale,
@@ -278,6 +284,7 @@ void CascadeAppendAttentionKernel(
                                             block_shape_q,
                                             max_seq_len,
                                             max_dec_len,
+                                            softmax_scale,
                                             quant_max_bound,
                                             quant_min_bound,
                                             in_scale,
diff --git a/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh b/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh
index 1a8e73759022..5fbb53f05801 100644
--- a/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh
+++ b/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh
@@ -122,6 +122,91 @@ __global__ void append_decode_cache_T_rope_kernel(
   }
 }
 
+template <typename T, int VecSize = 1>
+__global__ void append_decode_cache_T_kernel(
+    const T* __restrict__ qkv,  // [bsz, num_heads + 2 * kv_num_heads,
+                                      // head_size]
+    T* __restrict__ key_cache,    // [num_blocks, kv_num_heads, block_size,
+                                  // head_size // 2]
+    T* __restrict__ value_cache,  // [num_blocks, kv_num_heads, block_size,
+                                  // head_size // 2]
+    const int* __restrict__ block_tables,     // [bsz, max_blocks_per_seq]
+    const int* __restrict__ padding_offsets,  // [num_tokens]
+    const int* __restrict__ cum_offsets,
+    const int* __restrict__ seq_lens,          // [bsz]
+    const int* __restrict__ seq_lens_encoder,  // [bsz]
+    const int max_seq_len,
+    const int max_blocks_per_seq,
+    const int num_heads,
+    const int head_size_qk,
+    const int head_size_v,
+    const int block_size,
+    const uint32_t elem_cnt,
+    const int kv_num_heads) {
+  using LoadT = AlignedVector<T, VecSize>;
+  using LoadBiasT = AlignedVector<T, VecSize>;
+  using LoadKVT = AlignedVector<T, VecSize>;
+  constexpr int HalfVecSize = VecSize / 2;
+  using LoadEmbT = AlignedVector<float, HalfVecSize>;
+  LoadT src_vec;
+  LoadBiasT out_vec;
+  LoadKVT cache_vec;
+
+  int64_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
+  // const int64_t hidden_size = (num_heads + 2 * kv_num_heads) * head_size;
+  const uint32_t hidden_size_q = num_heads * head_size_qk;
+  const uint32_t hidden_size_k = kv_num_heads * head_size_qk;
+  const uint32_t hidden_size_v = kv_num_heads * head_size_v;
+  const int64_t hidden_size = hidden_size_q + hidden_size_k + hidden_size_v;
+  const uint32_t offset = kv_num_heads * (head_size_qk + head_size_v);
+  // const int64_t offset = 2 * hidden_size;
+  // const int half_head_size = head_size / 2;
+  for (int32_t linear_index = global_thread_idx * VecSize,
+               step = gridDim.x * blockDim.x * VecSize;
+       linear_index < elem_cnt;
+       linear_index += step) {
+    const int ori_bi = linear_index / offset;
+    const int bias = linear_index % offset;
+    const int start_token_idx = ori_bi * max_seq_len - cum_offsets[ori_bi];
+    if (seq_lens_encoder[ori_bi] > 0) return;
+    const int write_seq_id = seq_lens[ori_bi];
+    
+    if (write_seq_id == 0) continue;
+
+    const int* block_table_now = nullptr;
+
+    block_table_now = block_tables + ori_bi * max_blocks_per_seq;
+    const int block_idx = block_table_now[write_seq_id / block_size];
+    const int block_offset = write_seq_id % block_size;
+
+    if (bias < hidden_size_k) {
+      const uint32_t qkv_bias = bias;
+      const uint32_t hi = qkv_bias / head_size_qk;
+      const uint32_t h_bias = qkv_bias % head_size_qk;
+      const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_qk +
+                             hi * block_size * head_size_qk +
+                             block_offset * head_size_qk + h_bias;
+      const uint32_t ori_idx =
+          start_token_idx * hidden_size +
+          hidden_size_q + qkv_bias;
+      Load<T, VecSize>(&qkv[ori_idx], &src_vec);
+      Store<T, VecSize>(src_vec, &key_cache[tgt_idx]);
+    } else {
+      const uint32_t qkv_bias = bias - hidden_size_k;
+      const uint32_t hi = qkv_bias / head_size_v;
+      const uint32_t h_bias = qkv_bias % head_size_v;
+      const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_v +
+                             hi * block_size * head_size_v +
+                             block_offset * head_size_v + h_bias;
+      const uint32_t ori_idx =
+          start_token_idx * hidden_size +
+          hidden_size_q + hidden_size_k + qkv_bias;
+      Load<T, VecSize>(&qkv[ori_idx], &src_vec);
+      Store<T, VecSize>(src_vec, &value_cache[tgt_idx]);
+    }
+  }
+}
+
 template <typename T, int VecSize = 1>
 __global__ void append_decode_cache_T_rope_kernel(
     const int* __restrict__ quant_qkv,  // [bsz, num_heads + 2 * kv_num_heads,
diff --git a/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu b/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu
index ee0cd57e307c..08483feb2a5c 100644
--- a/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu
+++ b/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu
@@ -15,6 +15,54 @@
 #include "decoder_write_cache_with_rope_kernel.h"
 #include "utils.cuh"
 
+
+template <typename T>
+void DecoderWriteCacheKV(const AppendAttnMetaData& meta_data,
+                         const paddle::Tensor& qkv,
+                         const paddle::Tensor& seq_lens,
+                         const paddle::Tensor& seq_lens_encoder,
+                         const paddle::Tensor& padding_offsets,
+                         const paddle::Tensor& cum_offsets,
+                         const paddle::Tensor& block_tables,
+                         const int max_seq_len,
+                         cudaStream_t& stream,
+                         paddle::Tensor* key_cache_out,
+                         paddle::Tensor* value_cache_out) {
+  auto max_blocks_per_seq = meta_data.max_blocks_per_seq;
+  auto bsz = meta_data.batch_size;
+  auto block_size = meta_data.block_size;
+  auto head_dim_qk = meta_data.head_dims;
+  auto head_dim_v = meta_data.head_dims_v;
+  auto num_heads = meta_data.q_num_heads;
+  auto kv_num_heads = meta_data.kv_num_heads;
+  const uint32_t elem_nums = bsz * kv_num_heads * (head_dim_qk + head_dim_v);
+
+  constexpr int PackSize = 16 / sizeof(T);
+  const int pack_num = elem_nums / PackSize;
+  const int blocksize = 128;
+  int grid_size = 1;
+  GetNumBlocks<128>(pack_num, &grid_size);
+
+  append_decode_cache_T_kernel<T, PackSize>
+      <<<grid_size, blocksize, 0, stream>>>(
+          reinterpret_cast<T*>(const_cast<T*>(qkv.data<T>())),
+          reinterpret_cast<T*>(key_cache_out->data<T>()),
+          reinterpret_cast<T*>(value_cache_out->data<T>()),
+          block_tables.data<int>(),
+          padding_offsets.data<int>(),
+          cum_offsets.data<int>(),
+          seq_lens.data<int>(),
+          seq_lens_encoder.data<int>(),
+          max_seq_len,
+          max_blocks_per_seq,
+          num_heads,
+          head_dim_qk,
+          head_dim_v,
+          block_size,
+          elem_nums,
+          kv_num_heads);
+}
+
 template <typename T, typename QKV_TYPE>
 void append_decode_cache_rope(const QKV_TYPE* qkv,
                               T* key_cache,
@@ -449,115 +497,125 @@ void DecoderWriteCacheWithRoPEKernel(
   auto num_heads = meta_data.q_num_heads;
   auto kv_num_heads = meta_data.kv_num_heads;
 
-  const float* cos_emb =
-      rotary_embs ? rotary_embs.get().data<float>() : nullptr;
-  const float* sin_emb;
   if (rotary_embs) {
-    sin_emb =
+    const float* cos_emb = rotary_embs.get().data<float>();
+    const float* sin_emb =
         use_neox_rotary_style
             ? rotary_embs.get().data<float>() + max_seq_len * dim_head
             : rotary_embs.get().data<float>() + max_seq_len * dim_head / 2;
-  }
-  if (cache_quant_type_str == "none") {
-    append_decode_cache_rope(
-        reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
-        reinterpret_cast<DataType_*>(key_cache_out->data<T>()),
-        reinterpret_cast<DataType_*>(value_cache_out->data<T>()),
-        reinterpret_cast<DataType_*>(qkv_out->data<T>()),
-        block_tables.data<int>(),
-        padding_offsets.data<int>(),
-        cum_offsets.data<int>(),
-        seq_lens.data<int>(),
-        seq_lens_encoder.data<int>(),
-        cos_emb,
-        sin_emb,
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(qkv_biases.get().data<T>()))
-                   : nullptr,
-        max_seq_len,
-        max_blocks_per_seq,
-        num_heads,
-        kv_num_heads,
-        dim_head,
-        block_size,
-        bsz,
-        stream,
-        use_neox_rotary_style);
-  } else if (cache_quant_type_str == "cache_int8") {
-    append_decode_cache_int8_rope(
-        reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
-        key_cache_out->data<uint8_t>(),
-        value_cache_out->data<uint8_t>(),
-        reinterpret_cast<DataType_*>(qkv_out->data<T>()),
-        block_tables.data<int>(),
-        padding_offsets.data<int>(),
-        cum_offsets.data<int>(),
-        seq_lens.data<int>(),
-        seq_lens_encoder.data<int>(),
-        cos_emb,
-        sin_emb,
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(qkv_biases.get().data<T>()))
-                   : nullptr,
-        cache_k_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_k_scale.get().data<T>()))
-                      : nullptr,
-        cache_v_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_v_scale.get().data<T>()))
-                      : nullptr,
-        max_seq_len,
-        max_blocks_per_seq,
-        num_heads,
-        kv_num_heads,
-        dim_head,
-        block_size,
-        bsz,
-        stream,
-        use_neox_rotary_style);
-  } else if (cache_quant_type_str == "cache_int4_zp") {
-    append_decode_cache_int4_rope(
-        reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
-        key_cache_out->data<uint8_t>(),
-        value_cache_out->data<uint8_t>(),
-        reinterpret_cast<DataType_*>(const_cast<T*>(qkv_out->data<T>())),
-        block_tables.data<int>(),
-        padding_offsets.data<int>(),
-        cum_offsets.data<int>(),
-        seq_lens.data<int>(),
-        seq_lens_encoder.data<int>(),
-        cos_emb,
-        sin_emb,
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(qkv_biases.get().data<T>()))
-                   : nullptr,
-        cache_k_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_k_scale.get().data<T>()))
-                      : nullptr,
-        cache_v_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_v_scale.get().data<T>()))
-                      : nullptr,
-        cache_k_zp ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(cache_k_zp.get().data<T>()))
-                   : nullptr,
-        cache_v_zp ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(cache_v_zp.get().data<T>()))
-                   : nullptr,
-        max_seq_len,
-        max_blocks_per_seq,
-        num_heads,
-        kv_num_heads,
-        dim_head,
-        block_size,
-        bsz,
-        stream,
-        use_neox_rotary_style);
+    if (cache_quant_type_str == "none") {
+      append_decode_cache_rope(
+          reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
+          reinterpret_cast<DataType_*>(key_cache_out->data<T>()),
+          reinterpret_cast<DataType_*>(value_cache_out->data<T>()),
+          reinterpret_cast<DataType_*>(qkv_out->data<T>()),
+          block_tables.data<int>(),
+          padding_offsets.data<int>(),
+          cum_offsets.data<int>(),
+          seq_lens.data<int>(),
+          seq_lens_encoder.data<int>(),
+          cos_emb,
+          sin_emb,
+          qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+          qkv_biases ? reinterpret_cast<DataType_*>(
+                           const_cast<T*>(qkv_biases.get().data<T>()))
+                     : nullptr,
+          max_seq_len,
+          max_blocks_per_seq,
+          num_heads,
+          kv_num_heads,
+          dim_head,
+          block_size,
+          bsz,
+          stream,
+          use_neox_rotary_style);
+    } else if (cache_quant_type_str == "cache_int8") {
+      append_decode_cache_int8_rope(
+          reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
+          key_cache_out->data<uint8_t>(),
+          value_cache_out->data<uint8_t>(),
+          reinterpret_cast<DataType_*>(qkv_out->data<T>()),
+          block_tables.data<int>(),
+          padding_offsets.data<int>(),
+          cum_offsets.data<int>(),
+          seq_lens.data<int>(),
+          seq_lens_encoder.data<int>(),
+          cos_emb,
+          sin_emb,
+          qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+          qkv_biases ? reinterpret_cast<DataType_*>(
+                           const_cast<T*>(qkv_biases.get().data<T>()))
+                     : nullptr,
+          cache_k_scale ? reinterpret_cast<DataType_*>(
+                              const_cast<T*>(cache_k_scale.get().data<T>()))
+                        : nullptr,
+          cache_v_scale ? reinterpret_cast<DataType_*>(
+                              const_cast<T*>(cache_v_scale.get().data<T>()))
+                        : nullptr,
+          max_seq_len,
+          max_blocks_per_seq,
+          num_heads,
+          kv_num_heads,
+          dim_head,
+          block_size,
+          bsz,
+          stream,
+          use_neox_rotary_style);
+    } else if (cache_quant_type_str == "cache_int4_zp") {
+      append_decode_cache_int4_rope(
+          reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
+          key_cache_out->data<uint8_t>(),
+          value_cache_out->data<uint8_t>(),
+          reinterpret_cast<DataType_*>(const_cast<T*>(qkv_out->data<T>())),
+          block_tables.data<int>(),
+          padding_offsets.data<int>(),
+          cum_offsets.data<int>(),
+          seq_lens.data<int>(),
+          seq_lens_encoder.data<int>(),
+          cos_emb,
+          sin_emb,
+          qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+          qkv_biases ? reinterpret_cast<DataType_*>(
+                           const_cast<T*>(qkv_biases.get().data<T>()))
+                     : nullptr,
+          cache_k_scale ? reinterpret_cast<DataType_*>(
+                              const_cast<T*>(cache_k_scale.get().data<T>()))
+                        : nullptr,
+          cache_v_scale ? reinterpret_cast<DataType_*>(
+                              const_cast<T*>(cache_v_scale.get().data<T>()))
+                        : nullptr,
+          cache_k_zp ? reinterpret_cast<DataType_*>(
+                           const_cast<T*>(cache_k_zp.get().data<T>()))
+                     : nullptr,
+          cache_v_zp ? reinterpret_cast<DataType_*>(
+                           const_cast<T*>(cache_v_zp.get().data<T>()))
+                     : nullptr,
+          max_seq_len,
+          max_blocks_per_seq,
+          num_heads,
+          kv_num_heads,
+          dim_head,
+          block_size,
+          bsz,
+          stream,
+          use_neox_rotary_style);
+    } else {
+      PD_THROW(
+          "cache_quant_type_str should be one of [none, cache_int8, "
+          "cache_int4_zp]");
+    }
   } else {
-    PD_THROW(
-        "cache_quant_type_str should be one of [none, cache_int8, "
-        "cache_int4_zp]");
+    DecoderWriteCacheKV<QKV_TYPE>(meta_data,
+                                  qkv,
+                                  seq_lens,
+                                  seq_lens_encoder,
+                                  padding_offsets,
+                                  cum_offsets,
+                                  block_tables,
+                                  max_seq_len,
+                                  stream,
+                                  key_cache_out,
+                                  value_cache_out);
   }
 }
 
diff --git a/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh b/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh
index eef8bcf2038d..c1dd09d4b3de 100644
--- a/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh
+++ b/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh
@@ -405,19 +405,18 @@ __global__ void GQAVariableLengthRotaryKernel(
 }
 
 template <typename T, int VecSize = 1>
-__global__ void GQAVariableLengthRotaryKernel(
-    const T *qkv,
-    const float *cos_emb,
-    const float *sin_emb,
-    const int *padding_offsets,
-    const int *seq_lens,
-    const int *seq_lens_decoder,
-    T *qkv_out,
-    const int64_t elem_cnt,
-    const int q_num_head,
-    const int kv_num_head,
-    const int seq_len,
-    const int last_dim) {
+__global__ void GQAVariableLengthRotaryKernel(const T *qkv,
+                                              const float *cos_emb,
+                                              const float *sin_emb,
+                                              const int *padding_offsets,
+                                              const int *seq_lens,
+                                              const int *seq_lens_decoder,
+                                              T *qkv_out,
+                                              const int64_t elem_cnt,
+                                              const int q_num_head,
+                                              const int kv_num_head,
+                                              const int seq_len,
+                                              const int last_dim) {
   using LoadT = AlignedVector<T, VecSize>;
   constexpr int HalfVecSize = VecSize / 2;
   using LoadEmbT = AlignedVector<float, HalfVecSize>;
@@ -555,21 +554,20 @@ __global__ void GQANeoxVariableLengthRotaryKernel(
 }
 
 template <typename T, int VecSize = 1>
-__global__ void GQANeoxVariableLengthRotaryKernel(
-    const T *qkv,
-    const float *cos_emb,
-    const float *sin_emb,
-    const int *padding_offsets,
-    const int *seq_lens,
-    const int *seq_lens_decoder,
-    const float *qkv_out_scales,
-    const T *qkv_biases,
-    T *qkv_out,
-    const int64_t elem_cnt,
-    const int q_num_head,
-    const int kv_num_head,
-    const int seq_len,
-    const int last_dim) {
+__global__ void GQANeoxVariableLengthRotaryKernel(const T *qkv,
+                                                  const float *cos_emb,
+                                                  const float *sin_emb,
+                                                  const int *padding_offsets,
+                                                  const int *seq_lens,
+                                                  const int *seq_lens_decoder,
+                                                  const float *qkv_out_scales,
+                                                  const T *qkv_biases,
+                                                  T *qkv_out,
+                                                  const int64_t elem_cnt,
+                                                  const int q_num_head,
+                                                  const int kv_num_head,
+                                                  const int seq_len,
+                                                  const int last_dim) {
   using LoadT = AlignedVector<T, VecSize>;
   using LoadEmbT = AlignedVector<float, VecSize>;
   LoadT left_vec;
@@ -634,7 +632,8 @@ __global__ void cache_kernel(
     const int max_seq_len,
     const int max_blocks_per_seq,
     const int num_heads,
-    const int head_size,
+    const int head_size_qk,
+    const int head_size_v,
     const int block_size,
     const uint32_t elem_cnt,
     const int kv_num_heads) {
@@ -642,24 +641,21 @@ __global__ void cache_kernel(
   LoadT src_vec;
 
   uint32_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
-  const uint32_t hidden_size = kv_num_heads * head_size;
-  const uint32_t offset = 2 * hidden_size;
+  const uint32_t hidden_size_q = num_heads * head_size_qk;
+  const uint32_t hidden_size_k = kv_num_heads * head_size_qk;
+  const uint32_t hidden_size_v = kv_num_heads * head_size_v;
+  const uint32_t offset = hidden_size_k + hidden_size_v;
   for (uint32_t linear_index = global_thread_idx * VecSize,
                 step = gridDim.x * blockDim.x * VecSize;
        linear_index < elem_cnt;
        linear_index += step) {
     const uint32_t token_idx = linear_index / offset;
     const uint32_t bias = linear_index % offset;
-    const uint32_t qkv_id = bias / hidden_size;  // skip q
-    const uint32_t qkv_bias = bias % hidden_size;
-    const uint32_t hi = qkv_bias / head_size;
-    const uint32_t h_bias = qkv_bias % head_size;
     const uint32_t ori_token_idx = token_idx + padding_offsets[token_idx];
     const uint32_t ori_bi = ori_token_idx / max_seq_len;
     if (seq_lens[ori_bi] == 0) continue;
     const uint32_t ori_seq_id =
         ori_token_idx % max_seq_len + seq_lens_decoder[ori_bi];
-
     const int32_t *block_table_now = nullptr;
 
     block_table_now = block_tables + ori_bi * max_blocks_per_seq;
@@ -667,16 +663,29 @@ __global__ void cache_kernel(
     const uint32_t block_idx = block_table_now[ori_seq_id / block_size];
     const uint32_t block_offset = ori_seq_id % block_size;
 
-    const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size +
-                             hi * block_size * head_size +
-                             block_offset * head_size + h_bias;
-    const uint32_t ori_idx =
-        token_idx * (num_heads + 2 * kv_num_heads) * head_size +
-        num_heads * head_size + qkv_id * hidden_size + hi * head_size + h_bias;
-    Load<T, VecSize>(&qkv[ori_idx], &src_vec);
-    if (qkv_id == 0) {
+    if (bias < hidden_size_k) {
+      const uint32_t qkv_bias = bias;
+      const uint32_t hi = qkv_bias / head_size_qk;
+      const uint32_t h_bias = qkv_bias % head_size_qk;
+      const uint32_t tgt_idx =
+          block_idx * kv_num_heads * block_size * head_size_qk +
+          hi * block_size * head_size_qk + block_offset * head_size_qk + h_bias;
+      const uint32_t ori_idx =
+          token_idx * (hidden_size_q + hidden_size_k + hidden_size_v) +
+          hidden_size_q + qkv_bias;
+      Load<T, VecSize>(&qkv[ori_idx], &src_vec);
       Store<T, VecSize>(src_vec, &key_cache[tgt_idx]);
     } else {
+      const uint32_t qkv_bias = bias - hidden_size_k;
+      const uint32_t hi = qkv_bias / head_size_v;
+      const uint32_t h_bias = qkv_bias % head_size_v;
+      const uint32_t tgt_idx =
+          block_idx * kv_num_heads * block_size * head_size_v +
+          hi * block_size * head_size_v + block_offset * head_size_v + h_bias;
+      const uint32_t ori_idx =
+          token_idx * (hidden_size_q + hidden_size_k + hidden_size_v) +
+          hidden_size_q + hidden_size_k + qkv_bias;
+      Load<T, VecSize>(&qkv[ori_idx], &src_vec);
       Store<T, VecSize>(src_vec, &value_cache[tgt_idx]);
     }
   }
@@ -736,8 +745,11 @@ __global__ void append_write_cache_kv_c8_qkv(
       batch_id * max_seq_len - cum_offsets[batch_id];
   const uint32_t kv_batch_stride = (num_heads + 2 * kv_num_heads) * HEAD_DIM;
   const uint32_t kv_h_stride = HEAD_DIM;
-  __shared__ T k_smem_ori[num_rows_per_block * HEAD_DIM];
-  __shared__ T v_smem_ori[num_rows_per_block * HEAD_DIM];
+  extern __shared__ uint8_t smem[];
+  T *k_smem_ori = (T *)smem;  // [num_rows_per_block * HEAD_DIM];
+  T *v_smem_ori =
+      (T *)(smem + num_rows_per_block * HEAD_DIM *
+                       sizeof(T));  // [num_rows_per_block * HEAD_DIM];
 
   smem_t k_smem(k_smem_ori);
   smem_t v_smem(v_smem_ori);
@@ -983,12 +995,22 @@ __global__ void append_write_cache_kv_c4_qkv(
       batch_id * max_seq_len - cum_offsets[batch_id];
   const uint32_t kv_batch_stride = (num_heads + 2 * kv_num_heads) * HEAD_DIM;
   const uint32_t kv_h_stride = HEAD_DIM;
-  __shared__ T k_smem_ori[num_rows_per_block * HEAD_DIM];
-  __shared__ T v_smem_ori[num_rows_per_block * HEAD_DIM];
-  __shared__ T k_scale_smem[HEAD_DIM];
-  __shared__ T v_scale_smem[HEAD_DIM];
-  __shared__ T k_zero_point_smem[HEAD_DIM];
-  __shared__ T v_zero_point_smem[HEAD_DIM];
+  extern __shared__ uint8_t smem[];
+  T *k_smem_ori = (T *)smem;  // [num_rows_per_block * HEAD_DIM];
+  T *v_smem_ori =
+      (T *)(smem + num_rows_per_block * HEAD_DIM *
+                       sizeof(T));  // [num_rows_per_block * HEAD_DIM];
+  T *k_scale_smem = (T *)(smem + num_rows_per_block * HEAD_DIM * 2 *
+                                     sizeof(T));  // [HEAD_DIM];
+  T *v_scale_smem =
+      (T *)(smem + (num_rows_per_block * HEAD_DIM * 2 + HEAD_DIM) *
+                       sizeof(T));  // [HEAD_DIM];
+  T *k_zero_point_smem =
+      (T *)(smem + (num_rows_per_block * HEAD_DIM * 2 + HEAD_DIM * 2) *
+                       sizeof(T));  // [HEAD_DIM];
+  T *v_zero_point_smem =
+      (T *)(smem + (num_rows_per_block * HEAD_DIM * 2 + HEAD_DIM * 3) *
+                       sizeof(T));  // [HEAD_DIM];
   const T *cache_k_scale_now = cache_k_scales + kv_head_idx * HEAD_DIM;
   const T *cache_k_zp_now = cache_k_zero_points + kv_head_idx * HEAD_DIM;
   const T *cache_v_scale_now = cache_v_scales + kv_head_idx * HEAD_DIM;
@@ -1033,16 +1055,10 @@ __global__ void append_write_cache_kv_c4_qkv(
       for (uint32_t fy = 0; fy < num_frags_y / 4;
            ++fy) {  // (num_frags_y * 16) / (8 *  num_elems_per_128b<T>())
         if (chunk_start >= start_len && chunk_start < end_len) {
-          k_smem
-              .load_128b_async<SharedMemFillMode::kNoFill>(
-                  kv_smem_offset_w,
-                  qkv_input + k_read_idx,
-                  chunk_start < end_len);
-          v_smem
-              .load_128b_async<SharedMemFillMode::kNoFill>(
-                  kv_smem_offset_w,
-                  qkv_input + v_read_idx,
-                  chunk_start < end_len);
+          k_smem.load_128b_async<SharedMemFillMode::kNoFill>(
+              kv_smem_offset_w, qkv_input + k_read_idx, chunk_start < end_len);
+          v_smem.load_128b_async<SharedMemFillMode::kNoFill>(
+              kv_smem_offset_w, qkv_input + v_read_idx, chunk_start < end_len);
         }
         kv_smem_offset_w =
             k_smem.advance_offset_by_column<8>(kv_smem_offset_w, fy);
@@ -1248,9 +1264,8 @@ void rotary_qk_variable(
     const int dim_head,
     const cudaStream_t &stream,
     bool use_neox_style = false) {
-  int64_t elem_nums =
-      qkv_out_scales ? token_num * 3 * head_num * dim_head
-                     : token_num * 2 * head_num * dim_head;
+  int64_t elem_nums = qkv_out_scales ? token_num * 3 * head_num * dim_head
+                                     : token_num * 2 * head_num * dim_head;
   if (use_neox_style) {
     elem_nums /= 2;
   }
@@ -1458,11 +1473,12 @@ void CascadeAppendWriteCacheKVQKV(
   auto num_tokens = meta_data.token_nums;
   auto num_heads = meta_data.q_num_heads;
   auto kv_num_heads = meta_data.kv_num_heads;
-  auto head_dim = meta_data.head_dims;
+  auto head_dim_qk = meta_data.head_dims;
+  auto head_dim_v = meta_data.head_dims_v;
   auto block_size = meta_data.block_size;
 
   const uint32_t elem_nums =
-      num_tokens * 2 * kv_num_heads * head_dim;
+      num_tokens * kv_num_heads * (head_dim_qk + head_dim_v);
   constexpr int PackSize = 16 / sizeof(T);
   const int pack_num = elem_nums / PackSize;
   const int blocksize = 128;
@@ -1479,7 +1495,8 @@ void CascadeAppendWriteCacheKVQKV(
       max_seq_len,
       max_blocks_per_seq,
       num_heads,
-      head_dim,
+      head_dim_qk,
+      head_dim_v,
       block_size,
       elem_nums,
       kv_num_heads);
@@ -1511,7 +1528,6 @@ void CascadeAppendWriteCacheKVC8QKV(
   auto num_tokens = meta_data.token_nums;
   auto num_heads = meta_data.q_num_heads;
   auto kv_num_heads = meta_data.kv_num_heads;
-  auto head_dim = meta_data.head_dims;
 
   const uint32_t pad_len = BLOCK_SIZE;
 
@@ -1530,24 +1546,27 @@ void CascadeAppendWriteCacheKVC8QKV(
                                                 HEAD_DIM,
                                                 BLOCK_SIZE,
                                                 num_warps>;
-  cudaFuncSetAttribute(
-      kernel_fn, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
-  kernel_fn<<<grids, blocks, 0, stream>>>(cache_k_out->data<uint8_t>(),
-                                          cache_v_out->data<uint8_t>(),
-                                          qkv.data<T>(),
-                                          cache_k_scale.data<T>(),
-                                          cache_v_scale.data<T>(),
-                                          batch_ids.data<int>(),
-                                          tile_ids_per_batch.data<int>(),
-                                          seq_lens_this_time.data<int>(),
-                                          seq_lens_decoder.data<int>(),
-                                          padding_offsets.data<int>(),
-                                          cum_offsets.data<int>(),
-                                          block_table.data<int>(),
-                                          max_seq_len,
-                                          max_blocks_per_seq,
-                                          num_heads,
-                                          kv_num_heads);
+  if (smem_size >= 48 * 1024) {
+    cudaFuncSetAttribute(
+        kernel_fn, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
+  }
+  kernel_fn<<<grids, blocks, smem_size, stream>>>(
+      cache_k_out->data<uint8_t>(),
+      cache_v_out->data<uint8_t>(),
+      qkv.data<T>(),
+      cache_k_scale.data<T>(),
+      cache_v_scale.data<T>(),
+      batch_ids.data<int>(),
+      tile_ids_per_batch.data<int>(),
+      seq_lens_this_time.data<int>(),
+      seq_lens_decoder.data<int>(),
+      padding_offsets.data<int>(),
+      cum_offsets.data<int>(),
+      block_table.data<int>(),
+      max_seq_len,
+      max_blocks_per_seq,
+      num_heads,
+      kv_num_heads);
 }
 
 template <typename T, uint32_t HEAD_DIM, uint32_t BLOCK_SIZE>
@@ -1578,7 +1597,6 @@ void CascadeAppendWriteCacheKVC4QKV(
   auto num_tokens = meta_data.token_nums;
   auto num_heads = meta_data.q_num_heads;
   auto kv_num_heads = meta_data.kv_num_heads;
-  auto head_dim = meta_data.head_dims;
 
   const uint32_t pad_len = BLOCK_SIZE;
 
@@ -1598,24 +1616,27 @@ void CascadeAppendWriteCacheKVC4QKV(
                                                 HEAD_DIM,
                                                 BLOCK_SIZE,
                                                 num_warps>;
-  cudaFuncSetAttribute(
-      kernel_fn, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
-  kernel_fn<<<grids, blocks, 0, stream>>>(cache_k_out->data<uint8_t>(),
-                                          cache_v_out->data<uint8_t>(),
-                                          qkv.data<T>(),
-                                          cache_k_scale.data<T>(),
-                                          cache_v_scale.data<T>(),
-                                          cache_k_zp.data<T>(),
-                                          cache_v_zp.data<T>(),
-                                          batch_ids.data<int>(),
-                                          tile_ids_per_batch.data<int>(),
-                                          seq_lens_this_time.data<int>(),
-                                          seq_lens_decoder.data<int>(),
-                                          padding_offsets.data<int>(),
-                                          cum_offsets.data<int>(),
-                                          block_table.data<int>(),
-                                          max_seq_len,
-                                          max_blocks_per_seq,
-                                          num_heads,
-                                          kv_num_heads);
+  if (smem_size >= 48 * 1024) {
+    cudaFuncSetAttribute(
+        kernel_fn, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
+  }
+  kernel_fn<<<grids, blocks, smem_size, stream>>>(
+      cache_k_out->data<uint8_t>(),
+      cache_v_out->data<uint8_t>(),
+      qkv.data<T>(),
+      cache_k_scale.data<T>(),
+      cache_v_scale.data<T>(),
+      cache_k_zp.data<T>(),
+      cache_v_zp.data<T>(),
+      batch_ids.data<int>(),
+      tile_ids_per_batch.data<int>(),
+      seq_lens_this_time.data<int>(),
+      seq_lens_decoder.data<int>(),
+      padding_offsets.data<int>(),
+      cum_offsets.data<int>(),
+      block_table.data<int>(),
+      max_seq_len,
+      max_blocks_per_seq,
+      num_heads,
+      kv_num_heads);
 }
\ No newline at end of file
diff --git a/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h b/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h
index 6a14eaf3dde4..3c2f1100964a 100644
--- a/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h
+++ b/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h
@@ -48,43 +48,45 @@ void EncoderWriteCacheWithRopeKernel(
   auto num_heads = meta_data.q_num_heads;
   auto kv_num_heads = meta_data.kv_num_heads;
   auto head_dim = meta_data.head_dims;
-
-  if (num_heads == kv_num_heads) {
-    rotary_qk_variable(
-        qkv_out->data<T>(),
-        qkv.data<QKV_TYPE>(),
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? qkv_biases.get().data<T>() : nullptr,
-        rotary_embs.get().data<float>(),
-        padding_offsets.data<int>(),
-        seq_lens_encoder.data<int>(),
-        seq_lens_decoder.data<int>(),
-        token_num,
-        num_heads,
-        max_seq_len,
-        rotary_embs.get().dims()[2],
-        head_dim,
-        stream,
-        use_neox_style);
-  } else {
-    gqa_rotary_qk_variable(
-        qkv_out->data<T>(),
-        qkv.data<QKV_TYPE>(),
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? qkv_biases.get().data<T>() : nullptr,
-        rotary_embs.get().data<float>(),
-        padding_offsets.data<int>(),
-        seq_lens_encoder.data<int>(),
-        seq_lens_decoder.data<int>(),
-        token_num,
-        num_heads,
-        kv_num_heads,
-        max_seq_len,
-        rotary_embs.get().dims()[2],
-        head_dim,
-        stream,
-        use_neox_style);
+  if (rotary_embs) {
+    if (num_heads == kv_num_heads) {
+      rotary_qk_variable(
+          qkv_out->data<T>(),
+          qkv.data<QKV_TYPE>(),
+          qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+          qkv_biases ? qkv_biases.get().data<T>() : nullptr,
+          rotary_embs.get().data<float>(),
+          padding_offsets.data<int>(),
+          seq_lens_encoder.data<int>(),
+          seq_lens_decoder.data<int>(),
+          token_num,
+          num_heads,
+          max_seq_len,
+          rotary_embs.get().dims()[2],
+          head_dim,
+          stream,
+          use_neox_style);
+    } else {
+      gqa_rotary_qk_variable(
+          qkv_out->data<T>(),
+          qkv.data<QKV_TYPE>(),
+          qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+          qkv_biases ? qkv_biases.get().data<T>() : nullptr,
+          rotary_embs.get().data<float>(),
+          padding_offsets.data<int>(),
+          seq_lens_encoder.data<int>(),
+          seq_lens_decoder.data<int>(),
+          token_num,
+          num_heads,
+          kv_num_heads,
+          max_seq_len,
+          rotary_embs.get().dims()[2],
+          head_dim,
+          stream,
+          use_neox_style);
+    }
   }
+  
   const uint32_t block_size = meta_data.block_size;
   if (cache_quant_type_str == "none") {
     CascadeAppendWriteCacheKVQKV<T>(meta_data,
diff --git a/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh b/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh
index 50fa4e458e9a..7940dd3f94d3 100644
--- a/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh
+++ b/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh
@@ -301,6 +301,96 @@ __global__ void append_speculate_cache_rope_kernel(
   }
 }
 
+template <typename T, int VecSize = 1>
+__global__ void append_speculate_cache_kernel(
+    const T* __restrict__ qkv,  // [bsz, num_heads + 2 * kv_num_heads,
+                                      // head_size]
+    T* __restrict__ key_cache,    // [num_blocks, kv_num_heads, block_size,
+                                  // head_size // 2]
+    T* __restrict__ value_cache,  // [num_blocks, kv_num_heads, block_size,
+                                  // head_size // 2]
+    const int* __restrict__ block_tables,     // [bsz, max_blocks_per_seq]
+    const int* __restrict__ padding_offsets,  // [num_tokens]
+    const int* __restrict__ cum_offsets,
+    const int* __restrict__ seq_lens_decoder,  // [bsz]
+    const int max_seq_len,
+    const int max_blocks_per_seq,
+    const int num_heads,
+    const int head_size_qk,
+    const int head_size_v,
+    const int block_size,
+    const uint32_t elem_cnt,
+    const int kv_num_heads) {
+  using LoadT = AlignedVector<T, VecSize>;
+  constexpr int HalfVecSize = VecSize / 2;
+  LoadT src_vec;
+
+  int64_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
+  // const int64_t hidden_size = (num_heads + 2 * kv_num_heads) * head_size;
+  const uint32_t hidden_size_q = num_heads * head_size_qk;
+  const uint32_t hidden_size_k = kv_num_heads * head_size_qk;
+  const uint32_t hidden_size_v = kv_num_heads * head_size_v;
+  const int64_t hidden_size = hidden_size_q + hidden_size_k + hidden_size_v;
+  const uint32_t offset = kv_num_heads * (head_size_qk + head_size_v);
+  // const int64_t offset = 2 * hidden_size;
+  // const int half_head_size = head_size / 2;
+  for (int32_t linear_index = global_thread_idx * VecSize,
+               step = gridDim.x * blockDim.x * VecSize;
+       linear_index < elem_cnt;
+       linear_index += step) {
+    const int token_id = linear_index / offset;
+    const int ori_bi = (token_id + padding_offsets[token_id]) / max_seq_len;
+    if (seq_lens_decoder[ori_bi] == 0) continue;
+    const int bias = linear_index % offset;
+    const int start_token_idx = ori_bi * max_seq_len - cum_offsets[ori_bi];
+    const int write_seq_id = 
+        seq_lens_decoder[ori_bi] + token_id - start_token_idx;;
+    if (write_seq_id == 0) continue;
+
+    const int* block_table_now = nullptr;
+    block_table_now = block_tables + ori_bi * max_blocks_per_seq;
+    const int block_idx = block_table_now[write_seq_id / block_size];
+    if (block_idx < 0) {
+      printf(
+          "Fatal Error!!!, block idx %d when write_seq_id is %d\n some key var "
+          "%d %d %d %d\n",
+          block_idx,
+          write_seq_id,
+          ori_bi,
+          seq_lens_decoder[ori_bi],
+          token_id,
+          cum_offsets[ori_bi]);
+    }
+    const int block_offset = write_seq_id % block_size;
+
+    if (bias < hidden_size_k) {
+      const uint32_t qkv_bias = bias;
+      const uint32_t hi = qkv_bias / head_size_qk;
+      const uint32_t h_bias = qkv_bias % head_size_qk;
+      const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_qk +
+                             hi * block_size * head_size_qk +
+                             block_offset * head_size_qk + h_bias;
+      const uint32_t ori_idx =
+          token_id * hidden_size +
+          hidden_size_q + qkv_bias;
+      Load<T, VecSize>(&qkv[ori_idx], &src_vec);
+      Store<T, VecSize>(src_vec, &key_cache[tgt_idx]);
+    } else {
+      const uint32_t qkv_bias = bias - hidden_size_k;
+      const uint32_t hi = qkv_bias / head_size_v;
+      const uint32_t h_bias = qkv_bias % head_size_v;
+      const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_v +
+                             hi * block_size * head_size_v +
+                             block_offset * head_size_v + h_bias;
+      const uint32_t ori_idx =
+          token_id * hidden_size +
+          hidden_size_q + hidden_size_k + qkv_bias;
+      Load<T, VecSize>(&qkv[ori_idx], &src_vec);
+      Store<T, VecSize>(src_vec, &value_cache[tgt_idx]);
+    }
+  }
+}
+
 template <typename T, int VecSize = 1, typename InT = T>
 __global__ void append_speculate_cache_neox_rope_kernel(
     const InT* __restrict__ qkv,  // [token_num, num_heads + 2 * gqa_group_size,
diff --git a/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu b/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu
index 588442183d1d..9aab503972d5 100644
--- a/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu
+++ b/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu
@@ -15,6 +15,52 @@
 #include "speculate_write_cache_with_rope_kernel.h"
 #include "utils.cuh"
 
+template <typename T>
+void SpeculateWriteCacheKV(const AppendAttnMetaData& meta_data,
+                         const paddle::Tensor& qkv,
+                         const paddle::Tensor& seq_lens,
+                         const paddle::Tensor& padding_offsets,
+                         const paddle::Tensor& cum_offsets,
+                         const paddle::Tensor& block_tables,
+                         const int max_seq_len,
+                         cudaStream_t& stream,
+                         paddle::Tensor* key_cache_out,
+                         paddle::Tensor* value_cache_out) {
+  auto max_blocks_per_seq = meta_data.max_blocks_per_seq;
+  auto bsz = meta_data.batch_size;
+  auto block_size = meta_data.block_size;
+  auto head_dim_qk = meta_data.head_dims;
+  auto head_dim_v = meta_data.head_dims_v;
+  auto num_heads = meta_data.q_num_heads;
+  auto kv_num_heads = meta_data.kv_num_heads;
+  auto token_num = meta_data.token_nums;
+  const uint32_t elem_nums = token_num * kv_num_heads * (head_dim_qk + head_dim_v);
+
+  constexpr int PackSize = 16 / sizeof(T);
+  const int pack_num = elem_nums / PackSize;
+  const int blocksize = 128;
+  int grid_size = 1;
+  GetNumBlocks<128>(pack_num, &grid_size);
+
+  append_speculate_cache_kernel<T, PackSize>
+      <<<grid_size, blocksize, 0, stream>>>(
+          reinterpret_cast<T*>(const_cast<T*>(qkv.data<T>())),
+          reinterpret_cast<T*>(key_cache_out->data<T>()),
+          reinterpret_cast<T*>(value_cache_out->data<T>()),
+          block_tables.data<int>(),
+          padding_offsets.data<int>(),
+          cum_offsets.data<int>(),
+          seq_lens.data<int>(),
+          max_seq_len,
+          max_blocks_per_seq,
+          num_heads,
+          head_dim_qk,
+          head_dim_v,
+          block_size,
+          elem_nums,
+          kv_num_heads);
+}
+
 // rope + write
 template <typename T, typename QKV_TYPE>
 void append_speculate_cache_rope(const QKV_TYPE* qkv,
@@ -332,119 +378,129 @@ void SpeculateWriteCacheWithRoPEKernel(
   auto num_heads = meta_data.q_num_heads;
   auto kv_num_heads = meta_data.kv_num_heads;
 
-
-  const float* cos_emb =
-      rotary_embs ? rotary_embs.get().data<float>() : nullptr;
-  const float* sin_emb;
   if (rotary_embs) {
-    sin_emb =
-        use_neox_rotary_style
-            ? rotary_embs.get().data<float>() + max_seq_len * dim_head
-            : rotary_embs.get().data<float>() + max_seq_len * dim_head / 2;
-  }
-  if (cache_quant_type_str == "none") {
-    append_speculate_cache_rope(
-        reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
-        reinterpret_cast<DataType_*>(key_cache_out->data<T>()),
-        reinterpret_cast<DataType_*>(value_cache_out->data<T>()),
-        reinterpret_cast<DataType_*>(qkv_out->data<T>()),
-        block_tables.data<int>(),
-        padding_offsets.data<int>(),
-        cum_offsets.data<int>(),
-        seq_lens.data<int>(),
-        seq_lens_encoder.data<int>(),
-        cos_emb,
-        sin_emb,
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(qkv_biases.get().data<T>()))
-                   : nullptr,
-        max_seq_len,
-        max_blocks_per_seq,
-        num_heads,
-        kv_num_heads,
-        dim_head,
-        block_size,
-        bsz,
-        token_nums,
-        stream,
-        use_neox_rotary_style);
-  } else if (cache_quant_type_str == "cache_int8") {
-    append_speculate_cache_int8_rope(
-        reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
-        key_cache_out->data<uint8_t>(),
-        value_cache_out->data<uint8_t>(),
-        reinterpret_cast<DataType_*>(qkv_out->data<T>()),
-        block_tables.data<int>(),
-        padding_offsets.data<int>(),
-        cum_offsets.data<int>(),
-        seq_lens.data<int>(),
-        seq_lens_encoder.data<int>(),
-        cos_emb,
-        sin_emb,
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(qkv_biases.get().data<T>()))
-                   : nullptr,
-        cache_k_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_k_scale.get().data<T>()))
-                      : nullptr,
-        cache_v_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_v_scale.get().data<T>()))
-                      : nullptr,
-        max_seq_len,
-        max_blocks_per_seq,
-        num_heads,
-        kv_num_heads,
-        dim_head,
-        block_size,
-        bsz,
-        token_nums,
-        stream,
-        use_neox_rotary_style);
-  } else if (cache_quant_type_str == "cache_int4_zp") {
-    append_speculate_cache_int4_rope(
-        reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
-        key_cache_out->data<uint8_t>(),
-        value_cache_out->data<uint8_t>(),
-        reinterpret_cast<DataType_*>(const_cast<T*>(qkv_out->data<T>())),
-        block_tables.data<int>(),
-        padding_offsets.data<int>(),
-        cum_offsets.data<int>(),
-        seq_lens.data<int>(),
-        seq_lens_encoder.data<int>(),
-        cos_emb,
-        sin_emb,
-        qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
-        qkv_biases ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(qkv_biases.get().data<T>()))
-                   : nullptr,
-        cache_k_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_k_scale.get().data<T>()))
-                      : nullptr,
-        cache_v_scale ? reinterpret_cast<DataType_*>(
-                            const_cast<T*>(cache_v_scale.get().data<T>()))
-                      : nullptr,
-        cache_k_zp ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(cache_k_zp.get().data<T>()))
-                   : nullptr,
-        cache_v_zp ? reinterpret_cast<DataType_*>(
-                         const_cast<T*>(cache_v_zp.get().data<T>()))
-                   : nullptr,
-        max_seq_len,
-        max_blocks_per_seq,
-        num_heads,
-        kv_num_heads,
-        dim_head,
-        block_size,
-        bsz,
-        token_nums,
-        stream,
-        use_neox_rotary_style);
+    const float* cos_emb =
+        rotary_embs ? rotary_embs.get().data<float>() : nullptr;
+    const float* sin_emb =
+            use_neox_rotary_style
+                ? rotary_embs.get().data<float>() + max_seq_len * dim_head
+                : rotary_embs.get().data<float>() + max_seq_len * dim_head / 2;
+    
+    if (cache_quant_type_str == "none") {
+        append_speculate_cache_rope(
+            reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
+            reinterpret_cast<DataType_*>(key_cache_out->data<T>()),
+            reinterpret_cast<DataType_*>(value_cache_out->data<T>()),
+            reinterpret_cast<DataType_*>(qkv_out->data<T>()),
+            block_tables.data<int>(),
+            padding_offsets.data<int>(),
+            cum_offsets.data<int>(),
+            seq_lens.data<int>(),
+            seq_lens_encoder.data<int>(),
+            cos_emb,
+            sin_emb,
+            qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+            qkv_biases ? reinterpret_cast<DataType_*>(
+                            const_cast<T*>(qkv_biases.get().data<T>()))
+                    : nullptr,
+            max_seq_len,
+            max_blocks_per_seq,
+            num_heads,
+            kv_num_heads,
+            dim_head,
+            block_size,
+            bsz,
+            token_nums,
+            stream,
+            use_neox_rotary_style);
+    } else if (cache_quant_type_str == "cache_int8") {
+        append_speculate_cache_int8_rope(
+            reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
+            key_cache_out->data<uint8_t>(),
+            value_cache_out->data<uint8_t>(),
+            reinterpret_cast<DataType_*>(qkv_out->data<T>()),
+            block_tables.data<int>(),
+            padding_offsets.data<int>(),
+            cum_offsets.data<int>(),
+            seq_lens.data<int>(),
+            seq_lens_encoder.data<int>(),
+            cos_emb,
+            sin_emb,
+            qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+            qkv_biases ? reinterpret_cast<DataType_*>(
+                            const_cast<T*>(qkv_biases.get().data<T>()))
+                    : nullptr,
+            cache_k_scale ? reinterpret_cast<DataType_*>(
+                                const_cast<T*>(cache_k_scale.get().data<T>()))
+                        : nullptr,
+            cache_v_scale ? reinterpret_cast<DataType_*>(
+                                const_cast<T*>(cache_v_scale.get().data<T>()))
+                        : nullptr,
+            max_seq_len,
+            max_blocks_per_seq,
+            num_heads,
+            kv_num_heads,
+            dim_head,
+            block_size,
+            bsz,
+            token_nums,
+            stream,
+            use_neox_rotary_style);
+    } else if (cache_quant_type_str == "cache_int4_zp") {
+        append_speculate_cache_int4_rope(
+            reinterpret_cast<const QKV_TYPE*>(qkv_ptr),
+            key_cache_out->data<uint8_t>(),
+            value_cache_out->data<uint8_t>(),
+            reinterpret_cast<DataType_*>(const_cast<T*>(qkv_out->data<T>())),
+            block_tables.data<int>(),
+            padding_offsets.data<int>(),
+            cum_offsets.data<int>(),
+            seq_lens.data<int>(),
+            seq_lens_encoder.data<int>(),
+            cos_emb,
+            sin_emb,
+            qkv_out_scales ? qkv_out_scales.get().data<float>() : nullptr,
+            qkv_biases ? reinterpret_cast<DataType_*>(
+                            const_cast<T*>(qkv_biases.get().data<T>()))
+                    : nullptr,
+            cache_k_scale ? reinterpret_cast<DataType_*>(
+                                const_cast<T*>(cache_k_scale.get().data<T>()))
+                        : nullptr,
+            cache_v_scale ? reinterpret_cast<DataType_*>(
+                                const_cast<T*>(cache_v_scale.get().data<T>()))
+                        : nullptr,
+            cache_k_zp ? reinterpret_cast<DataType_*>(
+                            const_cast<T*>(cache_k_zp.get().data<T>()))
+                    : nullptr,
+            cache_v_zp ? reinterpret_cast<DataType_*>(
+                            const_cast<T*>(cache_v_zp.get().data<T>()))
+                    : nullptr,
+            max_seq_len,
+            max_blocks_per_seq,
+            num_heads,
+            kv_num_heads,
+            dim_head,
+            block_size,
+            bsz,
+            token_nums,
+            stream,
+            use_neox_rotary_style);
+    } else {
+        PD_THROW(
+            "cache_quant_type_str should be one of [none, cache_int8, "
+            "cache_int4_zp]");
+    }
   } else {
-    PD_THROW(
-        "cache_quant_type_str should be one of [none, cache_int8, "
-        "cache_int4_zp]");
+    SpeculateWriteCacheKV<QKV_TYPE>(meta_data,
+                                    qkv,
+                                    seq_lens,
+                                    padding_offsets,
+                                    cum_offsets,
+                                    block_tables,
+                                    max_seq_len,
+                                    stream,
+                                    key_cache_out,
+                                    value_cache_out);
   }
 }
 
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu
index 79ba5cd7bc85..78857845f61c 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu
@@ -46,6 +46,7 @@ template void CascadeAppendAttentionC16Kernel<paddle::bfloat16, paddle::bfloat16
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_fp8_kernel.cu
index 6ba3604cf8c3..fce608b905b3 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_fp8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_fp8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC16Kernel<paddle::bfloat16, paddle::float8_e
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_int8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_int8_kernel.cu
index b34b6be4058a..1b77674cb946 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_int8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_int8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC16Kernel<paddle::bfloat16, int8_t>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu
index 09e149c25233..e10ebb01f08c 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC16Kernel<paddle::float16, paddle::float16>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu
index 648d301880b8..f60b0b079f12 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC16Kernel<paddle::float16, paddle::float8_e4
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_int8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_int8_kernel.cu
index a0f4a87b6927..8526d6049d5b 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_int8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_int8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC16Kernel<paddle::float16, int8_t>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu
index a3f0c95f02e2..818bab5d3aa1 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel<paddle::bfloat16, paddle::bfloat16>
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu
index 63b03741b0e7..5a483a5fff82 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel<paddle::bfloat16, paddle::float8_e4
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_int8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_int8_kernel.cu
index a9d560dfef9b..56a4db05f6ac 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_int8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_int8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel<paddle::bfloat16, int8_t>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu
index aae73a837de4..5ab5eb449ad2 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu
@@ -46,6 +46,7 @@ template void CascadeAppendAttentionC4Kernel<paddle::float16, paddle::float16>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu
index 57c5e36fca93..6404610407c3 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel<paddle::float16, paddle::float8_e4m
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_int8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_int8_kernel.cu
index 89c4bb58d5f2..9c429d52185a 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_int8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_int8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel<paddle::float16, int8_t>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu
index e5d85cad2b5e..dc0388814692 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu
@@ -47,6 +47,7 @@ CascadeAppendAttentionC8Kernel<paddle::bfloat16, paddle::bfloat16>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu
index e115efacf907..5818d9b5a934 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel<paddle::bfloat16, paddle::float8_e4
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_int8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_int8_kernel.cu
index 017018f118f0..01138421f11c 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_int8_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_int8_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel<paddle::bfloat16, int8_t>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu
index cfa10da809da..530b75dab128 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel<paddle::float16, paddle::float16>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu
index 842fb6415fca..bb92b986e603 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel<paddle::float16, paddle::float8_e4m
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_int8_kerne.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_int8_kerne.cu
index 0d143e3d87b4..cfeec3fa5660 100644
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_int8_kerne.cu
+++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_int8_kerne.cu
@@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel<paddle::float16, int8_t>(
     const int block_shape_q,
     const int max_seq_len,
     const int max_dec_len,
+    const float softmax_scale,
     const float quant_max_bound,
     const float quant_min_bound,
     const float in_scale,
diff --git a/csrc/gpu/append_attn/utils.cuh b/csrc/gpu/append_attn/utils.cuh
index d5545caf103c..5c871a238025 100644
--- a/csrc/gpu/append_attn/utils.cuh
+++ b/csrc/gpu/append_attn/utils.cuh
@@ -25,6 +25,7 @@ struct AppendAttnMetaData {
   int kv_num_heads;
   int token_nums;
   int head_dims;
+  int head_dims_v;
   int max_blocks_per_seq;
 };
 
@@ -277,6 +278,16 @@ __forceinline__ __host__ __device__ void vec_cast<nv_bfloat16, float>(
       __VA_ARGS__                                  \
       break;                                       \
     }                                              \
+    case 192: {                                    \
+      constexpr size_t HEAD_DIM = 192;             \
+      __VA_ARGS__                                  \
+      break;                                       \
+    }                                              \
+    case 256: {                                    \
+      constexpr size_t HEAD_DIM = 256;             \
+      __VA_ARGS__                                  \
+      break;                                       \
+    }                                              \
     default: {                                     \
       PD_THROW("not support the head_dim: ", head_dim);        \
     }                                              \
diff --git a/csrc/gpu/fused_rotary_position_encoding.cu b/csrc/gpu/fused_rotary_position_encoding.cu
new file mode 100644
index 000000000000..c405045890cc
--- /dev/null
+++ b/csrc/gpu/fused_rotary_position_encoding.cu
@@ -0,0 +1,141 @@
+// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "helper.h"
+#include "paddle/extension.h"
+
+template <typename T, bool IS_NEOX>
+inline __device__ void apply_token_rotary_embedding_kernel(
+    T* __restrict__ arr,
+    const T* __restrict__ cos_ptr,
+    const T* __restrict__ sin_ptr,
+    int rot_offset,
+    int embed_dim) {
+  int x_index, y_index;
+  T cos, sin;
+  if (IS_NEOX) {
+    x_index = rot_offset;
+    y_index = embed_dim + rot_offset;
+    cos = cos_ptr[x_index];
+    sin = sin_ptr[x_index];
+  } else {
+    x_index = 2 * rot_offset;
+    y_index = 2 * rot_offset + 1;
+    cos = cos_ptr[x_index / 2];
+    sin = sin_ptr[x_index / 2];
+  }
+
+  const T x = arr[x_index];
+  const T y = arr[y_index];
+  arr[x_index] = x * cos - y * sin;
+  arr[y_index] = y * cos + x * sin;
+}
+
+
+template <typename T, bool IS_NEOX>
+__global__ void apply_rotary_embedding_kernel(
+    T* __restrict__ query,  // [num_tokens, num_heads, head_size]
+    T* __restrict__ key,    // [num_tokens, num_kv_heads, head_size]
+    const int* __restrict__ position_ids,  // [num_tokens]
+    const T* __restrict__ cos_sin_cache,   // [max_position, 2, rot_dim // 2]
+    const int rot_dim,
+    const int64_t query_stride,
+    const int64_t key_stride,
+    const int num_heads,
+    const int num_kv_heads,
+    const int head_size) {
+  // Each thread block is responsible for one token.
+  const int token_idx = blockIdx.x;
+  int pos = position_ids[token_idx];
+  const T* cache_ptr = cos_sin_cache + pos * rot_dim;
+
+  const int embed_dim = rot_dim / 2;
+  const T* cos_ptr = cache_ptr;
+  const T* sin_ptr = cache_ptr + embed_dim;
+
+  const int nq = num_heads * embed_dim;
+  for (int i = threadIdx.x; i < nq; i += blockDim.x) {
+    const int head_idx = i / embed_dim;
+    const int64_t token_head = token_idx * query_stride + head_idx * head_size;
+    const int rot_offset = i % embed_dim;
+    apply_token_rotary_embedding_kernel<T, IS_NEOX>(
+        query + token_head, cos_ptr, sin_ptr, rot_offset, embed_dim);
+  }
+
+  const int nk = num_kv_heads * embed_dim;
+  for (int i = threadIdx.x; i < nk; i += blockDim.x) {
+    const int head_idx = i / embed_dim;
+    const int64_t token_head = token_idx * key_stride + head_idx * head_size;
+    const int rot_offset = i % embed_dim;
+    apply_token_rotary_embedding_kernel<T, IS_NEOX>(
+        key + token_head, cos_ptr, sin_ptr, rot_offset, embed_dim);
+  }
+}
+
+
+void FusedRotaryPositionEncoding(
+    paddle::Tensor& query,  // [num_tokens, num_heads, head_size] or
+                            // [num_tokens, num_heads * head_size]
+    paddle::Tensor& key,
+    // [num_tokens, num_kv_heads, head_size] or [num_tokens, num_kv_heads *
+    // head_size]
+    const paddle::Tensor& position_ids,   // [num_tokens]
+    const paddle::Tensor& cos_sin_cache,  // [max_position, rot_dim]
+    int head_size,
+    bool is_neox) {
+  int64_t num_tokens = query.dims()[0];
+  int num_heads = query.numel() / num_tokens / head_size;
+  int num_kv_heads = key.numel() / num_tokens / head_size;
+  int rot_dim = cos_sin_cache.dims()[1];
+  int64_t query_stride = num_heads * head_size;
+  int64_t key_stride = num_kv_heads * head_size;
+
+  dim3 grid(num_tokens);
+  dim3 block(std::min<int64_t>(num_heads * rot_dim / 2, 512));
+  PD_DISPATCH_FLOATING_AND_HALF_TYPES(
+      query.dtype(), "apply_rotary_embedding_kernel", [&] {
+        if (is_neox) {
+          apply_rotary_embedding_kernel<data_t, true>
+              <<<grid, block, 0, query.stream()>>>(query.data<data_t>(),
+                                                   key.data<data_t>(),
+                                                   position_ids.data<int>(),
+                                                   cos_sin_cache.data<data_t>(),
+                                                   rot_dim,
+                                                   query_stride,
+                                                   key_stride,
+                                                   num_heads,
+                                                   num_kv_heads,
+                                                   head_size);
+        } else {
+          apply_rotary_embedding_kernel<data_t, false>
+              <<<grid, block, 0, query.stream()>>>(query.data<data_t>(),
+                                                   key.data<data_t>(),
+                                                   position_ids.data<int>(),
+                                                   cos_sin_cache.data<data_t>(),
+                                                   rot_dim,
+                                                   query_stride,
+                                                   key_stride,
+                                                   num_heads,
+                                                   num_kv_heads,
+                                                   head_size);
+        }
+      });
+}
+
+PD_BUILD_OP(fused_rotary_position_encoding)
+    .Inputs({"query", "key", "position_ids", "cos_sin_cache"})
+    .Outputs({"query_out", "key_out"})
+    .Attrs({"head_size: int", "is_neox: bool"})
+    .SetInplaceMap({{"query", "query_out"}, {"key", "key_out"}})
+    .SetKernelFn(PD_KERNEL(FusedRotaryPositionEncoding));
\ No newline at end of file
diff --git a/csrc/gpu/get_position_ids.cu b/csrc/gpu/get_position_ids.cu
new file mode 100644
index 000000000000..dbd25497a2fa
--- /dev/null
+++ b/csrc/gpu/get_position_ids.cu
@@ -0,0 +1,75 @@
+// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "helper.h"
+#include "paddle/extension.h"
+
+__global__ void GetPositionIdsKernel(
+    const int* seq_lens_encoder,  // [bsz] 每个批次的 encoder 长度
+    const int* seq_lens_decoder,  // [bsz] 每个批次的 decoder 长度
+    const int* seq_lens_this_time,
+    int* position_ids,            // 输出的一维 position_ids
+    const int bsz) {              // 批次大小
+  // 当前线程索引（每个线程对应一个批次）
+  int tid = threadIdx.x;
+  if (tid >= bsz) return;
+
+  // 动态计算当前批次的偏移量
+  int offset = 0;
+  for (int i = 0; i < tid; i++) {
+    offset += seq_lens_encoder[i];
+    if (seq_lens_decoder[i] > 0) {
+      offset += seq_lens_this_time[i];
+    }
+  }
+
+  // 当前批次的 encoder 和 decoder 长度
+  int encoder_len = seq_lens_encoder[tid];
+  int decoder_len = seq_lens_decoder[tid];
+  int seq_len_this_time = seq_lens_this_time[tid];
+
+  // 写入 encoder 的 position_ids
+  for (int i = 0; i < encoder_len; i++) {
+    position_ids[offset + i] = i;
+  }
+  offset += encoder_len;
+
+  // 写入 decoder 的 position_ids
+  if (decoder_len > 0) {
+    for (int i = 0; i < seq_len_this_time; i++) {
+      position_ids[offset + i] = decoder_len + i;  // 使用 decoder 长度本身
+    }
+  }
+}
+
+
+void GetPositionIds(const paddle::Tensor& seq_lens_encoder,
+                    const paddle::Tensor& seq_lens_decoder,
+                    const paddle::Tensor& seq_lens_this_time,
+                    const paddle::Tensor& position_ids) {
+  const int bsz = seq_lens_encoder.shape()[0];
+
+  GetPositionIdsKernel<<<1, bsz, 0, position_ids.stream()>>>(
+      seq_lens_encoder.data<int>(),
+      seq_lens_decoder.data<int>(),
+      seq_lens_this_time.data<int>(),
+      const_cast<int*>(position_ids.data<int>()),
+      bsz);
+}
+
+PD_BUILD_OP(get_position_ids)
+    .Inputs({"seq_lens_encoder", "seq_lens_decoder", "seq_lens_this_time", "position_ids"})
+    .Outputs({"position_ids_out"})
+    .SetInplaceMap({{"position_ids", "position_ids_out"}})
+    .SetKernelFn(PD_KERNEL(GetPositionIds));
\ No newline at end of file
diff --git a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_preprocess.cu b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_preprocess.cu
index 853edd874580..d878ef32cdb6 100644
--- a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_preprocess.cu
+++ b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_preprocess.cu
@@ -128,7 +128,7 @@ void DraftModelPreprocess(const paddle::Tensor& draft_tokens,
                           const paddle::Tensor& base_model_stop_flags,
                           const paddle::Tensor& base_model_draft_tokens,
                           const int max_draft_token,
-                          const std::string& draft_type) {
+                          const bool truncate_first_token) {
   int real_bsz = seq_lens_this_time.shape()[0];
   int accept_tokens_len = accept_tokens.shape()[1];
   int input_ids_len = input_ids.shape()[1];
@@ -140,7 +140,7 @@ void DraftModelPreprocess(const paddle::Tensor& draft_tokens,
       not_need_stop.copy_to(seq_lens_this_time.place(), false);
 
 
-  if (draft_type == "eagle") {
+  if (truncate_first_token) {
     draft_model_preprocess_kernel<BlockSize, true>
         <<<1, BlockSize, 0, cu_stream>>>(
             const_cast<int64_t*>(draft_tokens.data<int64_t>()),
@@ -226,7 +226,7 @@ PD_BUILD_OP(draft_model_preprocess)
               "step_idx_out",
               "not_need_stop_out",
               "first_token_record_out"})
-    .Attrs({"max_draft_token: int", "draft_type: std::string"})
+    .Attrs({"max_draft_token: int", "truncate_first_token: bool"})
     .SetInplaceMap({{"draft_tokens", "draft_tokens_out"},
                     {"input_ids", "input_ids_out"},
                     {"stop_flags", "stop_flags_out"},
diff --git a/csrc/gpu/speculate_decoding_kernels/ngram_match.cc b/csrc/gpu/speculate_decoding_kernels/ngram_match.cc
index 3c19064b2f66..958b01ddece3 100644
--- a/csrc/gpu/speculate_decoding_kernels/ngram_match.cc
+++ b/csrc/gpu/speculate_decoding_kernels/ngram_match.cc
@@ -37,6 +37,7 @@ void find_candidate_pred_tokens(const int64_t *input_ids,
         int32_t *seq_lens_this_time,
         int32_t *seq_lens_encoder,
         int32_t *seq_lens_decoder,
+        int64_t *max_dec_len,
         int64_t input_ids_stride,
         int64_t pre_ids_stride,
         int64_t draft_tokens_stride,
@@ -55,8 +56,8 @@ void find_candidate_pred_tokens(const int64_t *input_ids,
         }
     }
     for (int batch_idx = 0; batch_idx < real_batch_size; batch_idx++) {
-        max_draft_tokens = draft_token_num[batch_idx];
-        // int local_draft_tokens = max_draft_tokens;
+        max_draft_tokens = std::min(static_cast<int64_t>(
+            draft_token_num[batch_idx]), max_dec_len[batch_idx] - step_idx[batch_idx] - 1);
         if (seq_lens_encoder[batch_idx] > 0) {
             continue;
         } else if (seq_lens_decoder[batch_idx] == 0) {
@@ -90,14 +91,7 @@ void find_candidate_pred_tokens(const int64_t *input_ids,
                 continue;
             }
             const int64_t *ngram = cur_pre_ids + (cur_step_idx + 1 - ngram_size);
-#ifdef _DEBUG
-            if (batch_idx == 0) {
-                for (int mm = 0; mm < ngram_size; mm++) {
-                    printf("idx %d: %lld\n", mm, ngram[mm]);
-                }
-            }
-            printf("cur_input_ids_len %d\n", cur_input_ids_len);
-#endif
+
             // Iterate through sliding windows of size ngram_size
             bool match_input = false;
             for (int64_t i = 0; i <= cur_input_ids_len - ngram_size; ++i) {
@@ -114,13 +108,7 @@ void find_candidate_pred_tokens(const int64_t *input_ids,
                     int64_t end_idx = std::min(start_idx + max_draft_tokens, cur_input_ids_len);
                     if (start_idx >= end_idx)
                         continue;
-#ifdef _DEBUG
-                    printf("batch_idx:%d. ngram_size:%d. idx:%lld. \n", batch_idx, ngram_size, i);
-                    printf("start:%d. end:%d.\n", start_idx, end_idx);
-#endif
-                    // Ensure we don't go beyond the length of input_ids and avoid self-match
-                    // if (end_idx <= cur_input_ids_len && start_idx < cur_input_ids_len - ngram_size) {
-                    // Return a pointer to the next num_pred_tokens
+
                     int64_t cur_draft_token_num = end_idx - start_idx;
 
                     seq_lens_this_time[batch_idx] = cur_draft_token_num + 1;
@@ -133,15 +121,10 @@ void find_candidate_pred_tokens(const int64_t *input_ids,
                 }
             }
             if (!match_input) {
-#ifdef _DEBUG
-                printf("match_input is false so match output\n");
-#endif
                 for (int64_t i = 0; i <= cur_step_idx - ngram_size; ++i) {
                     // Check if the current window matches the ngram
                     bool match = true;
-#ifdef _DEBUG
-                    printf("search %d token in pre_ids\n", i);
-#endif
+
                     for (int j = 0; j < ngram_size; j++) {
                         if (ngram[j] != cur_pre_ids[i + j]) {
                             match = false;
@@ -150,26 +133,14 @@ void find_candidate_pred_tokens(const int64_t *input_ids,
                     }
 
                     if (match) {
-#ifdef _DEBUG
-                        printf("%d token in pre_ids matched\n", i);
-#endif
                         int64_t start_idx = i + ngram_size;
                         int64_t end_idx = std::min(start_idx + max_draft_tokens, cur_step_idx);
                         int64_t cur_draft_token_num = end_idx - start_idx;
                         if (start_idx >= end_idx)
                             continue;
 
-#ifdef _DEBUG
-                        printf("cur_step_idx %d, start_idx %lld, end_idx %lld, cur_draft_token_num is %lld\n",
-                                cur_step_idx,
-                                start_idx,
-                                end_idx,
-                                cur_draft_token_num);
-#endif
-
                         seq_lens_this_time[batch_idx] = cur_draft_token_num + 1;
                         memcpy(cur_draft_tokens + 1, cur_pre_ids + start_idx, sizeof(int64_t) * cur_draft_token_num);
-                        // To break the current batch_idx for-loop
                         ngram_size = 0;
                         break;
                     }
@@ -188,6 +159,7 @@ void NgramMatch(const paddle::Tensor &input_ids,
         const paddle::Tensor &seq_lens_this_time,
         const paddle::Tensor &seq_lens_encoder,
         const paddle::Tensor &seq_lens_decoder,
+        const paddle::Tensor &max_dec_len,
         const int real_batch_size,
         const int max_ngram_size,
         const int max_draft_tokens) {
@@ -210,6 +182,7 @@ void NgramMatch(const paddle::Tensor &input_ids,
             const_cast<int32_t *>(seq_lens_this_time.data<int32_t>()),
             const_cast<int32_t *>(seq_lens_encoder.data<int32_t>()),
             const_cast<int32_t *>(seq_lens_decoder.data<int32_t>()),
+            const_cast<int64_t *>(max_dec_len.data<int64_t>()),
             input_ids_stride,
             pre_ids_stride,
             draft_tokens_stride,
@@ -227,7 +200,8 @@ PD_BUILD_OP(ngram_match)
                 "draft_tokens",
                 "seq_lens_this_time",
                 "seq_lens_encoder",
-                "seq_lens_decoder"})
+                "seq_lens_decoder",
+                "max_dec_len"})
         .Attrs({"real_batch_size: int", "max_ngram_size: int", "max_draft_tokens: int"})
         .Outputs({"draft_tokens_out", "seq_lens_this_time_out"})
         .SetKernelFn(PD_KERNEL(NgramMatch))
diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_clear_accept_nums.cu b/csrc/gpu/speculate_decoding_kernels/speculate_clear_accept_nums.cu
new file mode 100644
index 000000000000..cbcd6c0b5a3f
--- /dev/null
+++ b/csrc/gpu/speculate_decoding_kernels/speculate_clear_accept_nums.cu
@@ -0,0 +1,42 @@
+// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+// 
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// 
+//     http://www.apache.org/licenses/LICENSE-2.0
+// 
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "helper.h"
+
+__global__ void speculate_clear_accept_nums_kernel(
+                                 int* accept_num,
+                                 const int* seq_lens_decoder,
+                                 const int max_bsz
+                                 ) {
+    const int bid = threadIdx.x;
+    if (bid >= max_bsz) return;
+    accept_num[bid] = seq_lens_decoder[bid] == 0 ? 0 : accept_num[bid];
+
+}
+
+void SpeculateClearAcceptNums(const paddle::Tensor& accept_num,
+                   const paddle::Tensor& seq_lens_decoder
+                   ) {
+    // printf("enter clear \n");
+    const int max_bsz = seq_lens_decoder.shape()[0];
+    speculate_clear_accept_nums_kernel<<<1, 1024, 0, accept_num.stream()>>>(const_cast<int*>(accept_num.data<int>()),
+                                                                            seq_lens_decoder.data<int>(), max_bsz);
+}
+
+PD_BUILD_OP(speculate_clear_accept_nums)
+    .Inputs({"accept_num", 
+             "seq_lens_decoder"})
+    .Outputs({"seq_lens_decoder_out"})
+    .SetInplaceMap({{"seq_lens_decoder", "seq_lens_decoder_out"}})
+    .SetKernelFn(PD_KERNEL(SpeculateClearAcceptNums));
\ No newline at end of file
diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_update.cu b/csrc/gpu/speculate_decoding_kernels/speculate_update.cu
new file mode 100644
index 000000000000..596805d6b61d
--- /dev/null
+++ b/csrc/gpu/speculate_decoding_kernels/speculate_update.cu
@@ -0,0 +1,140 @@
+// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+// 
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// 
+//     http://www.apache.org/licenses/LICENSE-2.0
+// 
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "helper.h"
+
+template <int THREADBLOCK_SIZE>
+__global__ void speculate_update(int *seq_lens_encoder,
+                                 int *seq_lens_decoder,
+                                 bool *not_need_stop,
+                                 int64_t *draft_tokens,
+                                 int *actual_draft_token_nums,
+                                 const int64_t *accept_tokens,
+                                 const int *accept_num,
+                                 const bool *stop_flags,
+                                 const int *seq_lens_this_time,
+                                 const bool *is_block_step,
+                                 const int real_bsz,
+                                 const int max_draft_tokens) {
+    const int bid = threadIdx.x;
+    const int accept_num_now = accept_num[bid];
+    int stop_flag_now_int = 0;
+    if (!(is_block_step[bid] || bid >= real_bsz)) {
+        if (stop_flags[bid]) {
+            stop_flag_now_int = 1;
+        }
+        if (seq_lens_encoder[bid] == 0) {
+            seq_lens_decoder[bid] += accept_num_now;
+        }
+
+        if (seq_lens_this_time[bid] > 1 &&
+            seq_lens_encoder[bid] ==
+                0) {  // 对于append模式，需要根据接收与否确定是否要降低下次draft
+                      // token的数量
+            auto current_actual_draft_token_num = actual_draft_token_nums[bid];
+            if (accept_num_now - 1 == current_actual_draft_token_num) {
+                if (current_actual_draft_token_num + 2 <=
+                    max_draft_tokens - 1) {
+                    actual_draft_token_nums[bid] =
+                        current_actual_draft_token_num + 2;
+                } else if (current_actual_draft_token_num + 1 <=
+                           max_draft_tokens - 1) {
+                    actual_draft_token_nums[bid] =
+                        current_actual_draft_token_num + 1;
+                } else {
+                    actual_draft_token_nums[bid] = max_draft_tokens - 1;
+                }
+            } else {
+                actual_draft_token_nums[bid] =
+                    actual_draft_token_nums[bid] - 1 >= 1
+                        ? actual_draft_token_nums[bid] - 1
+                        : 1;
+            }
+        }
+
+        if (seq_lens_encoder[bid] != 0) {
+            seq_lens_decoder[bid] += seq_lens_encoder[bid];
+            seq_lens_encoder[bid] = 0;
+        }
+        if (!stop_flags[bid]) {
+            draft_tokens[bid * max_draft_tokens] =
+                accept_tokens[bid * max_draft_tokens + accept_num_now - 1];
+        }
+        if (stop_flag_now_int) {
+            seq_lens_decoder[bid] = 0;
+        }
+    }
+    __syncthreads();
+    typedef cub::BlockReduce<int64_t, THREADBLOCK_SIZE> BlockReduce;
+    __shared__ typename BlockReduce::TempStorage temp_storage;
+
+    int64_t stop_sum = BlockReduce(temp_storage).Sum(stop_flag_now_int);
+
+    if (threadIdx.x == 0) {
+        not_need_stop[0] = stop_sum < real_bsz;
+    }
+}
+
+void SpeculateUpdate(const paddle::Tensor &seq_lens_encoder,
+                       const paddle::Tensor &seq_lens_decoder,
+                       const paddle::Tensor &not_need_stop,
+                       const paddle::Tensor &draft_tokens,
+                       const paddle::Tensor &actual_draft_token_nums,
+                       const paddle::Tensor &accept_tokens,
+                       const paddle::Tensor &accept_num,
+                       const paddle::Tensor &stop_flags,
+                       const paddle::Tensor &seq_lens_this_time,
+                       const paddle::Tensor &is_block_step) {
+    int real_bsz = seq_lens_this_time.shape()[0];
+    auto max_draft_tokens = draft_tokens.shape()[1];
+
+    constexpr int BlockSize = 512;
+
+    speculate_update<BlockSize><<<1, BlockSize, 0, accept_tokens.stream()>>>(
+        const_cast<int *>(seq_lens_encoder.data<int>()),
+        const_cast<int *>(seq_lens_decoder.data<int>()),
+        const_cast<bool *>(not_need_stop.data<bool>()),
+        const_cast<int64_t *>(draft_tokens.data<int64_t>()),
+        const_cast<int *>(actual_draft_token_nums.data<int>()),
+        accept_tokens.data<int64_t>(),
+        accept_num.data<int>(),
+        stop_flags.data<bool>(),
+        seq_lens_this_time.data<int>(),
+        is_block_step.data<bool>(),
+        real_bsz,
+        max_draft_tokens);
+}
+
+PD_BUILD_OP(speculate_update)
+    .Inputs({"seq_lens_encoder",
+             "seq_lens_decoder",
+             "not_need_stop",
+             "draft_tokens",
+             "actual_draft_token_nums",
+             "accept_tokens",
+             "accept_num",
+             "stop_flags",
+             "seq_lens_this_time",
+             "is_block_step"})
+    .Outputs({"seq_lens_encoder_out",
+              "seq_lens_decoder_out",
+              "not_need_stop_out",
+              "draft_tokens_out",
+              "actual_draft_token_nums_out"})
+    .SetInplaceMap({{"seq_lens_encoder", "seq_lens_encoder_out"},
+                    {"seq_lens_decoder", "seq_lens_decoder_out"},
+                    {"not_need_stop", "not_need_stop_out"},
+                    {"draft_tokens", "draft_tokens_out"},
+                    {"actual_draft_token_nums", "actual_draft_token_nums_out"}})
+    .SetKernelFn(PD_KERNEL(SpeculateUpdate));
\ No newline at end of file
diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_verify.cu b/csrc/gpu/speculate_decoding_kernels/speculate_verify.cu
new file mode 100644
index 000000000000..e09cf785bb7f
--- /dev/null
+++ b/csrc/gpu/speculate_decoding_kernels/speculate_verify.cu
@@ -0,0 +1,260 @@
+// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+// 
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// 
+//     http://www.apache.org/licenses/LICENSE-2.0
+// 
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <curand_kernel.h>
+#include <cstdlib>
+#include <string>
+#include "helper.h"
+
+__device__ bool is_in(const int64_t *candidates,
+                      const int64_t draft,
+                      const int candidate_len) {
+    for (int i = 0; i < candidate_len; i++) {
+        if (draft == candidates[i]) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static uint64_t seed = 0;
+static uint64_t offset = 0;
+
+__device__ int64_t topp_sampling_kernel(const int64_t *candidate_ids,
+                                        const float *candidate_scores,
+                                        curandState_t *dev_curand_states,
+                                        const int candidate_len,
+                                        const float topp) {
+    const int tid = threadIdx.x;
+
+    float sum_scores = 0.0f;
+    float rand_top_p = curand_uniform(dev_curand_states + tid) * topp;
+    for (int i = 0; i < candidate_len; i++) {
+        sum_scores += candidate_scores[i];
+        if (rand_top_p <= sum_scores) {
+            return candidate_ids[i];
+        }
+    }
+    return candidate_ids[0];
+}
+
+__global__ void setup_kernel(curandState_t *state,
+                             const uint64_t seed,
+                             const uint64_t offset,
+                             const int bs,
+                             const bool need_batch_random) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (int i = idx; i < bs; i += gridDim.x * blockDim.x) {
+        if (need_batch_random) {
+            curand_init(seed, i, offset, &state[i]);
+        } else {
+            curand_init(seed, 0, offset, &state[i]);
+        }
+    }
+}
+
+__global__ void speculate_verify(int64_t *accept_tokens,
+                                 int *accept_num,
+                                 int64_t *step_idx,
+                                 bool *stop_flags,
+                                 const int *seq_lens_encoder,
+                                 const int *seq_lens_decoder,
+                                 const int64_t *draft_tokens,
+                                 const int *actual_draft_token_nums,
+                                 curandState_t *dev_curand_states,
+                                 const float *topp,
+                                 const int *seq_lens_this_time,
+                                 const int64_t *verify_tokens,
+                                 const float *verify_scores,
+                                 const int64_t *max_dec_len,
+                                 const int64_t *end_tokens,
+                                 const bool *is_block_step,
+                                 const int *output_cum_offsets,
+                                 const int *actual_candidate_len,
+                                 const int real_bsz,
+                                 const int max_draft_tokens,
+                                 const int end_length,
+                                 const int max_seq_len,
+                                 const int max_candidate_len,
+                                 const int verify_window) {
+    const int bid = threadIdx.x;
+    const int start_token_id = bid * max_seq_len - output_cum_offsets[bid];
+    int accept_num_now = 1;
+    int stop_flag_now_int = 0;
+
+    if (!(is_block_step[bid] || bid >= real_bsz)) {
+        if (stop_flags[bid]) {
+            stop_flag_now_int = 1;
+        // 这里 prefill 阶段也会进入，但是因为 draft tokens 会置零，因此会直接到最后的采样阶段
+        } else {
+            auto *verify_tokens_now =
+                verify_tokens + start_token_id * max_candidate_len;
+            auto *draft_tokens_now = draft_tokens + bid * max_draft_tokens;
+            auto *actual_candidate_len_now =
+                actual_candidate_len + start_token_id;
+
+            int i = 0;
+            if (seq_lens_encoder[bid] == 0) {
+                for (; i < seq_lens_this_time[bid] - 1; i++) {
+                    if (verify_tokens_now[i * max_candidate_len] == draft_tokens_now[i + 1]) {
+                        step_idx[bid]++;
+                        auto accept_token = draft_tokens_now[i + 1];
+                        accept_tokens[bid * max_draft_tokens + i] =
+                            accept_token;
+                        if (is_in_end(accept_token, end_tokens, end_length) ||
+                            step_idx[bid] >= max_dec_len[bid]) {
+                            stop_flags[bid] = true;
+                            stop_flag_now_int = 1;
+                            if (step_idx[bid] >= max_dec_len[bid])
+                                accept_tokens[bid * max_draft_tokens + i] =
+                                    end_tokens[0];
+                            break;
+                        } else {
+                            accept_num_now++;
+                        }
+                    } else {
+                        break;
+                    }
+                }
+            }
+            // sampling 阶段
+            // 第一种，draft_token[i+1]被拒绝，需要从 verify_tokens_now[i] 中选一个
+            // 第二种，i == seq_lens_this_time[bid]-1,
+            // 也是从verify_tokens_now[i]中选一个 但是停止的情况不算
+            if (!stop_flag_now_int) {
+                int64_t accept_token;
+                const float *verify_scores_now =
+                    verify_scores + start_token_id * max_candidate_len;
+                step_idx[bid]++;
+                // sampling
+                auto actual_candidate_len_value =
+                    actual_candidate_len_now[i] > max_candidate_len
+                        ? max_candidate_len
+                        : actual_candidate_len_now[i];
+
+                accept_token = topp_sampling_kernel(
+                    verify_tokens_now + i * max_candidate_len,
+                    verify_scores_now + i * max_candidate_len,
+                    dev_curand_states,
+                    actual_candidate_len_value,
+                    topp[bid]);
+
+                accept_tokens[bid * max_draft_tokens + i] = accept_token;
+                if (is_in_end(accept_token, end_tokens, end_length) ||
+                    step_idx[bid] >= max_dec_len[bid]) {
+                    stop_flags[bid] = true;
+                    stop_flag_now_int = 1;
+                    if (step_idx[bid] >= max_dec_len[bid])
+                        accept_tokens[bid * max_draft_tokens + i] =
+                            end_tokens[0];
+                }
+            }
+            accept_num[bid] = accept_num_now;
+        }
+    }
+}
+
+void SpeculateVerify(const paddle::Tensor &accept_tokens,
+                     const paddle::Tensor &accept_num,
+                     const paddle::Tensor &step_idx,
+                     const paddle::Tensor &stop_flags,
+                     const paddle::Tensor &seq_lens_encoder,
+                     const paddle::Tensor &seq_lens_decoder,
+                     const paddle::Tensor &draft_tokens,
+                     const paddle::Tensor &seq_lens_this_time,
+                     const paddle::Tensor &verify_tokens,
+                     const paddle::Tensor &verify_scores,
+                     const paddle::Tensor &max_dec_len,
+                     const paddle::Tensor &end_tokens,
+                     const paddle::Tensor &is_block_step,
+                     const paddle::Tensor &output_cum_offsets,
+                     const paddle::Tensor &actual_candidate_len,
+                     const paddle::Tensor &actual_draft_token_nums,
+                     const paddle::Tensor &topp,
+                     int max_seq_len,
+                     int verify_window) {
+    //   printf("Enter speculate update\n");
+    auto bsz = accept_tokens.shape()[0];
+    int real_bsz = seq_lens_this_time.shape()[0];
+    auto max_draft_tokens = draft_tokens.shape()[1];
+    auto end_length = end_tokens.shape()[0];
+    auto max_candidate_len = verify_tokens.shape()[1];
+
+    constexpr int BlockSize = 512;
+
+    curandState_t *dev_curand_states;
+    cudaMalloc(&dev_curand_states, sizeof(curandState_t) * bsz);
+    setup_kernel<<<1, BlockSize, 0, accept_tokens.stream()>>>(
+        dev_curand_states, seed, offset, bsz, true);
+    seed++;
+    offset++;
+
+    speculate_verify<<<1, BlockSize, 0, accept_tokens.stream()>>>(
+            const_cast<int64_t *>(accept_tokens.data<int64_t>()),
+            const_cast<int *>(accept_num.data<int>()),
+            const_cast<int64_t *>(step_idx.data<int64_t>()),
+            const_cast<bool *>(stop_flags.data<bool>()),
+            seq_lens_encoder.data<int>(),
+            seq_lens_decoder.data<int>(),
+            draft_tokens.data<int64_t>(),
+            actual_draft_token_nums.data<int>(),
+            dev_curand_states,
+            topp.data<float>(),
+            seq_lens_this_time.data<int>(),
+            verify_tokens.data<int64_t>(),
+            verify_scores.data<float>(),
+            max_dec_len.data<int64_t>(),
+            end_tokens.data<int64_t>(),
+            is_block_step.data<bool>(),
+            output_cum_offsets.data<int>(),
+            actual_candidate_len.data<int>(),
+            real_bsz,
+            max_draft_tokens,
+            end_length,
+            max_seq_len,
+            max_candidate_len,
+            verify_window);
+
+
+    cudaFree(dev_curand_states);
+}
+
+PD_BUILD_OP(speculate_verify)
+    .Inputs({"accept_tokens",
+             "accept_num",
+             "step_idx",
+             "seq_lens_encoder",
+             "seq_lens_decoder",
+             "stop_flags",
+             "draft_tokens",
+             "seq_lens_this_time",
+             "verify_tokens",
+             "verify_scores",
+             "max_dec_len",
+             "end_tokens",
+             "is_block_step",
+             "output_cum_offsets",
+             "actual_candidate_len",
+             "actual_draft_token_nums",
+             "topp"})
+    .Outputs({"accept_tokens_out",
+              "accept_num_out",
+              "step_idx_out",
+              "stop_flags_out"})
+    .Attrs({"max_seq_len: int", "verify_window: int", "enable_topp: bool"})
+    .SetInplaceMap({{"accept_tokens", "accept_tokens_out"},
+                    {"accept_num", "accept_num_out"},
+                    {"step_idx", "step_idx_out"},
+                    {"stop_flags", "stop_flags_out"}})
+    .SetKernelFn(PD_KERNEL(SpeculateVerify));
\ No newline at end of file
diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_verify_and_update.cu b/csrc/gpu/speculate_decoding_kernels/speculate_verify_and_update.cu
deleted file mode 100644
index 03cc4fa376e3..000000000000
--- a/csrc/gpu/speculate_decoding_kernels/speculate_verify_and_update.cu
+++ /dev/null
@@ -1,454 +0,0 @@
-// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
-// 
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-// 
-//     http://www.apache.org/licenses/LICENSE-2.0
-// 
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-#include "helper.h"
-#include <curand_kernel.h>
-#include <cstdlib>
-#include <string>
-
-__device__ bool is_in(const int64_t* candidates, const int64_t draft, const int candidate_len) {
-    for (int i = 0; i < candidate_len; i++) {
-        if (draft == candidates[i]) {
-            return true;
-        }
-    }
-    return false;
-}
-
-static uint64_t seed = 0;
-static uint64_t offset = 0;
-
-__device__ int64_t topp_sampling_kernel(const int64_t* candidate_ids,
-        const float* candidate_scores,
-        curandState_t* dev_curand_states,
-        const int candidate_len,
-        const float topp) {
-
-    const int tid = threadIdx.x;
-
-    float sum_scores = 0.0f;
-    float rand_top_p = curand_uniform(dev_curand_states + tid) * topp;
-    for (int i = 0; i < candidate_len; i++) {
-        sum_scores += candidate_scores[i];
-        if (rand_top_p <= sum_scores) {
-            return candidate_ids[i];
-        }
-    }
-    return candidate_ids[0];
-}
-
-__global__ void setup_kernel(curandState_t* state,
-        const uint64_t seed,
-        const uint64_t offset,
-        const int bs,
-        const bool need_batch_random) {
-    int idx = blockIdx.x * blockDim.x + threadIdx.x;
-    for (int i = idx; i < bs; i += gridDim.x * blockDim.x) {
-        if (need_batch_random) {
-            curand_init(seed, i, offset, &state[i]);
-        } else {
-            curand_init(seed, 0, offset, &state[i]);
-        }
-    }
-}
-
-template <int THREADBLOCK_SIZE, bool ENABLE_TOPP, bool USE_TOPK>
-__global__ void speculate_verify_and_update_kernel(int64_t* accept_tokens,
-        int* accept_num,
-        int64_t* step_idx,
-        int* seq_lens_encoder,
-        int* seq_lens_decoder,
-        bool* stop_flags,
-        bool* not_need_stop,
-        int64_t* draft_tokens,
-        int* actual_draft_token_nums,
-        curandState_t* dev_curand_states,
-        const float* topp,
-        const int* seq_lens_this_time,
-        const int64_t* verify_tokens,
-        const float* verify_scores,
-        const int64_t* max_dec_len,
-        const int64_t* end_tokens,
-        const bool* is_block_step,
-        const int* output_cum_offsets,
-        const int* actual_candidate_len,
-        const int real_bsz,
-        const int max_draft_tokens,
-        const int end_length,
-        const int max_seq_len,
-        const int max_candidate_len,
-        const int verify_window) {
-    const int bid = threadIdx.x;
-    // start token's id of bid batch
-    const int start_token_id = bid * max_seq_len - output_cum_offsets[bid];
-    // verify and set stop flags
-    int accept_num_now = 1;
-    int stop_flag_now_int = 0;
-
-    if (!(is_block_step[bid] || bid >= real_bsz)) {
-
-        if (stop_flags[bid]) {
-            stop_flag_now_int = 1;
-        } else { // Here the prefill stage also goes in, but since the draft tokens are zero in prefill stage, it goes straight to the final sampling stage.
-            auto* verify_tokens_now = verify_tokens + start_token_id * max_candidate_len;
-            auto* draft_tokens_now = draft_tokens + bid * max_draft_tokens;
-            auto* actual_candidate_len_now = actual_candidate_len + start_token_id;
-
-            int i = 0;
-            for (; i < seq_lens_this_time[bid] - 1; i++) {
-                if (seq_lens_encoder[bid] != 0) {
-                    break;
-                }
-                if (USE_TOPK) {
-                    if (verify_tokens_now[i * max_candidate_len] == draft_tokens_now[i + 1]) {
-                        accept_num_now++;
-                        step_idx[bid]++;
-                        auto accept_token = draft_tokens_now[i + 1];
-                        accept_tokens[bid * max_draft_tokens + i] = accept_token;
-                        if (is_in_end(accept_token, end_tokens, end_length) || step_idx[bid] >= max_dec_len[bid]) {
-                            stop_flags[bid] = true;
-                            stop_flag_now_int = 1;
-                            if (step_idx[bid] >= max_dec_len[bid])
-                                accept_tokens[bid * max_draft_tokens + i] = end_tokens[0];
-                            break;
-                        }
-                    } else {
-                        break;
-                    }
-                } else {
-                    auto actual_candidate_len_value = actual_candidate_len_now[i] > max_candidate_len
-                            ? max_candidate_len
-                            : actual_candidate_len_now[i];
-                    if (is_in(verify_tokens_now + i * max_candidate_len,
-                                draft_tokens_now[i + 1],
-                                actual_candidate_len_value)) {
-                        // Top P verify
-                        accept_num_now++;
-                        step_idx[bid]++;
-                        auto accept_token = draft_tokens_now[i + 1];
-                        accept_tokens[bid * max_draft_tokens + i] = accept_token;
-                        if (is_in_end(accept_token, end_tokens, end_length) || step_idx[bid] >= max_dec_len[bid]) {
-                            stop_flags[bid] = true;
-                            stop_flag_now_int = 1;
-                            if (step_idx[bid] >= max_dec_len[bid])
-                                accept_tokens[bid * max_draft_tokens + i] = end_tokens[0];
-                            break;
-                        }
-                    } else {
-                        // TopK verify
-                        int ii = i;
-                        if (max_candidate_len >= 2 &&
-                                verify_tokens_now[ii * max_candidate_len + 1] == draft_tokens_now[ii + 1]) { // top-2
-                            int j = 0;
-                            ii += 1;
-                            for (; j < verify_window && ii < seq_lens_this_time[bid] - 1; j++, ii++) {
-                                if (verify_tokens_now[ii * max_candidate_len] != draft_tokens_now[ii + 1]) {
-                                    break;
-                                }
-                            }
-                            if (j >= verify_window) { // accept all
-                                accept_num_now += verify_window + 1;
-                                step_idx[bid] += verify_window + 1;
-                                for (; i < ii; i++) {
-                                    auto accept_token = draft_tokens_now[i + 1];
-                                    accept_tokens[bid * max_draft_tokens + i] = accept_token;
-                                    if (is_in_end(accept_token, end_tokens, end_length) ||
-                                            step_idx[bid] >= max_dec_len[bid]) {
-                                        stop_flags[bid] = true;
-                                        stop_flag_now_int = 1;
-                                        if (step_idx[bid] >= max_dec_len[bid])
-                                            accept_tokens[bid * max_draft_tokens + i] = end_tokens[0];
-                                        break;
-                                    }
-                                }
-                            }
-                        }
-                        break;
-                    }
-                }
-            }
-
-            if (!stop_flag_now_int) {
-                int64_t accept_token;
-                const float* verify_scores_now = verify_scores + start_token_id * max_candidate_len;
-                if (ENABLE_TOPP) {
-                    auto actual_candidate_len_value = actual_candidate_len_now[i] > max_candidate_len
-                            ? max_candidate_len
-                            : actual_candidate_len_now[i];
-                    accept_token = topp_sampling_kernel(verify_tokens_now + i * max_candidate_len,
-                            verify_scores_now + i * max_candidate_len,
-                            dev_curand_states,
-                            actual_candidate_len_value,
-                            topp[bid]);
-                } else {
-                    accept_token = verify_tokens_now[i * max_candidate_len];
-                }
-                accept_tokens[bid * max_draft_tokens + i] = accept_token;
-                if (is_in_end(accept_token, end_tokens, end_length) || step_idx[bid] >= max_dec_len[bid]) {
-                    stop_flags[bid] = true;
-                    stop_flag_now_int = 1;
-                    if (step_idx[bid] >= max_dec_len[bid])
-                        accept_tokens[bid * max_draft_tokens + i] = end_tokens[0];
-                }
-                step_idx[bid]++;
-            }
-
-            seq_lens_decoder[bid] += accept_num_now;
-
-            // For append mode, determine whether to reduce the number of draft tokens depending on whether they are received or not.
-            if (seq_lens_this_time[bid] > 1 && seq_lens_encoder[bid] == 0) {
-                auto current_actual_draft_token_num = actual_draft_token_nums[bid];
-                if (accept_num_now - 1 == current_actual_draft_token_num) {
-                    if (current_actual_draft_token_num + 2 <= max_draft_tokens - 1) {
-                        actual_draft_token_nums[bid] = current_actual_draft_token_num + 2;
-                    } else if (current_actual_draft_token_num + 1 <= max_draft_tokens - 1) {
-                        actual_draft_token_nums[bid] = current_actual_draft_token_num + 1;
-                    } else {
-                        actual_draft_token_nums[bid] = max_draft_tokens - 1;
-                    }
-                } else {
-                    actual_draft_token_nums[bid] =
-                            actual_draft_token_nums[bid] - 1 >= 1 ? actual_draft_token_nums[bid] - 1 : 1;
-                }
-            }
-
-            if (seq_lens_encoder[bid] != 0) {
-                seq_lens_decoder[bid] = seq_lens_encoder[bid];
-                seq_lens_encoder[bid] = 0;
-            }
-
-            accept_num[bid] = accept_num_now;
-            draft_tokens[bid * max_draft_tokens] = accept_tokens[bid * max_draft_tokens + accept_num_now - 1];
-        }
-    }
-    if (stop_flag_now_int) {
-        seq_lens_decoder[bid] = 0;
-    }
-
-    __syncthreads();
-    typedef cub::BlockReduce<int64_t, THREADBLOCK_SIZE> BlockReduce;
-    __shared__ typename BlockReduce::TempStorage temp_storage;
-
-    int64_t stop_sum = BlockReduce(temp_storage).Sum(stop_flag_now_int);
-
-    if (threadIdx.x == 0) {
-        not_need_stop[0] = stop_sum < real_bsz;
-    }
-}
-
-void SpeculateVerifyAndUpdate(const paddle::Tensor& accept_tokens,
-        const paddle::Tensor& accept_num,
-        const paddle::Tensor& step_idx,
-        const paddle::Tensor& seq_lens_encoder,
-        const paddle::Tensor& seq_lens_decoder,
-        const paddle::Tensor& stop_flags,
-        const paddle::Tensor& not_need_stop,
-        const paddle::Tensor& draft_tokens,
-        const paddle::Tensor& seq_lens_this_time,
-        const paddle::Tensor& verify_tokens,
-        const paddle::Tensor& verify_scores,
-        const paddle::Tensor& max_dec_len,
-        const paddle::Tensor& end_tokens,
-        const paddle::Tensor& is_block_step,
-        const paddle::Tensor& output_cum_offsets,
-        const paddle::Tensor& actual_candidate_len,
-        const paddle::Tensor& actual_draft_token_nums,
-        const paddle::Tensor& topp,
-        int max_seq_len,
-        int verify_window,
-        bool enable_topp) {
-    auto bsz = accept_tokens.shape()[0];
-    int real_bsz = seq_lens_this_time.shape()[0];
-    auto max_draft_tokens = draft_tokens.shape()[1];
-    auto end_length = end_tokens.shape()[0];
-    auto max_candidate_len = verify_tokens.shape()[1];
-
-    constexpr int BlockSize = 512;
-
-    curandState_t* dev_curand_states;
-    cudaMalloc(&dev_curand_states, sizeof(curandState_t) * bsz);
-    setup_kernel<<<1, BlockSize, 0, accept_tokens.stream()>>>(dev_curand_states, seed, offset, bsz, true);
-    seed++;
-    offset++;
-
-    auto err = cudaDeviceSynchronize();
-    if (err != 0) {
-        printf("err %d\n", err);
-    }
-
-    err = cudaGetLastError();
-
-    if (err != 0) {
-        printf("err %d\n", err);
-    }
-
-    bool use_topk = false;
-    char* env_var = getenv("SPECULATE_VERIFY_USE_TOPK");
-    if (env_var) {
-        use_topk = (bool)std::stoi(env_var);
-    }
-    if (use_topk) {
-        if (enable_topp) {
-            speculate_verify_and_update_kernel<BlockSize, true, true>
-                    <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast<int64_t*>(accept_tokens.data<int64_t>()),
-                            const_cast<int*>(accept_num.data<int>()),
-                            const_cast<int64_t*>(step_idx.data<int64_t>()),
-                            const_cast<int*>(seq_lens_encoder.data<int>()),
-                            const_cast<int*>(seq_lens_decoder.data<int>()),
-                            const_cast<bool*>(stop_flags.data<bool>()),
-                            const_cast<bool*>(not_need_stop.data<bool>()),
-                            const_cast<int64_t*>(draft_tokens.data<int64_t>()),
-                            const_cast<int*>(actual_draft_token_nums.data<int>()),
-                            dev_curand_states,
-                            topp.data<float>(),
-                            seq_lens_this_time.data<int>(),
-                            verify_tokens.data<int64_t>(),
-                            verify_scores.data<float>(),
-                            max_dec_len.data<int64_t>(),
-                            end_tokens.data<int64_t>(),
-                            is_block_step.data<bool>(),
-                            output_cum_offsets.data<int>(),
-                            actual_candidate_len.data<int>(),
-                            real_bsz,
-                            max_draft_tokens,
-                            end_length,
-                            max_seq_len,
-                            max_candidate_len,
-                            verify_window);
-        } else {
-            speculate_verify_and_update_kernel<BlockSize, false, true>
-                    <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast<int64_t*>(accept_tokens.data<int64_t>()),
-                            const_cast<int*>(accept_num.data<int>()),
-                            const_cast<int64_t*>(step_idx.data<int64_t>()),
-                            const_cast<int*>(seq_lens_encoder.data<int>()),
-                            const_cast<int*>(seq_lens_decoder.data<int>()),
-                            const_cast<bool*>(stop_flags.data<bool>()),
-                            const_cast<bool*>(not_need_stop.data<bool>()),
-                            const_cast<int64_t*>(draft_tokens.data<int64_t>()),
-                            const_cast<int*>(actual_draft_token_nums.data<int>()),
-                            dev_curand_states,
-                            topp.data<float>(),
-                            seq_lens_this_time.data<int>(),
-                            verify_tokens.data<int64_t>(),
-                            verify_scores.data<float>(),
-                            max_dec_len.data<int64_t>(),
-                            end_tokens.data<int64_t>(),
-                            is_block_step.data<bool>(),
-                            output_cum_offsets.data<int>(),
-                            actual_candidate_len.data<int>(),
-                            real_bsz,
-                            max_draft_tokens,
-                            end_length,
-                            max_seq_len,
-                            max_candidate_len,
-                            verify_window);
-        }
-    } else {
-        if (enable_topp) {
-            speculate_verify_and_update_kernel<BlockSize, true, false>
-                    <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast<int64_t*>(accept_tokens.data<int64_t>()),
-                            const_cast<int*>(accept_num.data<int>()),
-                            const_cast<int64_t*>(step_idx.data<int64_t>()),
-                            const_cast<int*>(seq_lens_encoder.data<int>()),
-                            const_cast<int*>(seq_lens_decoder.data<int>()),
-                            const_cast<bool*>(stop_flags.data<bool>()),
-                            const_cast<bool*>(not_need_stop.data<bool>()),
-                            const_cast<int64_t*>(draft_tokens.data<int64_t>()),
-                            const_cast<int*>(actual_draft_token_nums.data<int>()),
-                            dev_curand_states,
-                            topp.data<float>(),
-                            seq_lens_this_time.data<int>(),
-                            verify_tokens.data<int64_t>(),
-                            verify_scores.data<float>(),
-                            max_dec_len.data<int64_t>(),
-                            end_tokens.data<int64_t>(),
-                            is_block_step.data<bool>(),
-                            output_cum_offsets.data<int>(),
-                            actual_candidate_len.data<int>(),
-                            real_bsz,
-                            max_draft_tokens,
-                            end_length,
-                            max_seq_len,
-                            max_candidate_len,
-                            verify_window);
-        } else {
-            speculate_verify_and_update_kernel<BlockSize, false, false>
-                    <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast<int64_t*>(accept_tokens.data<int64_t>()),
-                            const_cast<int*>(accept_num.data<int>()),
-                            const_cast<int64_t*>(step_idx.data<int64_t>()),
-                            const_cast<int*>(seq_lens_encoder.data<int>()),
-                            const_cast<int*>(seq_lens_decoder.data<int>()),
-                            const_cast<bool*>(stop_flags.data<bool>()),
-                            const_cast<bool*>(not_need_stop.data<bool>()),
-                            const_cast<int64_t*>(draft_tokens.data<int64_t>()),
-                            const_cast<int*>(actual_draft_token_nums.data<int>()),
-                            dev_curand_states,
-                            topp.data<float>(),
-                            seq_lens_this_time.data<int>(),
-                            verify_tokens.data<int64_t>(),
-                            verify_scores.data<float>(),
-                            max_dec_len.data<int64_t>(),
-                            end_tokens.data<int64_t>(),
-                            is_block_step.data<bool>(),
-                            output_cum_offsets.data<int>(),
-                            actual_candidate_len.data<int>(),
-                            real_bsz,
-                            max_draft_tokens,
-                            end_length,
-                            max_seq_len,
-                            max_candidate_len,
-                            verify_window);
-        }
-    }
-
-    cudaFree(dev_curand_states);
-}
-
-PD_BUILD_OP(speculate_verify_and_update)
-        .Inputs({"accept_tokens",
-                "accept_num",
-                "step_idx",
-                "seq_lens_encoder",
-                "seq_lens_decoder",
-                "stop_flags",
-                "not_need_stop",
-                "draft_tokens",
-                "seq_lens_this_time",
-                "verify_tokens",
-                "verify_scores",
-                "max_dec_len",
-                "end_tokens",
-                "is_block_step",
-                "output_cum_offsets",
-                "actual_candidate_len",
-                "actual_draft_token_nums",
-                "topp"})
-        .Outputs({"accept_tokens_out",
-                "accept_num_out",
-                "step_idx_out",
-                "seq_lens_encoder_out",
-                "seq_lens_decoder_out",
-                "stop_flags_out",
-                "not_need_stop_out",
-                "draft_tokens_out"})
-        .Attrs({"max_seq_len: int", "verify_window: int", "enable_topp: bool"})
-        .SetInplaceMap({{"accept_tokens", "accept_tokens_out"},
-                {"accept_num", "accept_num_out"},
-                {"step_idx", "step_idx_out"},
-                {"seq_lens_encoder", "seq_lens_encoder_out"},
-                {"seq_lens_decoder", "seq_lens_decoder_out"},
-                {"stop_flags", "stop_flags_out"},
-                {"not_need_stop", "not_need_stop_out"},
-                {"draft_tokens", "draft_tokens_out"}})
-        .SetKernelFn(PD_KERNEL(SpeculateVerifyAndUpdate));
\ No newline at end of file
diff --git a/csrc/gpu/step.cu b/csrc/gpu/step.cu
index f495b6b9981b..b2091464181a 100644
--- a/csrc/gpu/step.cu
+++ b/csrc/gpu/step.cu
@@ -31,6 +31,7 @@ __global__ void free_and_dispatch_block(bool *stop_flags,
                                         int *used_list_len,
                                         int *free_list,
                                         int *free_list_len,
+                                        int64_t *first_token_ids,
                                         const int bsz,
                                         const int block_size,
                                         const int block_num_per_seq,
@@ -43,6 +44,7 @@ __global__ void free_and_dispatch_block(bool *stop_flags,
         int *block_table_now = block_tables + tid * block_num_per_seq;
         if (stop_flags[tid] && !is_block_step[tid]) {
             // 回收block块
+            first_token_ids[tid] = -1;
             const int encoder_block_len = encoder_block_lens[tid];
             const int decoder_used_len = used_list_len[tid];
             if (decoder_used_len > 0) {
@@ -166,11 +168,11 @@ __global__ void recover_block(int *recover_block_list, // [bsz]
                               int *encoder_block_lens,
                               int *used_list_len,
                               const int64_t *next_tokens,
+                              const int64_t *first_token_ids,
                               const int bsz,
                               const int block_num_per_seq,
                               const int length,
-                              const int pre_id_length,
-                              const int first_token_id) {
+                              const int pre_id_length) {
     const int bid = blockIdx.x;
     const int tid = threadIdx.x;
     __shared__ int ori_free_list_len;
@@ -189,7 +191,8 @@ __global__ void recover_block(int *recover_block_list, // [bsz]
             seq_lens_encoder[recover_id] = seq_len;
             stop_flags[recover_id] = false;
             input_ids_now[ori_seq_len_encoder + step_idx_now - 1] = next_tokens[recover_id]; // next tokens
-            input_ids_now[0] = first_token_id; // set first prompt token
+            input_ids_now[0] =
+                first_token_ids[recover_id];  // set first prompt token
             const int ori_free_list_len_tid0 = atomicSub(free_list_len, decoder_used_len);
             ori_free_list_len = ori_free_list_len_tid0;
 #ifdef DEBUG_STEP
@@ -234,9 +237,9 @@ void StepPaddle(const paddle::Tensor& stop_flags,
                 const paddle::Tensor& pre_ids,
                 const paddle::Tensor& step_idx,
                 const paddle::Tensor& next_tokens,
+                const paddle::Tensor &first_token_ids,
                 const int block_size,
                 const int encoder_decoder_block_num,
-                const int64_t first_token_id,
                 const int speculate_step_token_num) {
     auto cu_stream = seq_lens_this_time.stream();
     const int bsz = seq_lens_this_time.shape()[0];
@@ -264,6 +267,7 @@ void StepPaddle(const paddle::Tensor& stop_flags,
         const_cast<int*>(used_list_len.data<int>()),
         const_cast<int*>(free_list.data<int>()),
         const_cast<int*>(free_list_len.data<int>()),
+        const_cast<int64_t *>(first_token_ids.data<int64_t>()),
         bsz,
         block_size,
         block_num_per_seq,
@@ -300,11 +304,11 @@ void StepPaddle(const paddle::Tensor& stop_flags,
             const_cast<int*>(encoder_block_lens.data<int>()),
             const_cast<int*>(used_list_len.data<int>()),
             next_tokens.data<int64_t>(),
+            first_token_ids.data<int64_t>(),
             bsz,
             block_num_per_seq,
             length,
-            pre_id_length,
-            first_token_id
+            pre_id_length
         );
 #ifdef DEBUG_STEP
 #ifdef PADDLE_WITH_HIP
@@ -337,10 +341,10 @@ PD_BUILD_OP(step_paddle)
              "input_ids",
              "pre_ids",
              "step_idx",
-             "next_tokens"})
+             "next_tokens",
+             "first_token_ids",})
     .Attrs({"block_size: int",
             "encoder_decoder_block_num: int",
-            "first_token_id: int64_t",
             "speculate_step_token_num: int"})
     .Outputs({"stop_flags_out",
               "seq_lens_this_time_out",
@@ -358,7 +362,8 @@ PD_BUILD_OP(step_paddle)
               "used_list_len_out",
               "free_list_out",
               "free_list_len_out",
-              "input_ids_out"})
+              "input_ids_out",
+              "first_token_ids_out",})
     .SetInplaceMap({{"stop_flags", "stop_flags_out"},
                     {"seq_lens_this_time", "seq_lens_this_time_out"},
                     {"seq_lens_encoder", "seq_lens_encoder_out"},
@@ -375,5 +380,6 @@ PD_BUILD_OP(step_paddle)
                     {"used_list_len", "used_list_len_out"},
                     {"free_list", "free_list_out"},
                     {"free_list_len", "free_list_len_out"},
-                    {"input_ids", "input_ids_out"}})
+                    {"input_ids", "input_ids_out"},
+                    {"first_token_ids", "first_token_ids_out"}})
     .SetKernelFn(PD_KERNEL(StepPaddle));
\ No newline at end of file
diff --git a/csrc/paddlenlp_ops/__init__.py b/csrc/paddlenlp_ops/__init__.py
new file mode 100644
index 000000000000..afed279ac87c
--- /dev/null
+++ b/csrc/paddlenlp_ops/__init__.py
@@ -0,0 +1,40 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import importlib
+
+import paddle
+
+from paddlenlp.utils.log import logger
+
+cuda_version = float(paddle.version.cuda())
+SUPPORTED_SM_VERSIONS = {70, 75, 80, 86, 89, 90} if cuda_version >= 12.4 else {70, 75, 80, 86, 89}
+
+
+def get_sm_version():
+    prop = paddle.device.cuda.get_device_properties()
+    cc = prop.major * 10 + prop.minor
+    return cc
+
+
+sm_version = get_sm_version()
+if sm_version not in SUPPORTED_SM_VERSIONS:
+    raise RuntimeError("Unsupported SM version")
+module_name = f"paddlenlp_ops.sm{sm_version}"
+
+try:
+    module = importlib.import_module(module_name)
+    globals().update(vars(module))
+except ImportError:
+    logger.WARNING(f"No {module_name} ")
diff --git a/csrc/paddlenlp_ops/sm70/__init__.py b/csrc/paddlenlp_ops/sm70/__init__.py
new file mode 100644
index 000000000000..b3507acbba74
--- /dev/null
+++ b/csrc/paddlenlp_ops/sm70/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp.utils.log import logger
+
+try:
+    from .paddlenlp_ops_70 import *
+except ImportError:
+    logger.WARNING("No paddlenlp_ops_70 ops")
diff --git a/csrc/paddlenlp_ops/sm75/__init__.py b/csrc/paddlenlp_ops/sm75/__init__.py
new file mode 100644
index 000000000000..667f5061f0e9
--- /dev/null
+++ b/csrc/paddlenlp_ops/sm75/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp.utils.log import logger
+
+try:
+    from .paddlenlp_ops_75 import *
+except ImportError:
+    logger.WARNING("No paddlenlp_ops_75 ops")
diff --git a/csrc/paddlenlp_ops/sm80/__init__.py b/csrc/paddlenlp_ops/sm80/__init__.py
new file mode 100644
index 000000000000..6bfec0821b27
--- /dev/null
+++ b/csrc/paddlenlp_ops/sm80/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp.utils.log import logger
+
+try:
+    from .paddlenlp_ops_80 import *
+except ImportError:
+    logger.WARNING("No paddlenlp_ops_80 ops")
diff --git a/csrc/paddlenlp_ops/sm86/__init__.py b/csrc/paddlenlp_ops/sm86/__init__.py
new file mode 100644
index 000000000000..47a614e6c81f
--- /dev/null
+++ b/csrc/paddlenlp_ops/sm86/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp.utils.log import logger
+
+try:
+    from .paddlenlp_ops_86 import *
+except ImportError:
+    logger.WARNING("No paddlenlp_ops_86 ops")
diff --git a/csrc/paddlenlp_ops/sm89/__init__.py b/csrc/paddlenlp_ops/sm89/__init__.py
new file mode 100644
index 000000000000..32f36383e056
--- /dev/null
+++ b/csrc/paddlenlp_ops/sm89/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp.utils.log import logger
+
+try:
+    from .paddlenlp_ops_89 import *
+except ImportError:
+    logger.WARNING("No paddlenlp_ops_89 ops")
diff --git a/csrc/paddlenlp_ops/sm90/__init__.py b/csrc/paddlenlp_ops/sm90/__init__.py
new file mode 100644
index 000000000000..5a5ba3a1da85
--- /dev/null
+++ b/csrc/paddlenlp_ops/sm90/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp.utils.log import logger
+
+try:
+    from .paddlenlp_ops_90 import *
+except ImportError:
+    logger.WARNING("No paddlenlp_ops_90 ops")
diff --git a/csrc/setup.py b/csrc/setup.py
new file mode 100644
index 000000000000..bc5e8dde8834
--- /dev/null
+++ b/csrc/setup.py
@@ -0,0 +1,73 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" setup for EfficentLLM """
+
+import os
+
+from setuptools import find_packages, setup
+
+description = "Paddlenlp_ops : inference framework implemented based on PaddlePaddle"
+VERSION = "0.0.0"
+
+
+def read(file: str):
+    """
+    read file and return content
+    """
+    current_dir = os.path.dirname(__file__)
+    path = os.path.join(current_dir, file)
+    with open(path, "r", encoding="utf-8") as f:
+        content = f.read().strip()
+    return content
+
+
+def read_version():
+    """
+    read version and return content
+    """
+    return VERSION
+
+
+def read_readme():
+    """
+    read README.md and return content
+    """
+    return read("README.md")
+
+
+setup(
+    name="paddlenlp_ops",
+    packages=find_packages(),
+    version="0.0.0",
+    author="Paddle Infernce Team",
+    author_email="paddle-inference@baidu.com",
+    description=description,
+    long_description=read_readme(),
+    long_description_content_type="text/markdown",
+    url="",
+    python_requires=">=3.8",
+    package_dir={"paddlenlp_ops": "paddlenlp_ops/"},
+    package_data={"paddlenlp_ops": ["sm70/*", "sm75/*", "sm80/*", "sm86/*", "sm89/*", "sm90/*"]},
+    include_package_data=True,
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "Programming Language :: Python :: 3.8",
+        "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
+        "License :: OSI Approved :: Apache Software License",
+        "Operating System :: OS Independent",
+    ],
+    license="Apache 2.0",
+)
diff --git a/csrc/setup_cuda.py b/csrc/setup_cuda.py
index 6e0ce8e20658..7f240cdcf575 100644
--- a/csrc/setup_cuda.py
+++ b/csrc/setup_cuda.py
@@ -19,6 +19,8 @@
 import paddle
 from paddle.utils.cpp_extension import CUDAExtension, setup
 
+sm_version = int(os.getenv("CUDA_SM_VERSION", "0"))
+
 
 def update_git_submodule():
     try:
@@ -38,9 +40,12 @@ def find_end_files(directory, end_str):
 
 
 def get_sm_version():
-    prop = paddle.device.cuda.get_device_properties()
-    cc = prop.major * 10 + prop.minor
-    return cc
+    if sm_version > 0:
+        return sm_version
+    else:
+        prop = paddle.device.cuda.get_device_properties()
+        cc = prop.major * 10 + prop.minor
+        return cc
 
 
 def strtobool(v):
@@ -77,8 +82,6 @@ def get_gencode_flags():
 gencode_flags = get_gencode_flags()
 library_path = os.environ.get("LD_LIBRARY_PATH", "/usr/local/cuda/lib64")
 
-sm_version = get_sm_version()
-
 sources = [
     "./gpu/save_with_output.cc",
     "./gpu/set_value_by_flags.cu",
@@ -103,6 +106,8 @@ def get_gencode_flags():
     "./gpu/step.cu",
     "./gpu/quant_int8.cu",
     "./gpu/dequant_int8.cu",
+    "./gpu/get_position_ids.cu",
+    "./gpu/fused_rotary_position_encoding.cu",
     "./gpu/flash_attn_bwd.cc",
     "./gpu/tune_cublaslt_gemm.cu",
     "./gpu/sample_kernels/top_p_sampling_reject.cu",
@@ -174,8 +179,9 @@ def get_gencode_flags():
         "gpu/fp8_gemm_with_cutlass/fp8_fp8_fp8_dual_gemm.cu",
     ]
 
+ops_name = f"paddlenlp_ops_{sm_version}" if sm_version != 0 else "paddlenlp_ops"
 setup(
-    name="paddlenlp_ops",
+    name=ops_name,
     ext_modules=CUDAExtension(
         sources=sources,
         extra_compile_args={"cxx": ["-O3"], "nvcc": nvcc_compile_args},
diff --git a/csrc/tools/build_wheel.sh b/csrc/tools/build_wheel.sh
new file mode 100644
index 000000000000..e3d6ad9fbd97
--- /dev/null
+++ b/csrc/tools/build_wheel.sh
@@ -0,0 +1,193 @@
+#!/usr/bin/env bash
+
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+PYTHON_VERSION=python
+PYTHON_VERSION=${1:-$PYTHON_VERSION}
+SM_VERSION=${2:-$SM_VERSION}
+export python=$PYTHON_VERSION
+
+# directory config
+DIST_DIR="dist"
+BUILD_DIR="build"
+EGG_DIR="paddlenlp_ops.egg-info"
+
+# custom_ops directory config
+OPS_SRC_DIR="./"
+OPS_BUILD_DIR="build"
+OPS_EGG_DIR="paddlenlp_ops_*.egg-info"
+# OPS_TMP_DIR_BASE="tmp_base"
+OPS_TMP_DIR="tmp_*"
+
+# TEST_DIR="tests"
+
+# command line log config
+RED='\033[0;31m'
+BLUE='\033[0;34m'
+GREEN='\033[1;32m'
+BOLD='\033[1m'
+NONE='\033[0m'
+
+
+function python_version_check() {
+  PY_MAIN_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $1}'`
+  PY_SUB_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $2}'`
+  echo -e "find python version ${PY_MAIN_VERSION}.${PY_SUB_VERSION}"
+  if [ $PY_MAIN_VERSION -ne "3" -o $PY_SUB_VERSION -lt "8" ]; then
+    echo -e "${RED}FAIL:${NONE} please use Python >= 3.8 !"
+    exit 1
+  fi
+}
+
+function init() {
+    echo -e "${BLUE}[init]${NONE} removing building directory..."
+    rm -rf $DIST_DIR $BUILD_DIR $EGG_DIR
+    if [ `${python} -m pip list | grep paddlenlp_ops | wc -l` -gt 0  ]; then
+      echo -e "${BLUE}[init]${NONE} uninstalling paddlenlp_ops..."
+      ${python} -m pip uninstall -y paddlenlp_ops
+    fi
+
+    ${python} -m pip install setuptools_scm
+    echo -e "${BLUE}[init]${NONE} ${GREEN}init success\n"
+}
+
+function generate_sm_versions_and_build_ops() {
+   cuda_version=`${python} -c "import paddle; print(float(paddle.version.cuda()))"`
+   echo "CUDA version is: $cuda_version"
+   if [ ! -z "$SM_VERSION" ]; then
+      sm_versions=($SM_VERSION )
+   elif echo "$cuda_version >= 12.4" | awk '{if ($0) exit 0; exit 1}'; then
+       sm_versions=(70 80 80 86 89 90 )
+   else
+       sm_versions=(70 75 80 86 89 ) 
+    fi 
+    
+    for sm_version in "${sm_versions[@]}"; do
+        echo "Building and installing for sm_version: $sm_version"
+        build_and_install_ops $sm_version
+    done
+    return 
+}
+
+function copy_ops(){
+    local sm_version="$1"
+    OPS_VERSION="0.0.0"
+    PY_MAIN_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $1}'`
+    PY_SUB_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $2}'`
+    PY_VERSION="py${PY_MAIN_VERSION}.${PY_SUB_VERSION}"
+    SYSTEM_VERSION=`${python} -c "import platform; print(platform.system().lower())"`
+    PROCESSER_VERSION=`${python} -c "import platform; print(platform.processor())"`
+    WHEEL_NAME="paddlenlp_ops_${sm_version}-${OPS_VERSION}-${PY_VERSION}-${SYSTEM_VERSION}-${PROCESSER_VERSION}.egg"
+    echo -e "gpu ops -- paddlenlp_ops_${sm_version} ..."
+    cp -r ./tmp_${sm_version}/${WHEEL_NAME}/* ./paddlenlp_ops/sm${sm_version}
+    return
+}
+
+function build_and_install_ops() {
+  local sm_version="$1"
+  cd $OPS_SRC_DIR
+  export no_proxy=bcebos.com,paddlepaddle.org.cn,${no_proxy}
+  echo -e "${BLUE}[build]${NONE} build and install paddlenlp_ops_sm${sm_version} ops..."
+  CUDA_SM_VERSION=${sm_version} ${python} setup_cuda.py install --install-lib tmp_${sm_version}
+  echo -e "${BLUE}[build]${NONE} build and install paddlenlp_ops_${sm_version}..."
+  if [ $? -ne 0 ]; then
+    echo -e "${RED}[FAIL]${NONE} build paddlenlp_ops_${sm_version} wheel failed !"
+    exit 1
+  fi
+  echo -e "${BLUE}[build]${NONE} ${GREEN}build paddlenlp_ops_sm${sm_version} wheel success\n"
+
+  copy_ops "${sm_version}"
+}
+
+function build_and_install() {
+  echo -e "${BLUE}[build]${NONE} building paddlenlp_ops wheel..."
+  ${python} setup.py bdist_wheel
+  if [ $? -ne 0 ]; then
+    echo -e "${RED}[FAIL]${NONE} build paddlenlp_ops wheel failed !"
+    exit 1
+  fi
+  echo -e "${BLUE}[build]${NONE} ${GREEN}build paddlenlp_ops wheel success\n"
+
+  echo -e "${BLUE}[install]${NONE} installing paddlenlp_ops..."
+  cd $DIST_DIR
+  find . -name "paddlenlp_ops*.whl" | xargs ${python} -m pip install
+  if [ $? -ne 0 ]; then
+    cd ..
+    echo -e "${RED}[FAIL]${NONE} install paddlenlp_ops wheel failed !"
+    exit 1
+  fi
+  echo -e "${BLUE}[install]${NONE} ${GREEN}paddlenlp_ops install success\n"
+  cd ..
+}
+
+
+function unittest() {
+  # run UT
+  echo -e "${BLUE}[unittest]${NONE} ${GREEN}unittests success\n${NONE}"
+}
+
+function cleanup() {
+  rm -rf $BUILD_DIR $EGG_DIR
+  ${python} -m pip uninstall -y paddlenlp_ops
+
+  rm -rf $OPS_SRC_DIR/$BUILD_DIR $OPS_SRC_DIR/$EGG_DIR $OPS_SRC_DIR/$OPS_TMP_DIR
+}
+
+function abort() {
+  echo -e "${RED}[FAIL]${NONE} build wheel and unittest failed !
+          please check your code" 1>&2
+
+  cur_dir=`basename "$pwd"`
+
+  rm -rf $BUILD_DIR $EGG_DIR $DIST_DIR
+  ${python} -m pip uninstall -y paddlenlp_ops
+
+  rm -rf $OPS_SRC_DIR/$OPS_BUILD_DIR $OPS_SRC_DIR/$OPS_EGG_DIR $OPS_SRC_DIR/$OPS_TMP_DIR
+}
+
+python_version_check
+
+trap 'abort' 0
+set -e
+
+init
+generate_sm_versions_and_build_ops
+build_and_install
+unittest
+cleanup
+
+# get Paddle version
+PADDLE_VERSION=`${python} -c "import paddle; print(paddle.version.full_version)"`
+PADDLE_COMMIT=`${python} -c "import paddle; print(paddle.version.commit)"`
+
+# get paddlenlp_ops version
+EFFLLM_BRANCH=`git rev-parse --abbrev-ref HEAD`
+EFFLLM_COMMIT=`git rev-parse --short HEAD`
+
+# get Python version
+PYTHON_VERSION=`${python} -c "import platform; print(platform.python_version())"`
+
+echo -e "\n${GREEN}paddlenlp_ops wheel compiled and checked success !${NONE}
+        ${BLUE}Python version:${NONE} $PYTHON_VERSION
+        ${BLUE}Paddle version:${NONE} $PADDLE_VERSION ($PADDLE_COMMIT)
+        ${BLUE}paddlenlp_ops branch:${NONE} $EFFLLM_BRANCH ($EFFLLM_COMMIT)\n"
+
+echo -e "${GREEN}wheel saved under${NONE} ${RED}${BOLD}./dist${NONE}"
+
+# install wheel
+${python} -m pip install ./dist/paddlenlp_ops*.whl
+echo -e "${GREEN}wheel install success!${NONE}\n"
+
+trap 0
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
index 7234c729bcfc..2e5299367819 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -49,18 +49,15 @@
    :maxdepth: 1
    :caption: 飞桨大模型
 
-   大模型预训练文档 <llm/docs/pretrain.rst>
-   大模型精调文档 <llm/docs/finetune.md>
-   大模型FlashMask算法 <llm/docs/flashmask.md>
-   大模型常用算法文档 <llm/docs/algorithm_overview.md>
-   大模型RLHF文档 <llm/docs/rlhf.md>
-   大模型量化教程 <llm/docs/quantization.md>
-   大模型推理教程 <llm/docs/inference.md>
-   大模型统一存储文档 <llm/docs/unified_checkpoint.md>
-   混合并行训练教程 <llm/docs/llm_trainer.rst>
-   模型权重转换教程 <llm/docs/torch2paddle.md>
-   大模型DPO文档 <llm/docs/dpo.md>
-
+   飞桨大模型主文档 <llm/README.md>
+   大模型-预训练文档 <llm/docs/pretrain.rst>
+   大模型-精调文档 <llm/docs/finetune.md>
+   大模型-DPO文档 <llm/docs/dpo.md>
+   大模型-RLHF文档 <llm/docs/rlhf.md>
+   大模型-推理部署教程 <llm/docs/predict/index.rst>
+   大模型-量化教程 <llm/docs/quantization.md>
+   大模型-高级技术文档 <llm/docs/advanced.rst>
+   
 .. toctree::
    :maxdepth: 1
    :caption: 模型库
diff --git a/docs/llm/docs/advanced.rst b/docs/llm/docs/advanced.rst
new file mode 100644
index 000000000000..fd237eaed1f0
--- /dev/null
+++ b/docs/llm/docs/advanced.rst
@@ -0,0 +1,15 @@
+============
+大模型技术文档
+============
+
+.. toctree::
+   :maxdepth: 1
+
+   unified_checkpoint.md
+   llm_trainer.rst
+   flashmask.md
+   mergekit.md
+   chat_template.md
+   torch2paddle.md
+   
+
diff --git a/docs/llm/llm_trainer.rst b/docs/llm/docs/llm_trainer.rst
similarity index 100%
rename from docs/llm/llm_trainer.rst
rename to docs/llm/docs/llm_trainer.rst
diff --git a/docs/llm/docs/predict/devices.rst b/docs/llm/docs/predict/devices.rst
new file mode 100644
index 000000000000..e0d8e5bbb92b
--- /dev/null
+++ b/docs/llm/docs/predict/devices.rst
@@ -0,0 +1,13 @@
+============
+大模型异构设备推理
+============
+
+.. toctree::
+   :maxdepth: 1
+
+   昆仑 XPU <../../devices/xpu/llama/README.md>
+   昇腾 NPU <../../devices/npu/llama/README.md>
+   海光 K100 <../dcu_install.md>
+   燧原 GCU <../../devices/gcu/llama/README.md>
+   太初 SDAA <../../devices/sdaa/llama/README.md>
+   X86 CPU <../cpu_install.md>
\ No newline at end of file
diff --git a/docs/llm/docs/predict/index.rst b/docs/llm/docs/predict/index.rst
new file mode 100644
index 000000000000..6f4a986b2239
--- /dev/null
+++ b/docs/llm/docs/predict/index.rst
@@ -0,0 +1,14 @@
+============
+大模型推理
+============
+
+.. toctree::
+   :maxdepth: 1
+
+   installation.md
+   inference.md
+   ../../server/docs/deploy_usage_tutorial.md
+   best_practices.md
+   speculative_decoding.md
+   各个模型推理量化教程 <models.rst>
+   大模型异构设备推理 <devices.rst>
\ No newline at end of file
diff --git a/docs/llm/docs/predict/models.rst b/docs/llm/docs/predict/models.rst
new file mode 100644
index 000000000000..907ef5948b99
--- /dev/null
+++ b/docs/llm/docs/predict/models.rst
@@ -0,0 +1,10 @@
+============
+各个模型推理、量化教程
+============
+
+.. toctree::
+   :maxdepth: 1
+
+   llama.md
+   qwen.md
+   mixtral.md
\ No newline at end of file
diff --git a/docs/llm/server/index.rst b/docs/llm/server/index.rst
new file mode 100644
index 000000000000..2731969abfd4
--- /dev/null
+++ b/docs/llm/server/index.rst
@@ -0,0 +1,9 @@
+============
+大模型服务化模型部署
+============
+
+.. toctree::
+   :maxdepth: 1
+
+   README.md
+   docs/deploy_usage_tutorial.md
\ No newline at end of file
diff --git a/llm/README.md b/llm/README.md
index 728f311295f4..a216d6f9fb2a 100644
--- a/llm/README.md
+++ b/llm/README.md
@@ -37,6 +37,11 @@
 
 ## 🚀 快速开始 🚀
 
+开始之前，您可以安装先 PaddleNLP 最新 develop 版本:
+```shell
+pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
+```
+
 ### 1. 预训练
 
 PaddleNLP 将飞桨4D 并行策略加入到 Trainer API 中， 用户只需修改 Trainer 配置即可使用不同的分布式策略。目前大模型套件提供[LLaMA/LLaMA2/LLaMA3](./config/llama)、[GPT-3](./config/gpt-3)、[Qwen](./config/qwen)、[Baichuan/Baichuan2](./config/baichuan)、[Mixtral](./config/mixtral) 等模型预训练功能，更多模型支持持续更新中。
@@ -73,19 +78,30 @@ mkdir data
 mv llama_openwebtext_100k.bin ./data
 mv llama_openwebtext_100k.idx ./data
 ```
+单卡训练:
+```shell
+# 16G 显存可训练
+python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
+```
+- 该配置16G 显存可训练，可以开启 use_flash_attention,use_fused_rms_norm,recompute 进一步省显存
+- 如果上述配置无法开启，或显存依然不够，可以开启`offload_optim`,此时显存约为11G  `python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json  --offload_optim  1`
 
+高性能、多卡、多机训练:
 ```shell
 # 编译自定义算子，可选
 cd ../slm/model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -
 
-# 模型预训练参考
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# 多卡模型预训练参考:
+python -u  -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# 多机训练参考: 占用45G显存左右
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7"  --master=192.168.1.1:8090 --nnodes=2  run_pretrain.py ./config/llama/pretrain_argument.json
 ```
+- 更详细的分布式启动命令请参考[这里](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.6/api/paddle/distributed/launch_cn.html#launch)。
 
 注意：
 
 1. 建议使用 paddle develop 版本训练，需要安装`pip install fast_dataindex visualdl==2.5.3`等相关缺失 whl 包
-2. `use_flash_attention` 需要在 A100机器开启，建议使用 cuda11.8环境。
+2. `use_flash_attention` 需要在 A100 以上机器开启，建议使用 cuda11.8以上环境。
 3. `use_fused_rms_norm` 需要安装自定义算子。如果安装后仍然找不到算子，需要额外设置 PYTHONPATH
 4. `continue_training` 表示从现有的预训练模型加载训练。7b 模型初始 loss 大概为2.xx, 随机初始化模型 loss 从11.x 左右下降。
 5. 多机训练时，若各机器使用的训练数据文件位置相同（例如挂载共享硬盘情况），请指定`--share_folder true`使全局0号卡制作缓存数据。否则默认各台机器的0号卡独立制作缓存数据，
@@ -125,29 +141,45 @@ PaddleNLP 支持多个主流大模型的 SFT、PEFT 等精调策略，提供统
 为了方便测试，我们也提供了[tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)demo 数据集可以直接使用：
 
 ```shell
+# 在 PaddleNLP/llm 目录执行
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz
 tar -xvf alpaca_demo.gz
 ```
 
 #### 2.2 全参精调：SFT
 
+单卡
+```bash
+# 需要12G显存左右
+python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
+# 单卡性能最佳实践，16G显存，可以参考打开开关。
+# ./config/qwen/sft_argument_0p5b_best.json
+```
+
+多卡
 ```bash
-# SFT 启动命令参考
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+# SFT 启动命令参考，需要45G显存左右
+python -u  -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_finetune.py ./config/qwen/sft_argument.json
 ```
 
 #### 2.3 LoRA
 
+LoRA 启动命令参考
 ```bash
-# LoRA 启动命令参考
-python  run_finetune.py ./config/llama/lora_argument.json
+# 需要9G左右显存
+python run_finetune.py ./config/qwen/lora_argument_0p5b.json
+# 需要29G左右显存
+python run_finetune.py ./config/qwen/lora_argument.json
 ```
 
 #### 2.4 Prefix Tuning
 
+Prefix Tuning 启动命令参考
 ```bash
-# Prefix Tuning 启动命令参考
-python  run_finetune.py ./config/llama/pt_argument.json
+# 需要10G左右显存
+python run_finetune.py ./config/qwen/pt_argument_0p5b.json
+# 需要30G左右显存
+python run_finetune.py ./config/qwen/pt_argument.json
 ```
 
 除了 LoRA、Prefix Tuning 外，还支持 LoKr、VeRA、MoRA、ReFT、rsLoRA、LoRA+、PiSSA、MoSLoRA 等多种精调算法，更多大模型精调使用文档、训练细节和效果请参见[大模型精调教程](./docs/finetune.md)。
@@ -192,18 +224,26 @@ tar -zxvf ultrafeedback_binarized.tar.gz
 
 ##### 全参 DPO
 
+
 ```bash
-# DPO 启动命令参考
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+# DPO 启动命令参考, 8卡训练， 需要大概40G显存
+python -u  -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+
+# 单卡训练，大概需要26G显存左右
+python -u  ./alignment/dpo/run_dpo.py ./config/qwen/dpo_argument_0p5b.json
 ```
 
 ##### LoRA DPO
 
 ```bash
 # DPO 启动命令参考
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
+python -u  -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
 ```
 更多 DPO 技术细节和使用说明详见[DPO 文档](./docs/dpo.md)。
+```bash
+# 需要52G左右显存
+python -u  ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
+```
 
 #### 3.2 KTO
 
@@ -240,13 +280,13 @@ tar -zxvf ultrafeedback_binarized.tar.gz
 
 ```bash
 # KTO 启动命令参考
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
+python -u  -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
 ```
 ##### LoRA KTO
 
 ```bash
 # KTO 启动命令参考
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
+python -u  -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
 ```
 
 #### 3.3 RLHF
@@ -322,13 +362,22 @@ PaddleNLP 提供高性能推理，内置动态插入和全环节算子融合策
      </font>
 </div>
 
+
+<a id="paddlenlpops"></a>
+paddlenlp_ops 安装高性能推理算子教程（可选）
+```shell
+cd ../csrc/
+python setup_cuda.py install
+cd -
+```
+
 ```shell
 # 动态图模型推理命令参考
-python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16
+python ./predict/predictor.py --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct --inference_model --dtype float16
 
 # 静态图模型推理命令参考
 # step1 : 静态图导出
-python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct --inference_model --output_path ./inference --dtype float16
 # step2: 静态图推理
 python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static"
 ```
@@ -346,35 +395,80 @@ python ./predict/predictor.py --model_name_or_path ./inference --inference_model
 
 我们提供了一套基于动态图推理的简单易用 UI 服务化部署方法，用户可以快速部署服务化推理。
 
+请确保，在部署前请确保已正确安装 NLP，clone 本 repo 下位置代码。以及自定义算子库。本部署的服务是兼容 OpenAI API 接口
+
+
+
 环境准备
 
 - python >= 3.8
 - gradio
 - flask
+- paddlenlp_ops (可选，高性能自定义加速算子， 安装参考[这里](#paddlenlpops))
 
 
 服务化部署脚本
 
 ```shell
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./predict/flask_server.py \
-    --model_name_or_path meta-llama/Llama-2-7b-chat \
+# 单卡，可以使用 paddle.distributed.launch 启动多卡推理
+python  ./predict/flask_server.py \
+    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
     --port 8010 \
     --flask_port 8011 \
     --dtype "float16"
 ```
 
-- `port`: Gradio UI 服务端口号，默认8011。
-- `flask_port`: Flask 服务端口号，默认8010。
+- `port`: Gradio UI 服务端口号，默认8010。
+- `flask_port`: Flask 服务端口号，默认8011。
 - 其他参数请参见[推理文档](./docs/predict/inference.md)中推理参数配置。
 
-此外，如果想通过 API 脚本的方式跑推理，可参考：`./predict/request_flask_server.py` 文件。
+图形化界面: 打开 `http://127.0.0.1:8010` 即可使用 gradio 图形化界面，即可开启对话。
+API 访问: 您也可用通过 flask 服务化 API 的形式
+
+1. 可参考：`./predict/request_flask_server.py` 文件访问。
+```shell
+python predict/request_flask_server.py
+```
+
+2. 或者直接使用 curl,调用开始对话
+```shell
+curl 127.0.0.1:8011/v1/chat/completions \
+-H 'Content-Type: application/json' \
+-d '{"message": [{"role": "user", "content": "你好"}]}'
+```
+3. 使用 OpenAI 客户端调用：
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="EMPTY",
+    base_url="http://localhost:8011/v1/",
+)
+
+# Completion API
+stream = True
+completion = client.chat.completions.create(
+    model="paddlenlp",
+    messages=[
+        {"role": "user", "content": "PaddleNLP好厉害！这句话的感情色彩是？"}
+    ],
+    max_tokens=1024,
+    stream=stream,
+)
+
+if stream:
+    for c in completion:
+        print(c.choices[0].delta.content, end="")
+else:
+    print(completion.choices[0].message.content)
+```
 
 
 #### 7.2 大模型服务化部署工具
 
 该部署工具是基于英伟达 Triton 框架专为服务器场景的大模型服务化部署而设计。它提供了支持 gRPC、HTTP 协议的服务接口，以及流式 Token 输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化（PTQ）等加速优化策略，为用户带来易用且高性能的部署体验。
 
-基于预编译镜像部署，本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例，更多模型请参考[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md), 更细致的模型推理、量化教程可以参考[大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md)：
+基于预编译镜像部署，本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例，更细致的模型推理、量化教程可以参考[大模型推理教程](./docs/predict/inference.md)：
 
 ```shell
 # 下载模型
@@ -401,7 +495,8 @@ curl 127.0.0.1:9965/v1/chat/completions \
 Note:
 1. 请保证 shm-size >= 5，不然可能会导致服务启动失败
 
-更多关于该部署工具的使用方法，请查看[服务化部署流程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/server/docs/deploy_usage_tutorial.md)
+更多模型请参考[LLaMA](./docs/predict/llama.md)、[Qwen](./docs/predict/qwen.md)、[Mixtral](./docs/predict/mixtral.md)。
+更多关于该部署工具的使用方法，请查看[服务化部署流程](./server/docs/deploy_usage_tutorial.md)
 
 ### 8. PyTorch 模型权重转换
 
diff --git a/llm/application/information_extraction/README.md b/llm/application/information_extraction/README.md
new file mode 100644
index 000000000000..29b542f1beaa
--- /dev/null
+++ b/llm/application/information_extraction/README.md
@@ -0,0 +1,419 @@
+# 通用信息抽取大模型 PP-UIE
+
+ **目录**
+
+- [1. 模型简介](#模型简介)
+- [2. 开箱即用](#开箱即用)
+  - [2.1 实体抽取](#实体抽取)
+  - [2.2 关系抽取](#关系抽取)
+  - [2.3 模型选择](#模型选择)
+  - [2.4 更多配置](#更多配置)
+- [3. 训练定制](#训练定制)
+  - [3.1 代码结构](#代码结构)
+  - [3.2 数据标注](#数据标注)
+  - [3.3 模型微调](#模型微调)
+  - [3.4 定制模型一键预测](#定制模型一键预测)
+  - [3.5 实验指标](#实验指标)
+
+<a name="模型简介"></a>
+
+## 1. 模型简介
+
+通用信息抽取大模型（PP-UIE）是 PaddleNLP 团队基于开源模型和高质量数据集构建的通用信息抽取大模型， PaddleNLP 基于百度 UIE 的建模思路，通过大模型的能力来训练并开源了一款面向中、英文通用信息抽取的大模型。 支持统一训练信息抽取任务包括命名实体识别（NER），关系抽取（RE）和事件抽取（EE）。模型共包含0.5B、1.5B、7B 和14B 共4个版本，以适配不同场景下信息抽取任务使用。在多个数据集（包含 Boson、CLUENER、CCIR2021等常见数据）相比其他通用信息抽取大模型在 ACC 和 F1 指标上有大幅度提升。
+
+
+
+<a name="开箱即用"></a>
+
+## 2. 开箱即用
+
+```paddlenlp.Taskflow```提供通用信息抽取等能力，可抽取多种类型的信息，包括但不限于命名实体识别（如人名、地名、机构名等）、关系（如电影的导演、歌曲的发行时间等）、事件（如某路口发生车祸、某地发生地震等）等信息。用户可以使用自然语言自定义抽取目标，无需训练即可统一抽取输入文本中的对应信息。**实现开箱即用，并满足各类信息抽取需求**
+
+<a name="实体抽取"></a>
+
+#### 2.1 实体抽取
+
+  命名实体识别（Named Entity Recognition，简称 NER），是指识别文本中具有特定意义的实体。在开放域信息抽取中，抽取的类别没有限制，用户可以自己定义。
+
+  - 例如抽取的目标实体类型是"时间"、"选手"和"赛事名称", schema 构造如下：
+
+    ```text
+    ['时间', '选手', '赛事名称']
+    ```
+
+    调用示例：
+
+    ```python
+    from pprint import pprint
+    from paddlenlp import Taskflow
+
+    schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
+    ie = Taskflow('information_extraction',
+                  schema= ['时间', '选手', '赛事名称'],
+                  schema_lang="zh",
+                  batch_size=1,
+                  model='paddlenlp/PP-UIE-0.5B')
+    pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌！")) # Better print results using pprint
+    # 输出
+    [{'时间': [{'text': '2月8日上午'}],
+      '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
+      '选手': [{'text': '谷爱凌'}]}]
+    ```
+
+
+<a name="关系抽取"></a>
+
+#### 2.2 关系抽取
+
+  关系抽取（Relation Extraction，简称 RE），是指从文本中识别实体并抽取实体之间的语义关系，进而获取三元组信息，即<主体，谓语，客体>。
+
+  - 例如以"竞赛名称"作为抽取主体，抽取关系类型为"主办方"、"承办方"和"时间", schema 构造如下：
+
+    ```text
+    {
+      '竞赛名称': [
+        '主办方',
+        '承办方',
+        '时间'
+      ]
+    }
+    ```
+
+    调用示例：
+
+    ```python
+    schema = {'竞赛名称': ['主办方', '承办方', '时间']} # Define the schema for relation extraction
+    ie.set_schema(schema) # Reset schema
+    pprint(ie('2022年语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办，百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办，已连续举办4届，成为全球最热门的中文NLP赛事之一。'))
+    # 输出
+    [{'竞赛名称': [{'relations': {'主办方': [{'text': '中国中文信息学会,中国计算机学会'}],
+                          '时间': [{'text': '2022年'}],
+                          '承办方': [{'text': '百度公司,中国中文信息学会评测工作委员会,中国计算机学会自然语言处理专委会'}]},
+            'text': '语言与智能技术竞赛'}]}]
+    ```
+
+<a name="模型选择"></a>
+
+#### 2.3 模型选择
+
+- 多模型选择，满足精度、速度要求
+
+  | 模型 |  结构  | 语言 |
+  | :---: | :--------: | :--------: |
+  | `paddlenlp/PP-UIE-0.5B` | 24-layers, 896-hidden, 14-heads | 中、英文 |
+  | `paddlenlp/PP-UIE-1.5B` | 28-layers, 1536-hidden, 12-heads | 中、英文 |
+  | `paddlenlp/PP-UIE-7B` | 28-layers, 3584-hidden, 28-heads | 中、英文 |
+  | `paddlenlp/PP-UIE-14B` | 48-layers, 5120-hidden, 40-heads | 中、英文 |
+
+<a name="更多配置"></a>
+
+#### 2.4 更多配置
+
+```python
+>>> from paddlenlp import Taskflow
+
+>>> ie = Taskflow('information_extraction',
+                  schema = {'竞赛名称': ['主办方', '承办方', '时间']},
+                  schema_lang="zh",
+                  batch_size=1,
+                  model='paddlenlp/PP-UIE-0.5B',
+                  precision='float16')
+```
+
+* `schema`：定义任务抽取目标，可参考开箱即用中不同任务的调用示例进行配置。
+* `schema_lang`：设置 schema 的语言，默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同，因此需要指定 schema 的语言。
+* `batch_size`：批处理大小，请结合机器情况进行调整，默认为1。
+* `model`：选择任务使用的模型，默认为`paddlenlp/PP-UIE-0.5B`，可选有`paddlenlp/PP-UIE-0.5B`, `paddlenlp/PP-UIE-1.5B`, `paddlenlp/PP-UIE-7B`, `paddlenlp/PP-UIE-14B`。
+* `precision`：选择模型精度，默认为`float16`，可选有`float16`、`bfloat16`和`float32`和。如果选择`float16`，在 GPU 硬件环境下，请先确保机器正确安装 NVIDIA 相关驱动和基础软件，**确保 CUDA>=11.2，cuDNN>=8.1.1**，初次使用需按照提示安装相关依赖。其次，需要确保 GPU 设备的 CUDA 计算能力（CUDA Compute Capability）大于7.0，典型的设备包括 V100、T4、A10、A100、GTX 20系列和30系列显卡等。如果选择`bfloat16`，能有效加速处理大模型和批量数据，尤其与混合精度结合使用时性能表现更优。但需确保硬件和软件环境支持该精度。支持 `bfloat16`的硬件包括 NVIDIA A100 和 H100 GPU，同时需要确保使用 CUDA>=11.2、cuDNN>=8.1.1 等软件环境。更多关于 CUDA Compute Capability 和精度支持情况请参考 NVIDIA 文档：[GPU 硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。
+
+
+除此之外，也可通过以下代码快速调用模型并进行推理
+
+```python
+from paddlenlp.transformers import AutoModelForCausalLM
+from paddlenlp.transformers import AutoTokenizer
+from paddlenlp.generation import GenerationConfig
+from paddlenlp.trl import llm_utils
+
+model_id = "paddlenlp/PP-UIE-0.5B"
+
+model = AutoModelForCausalLM.from_pretrained(model_id, use_flash_attention=False)
+model.eval()
+tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
+generation_config = GenerationConfig.from_pretrained(model_id)
+
+
+template = """
+你是一个阅读理解专家，请提取所给句子与问题，提取实体。请注意，如果存在实体，则一定在原句中逐字出现，请输出对应实体的原文，不要进行额外修改；如果无法提取，请输出“无相应实体”。
+ **句子开始**
+ {sentence}
+ **句子结束**
+ **问题开始**
+ {prompt}
+ **问题结束**
+ **回答开始**
+ """
+
+sentences = [
+    "如有单位或个人对公示人员申请廉租住房保障资格有异议的，可以信件和电话的形式向市住建局举报，监督电话：5641079",
+    "姓名：张三，年龄：30岁，手机：13854488452，性别：男，家庭住址：北京市海淀区西北旺",
+    "张三,30岁,13854488452,男,北京市海淀区西北旺",
+]
+
+prompts = [
+    "电话号码",
+    "姓名，年龄，手机号码，性别，地址",
+    "姓名",
+]
+
+inputs = [template.format(sentence=sentence, prompt=prompt) for sentence, prompt in zip(sentences, prompts)]
+inputs = [tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in inputs]
+input_features = tokenizer(
+    inputs,
+    max_length=512,
+    return_position_ids=False,
+    truncation=True,
+    truncation_side="left",
+    padding=True,
+    return_tensors="pd",
+    add_special_tokens=False,
+)
+
+outputs = model.generate(
+    **input_features,
+    max_new_tokens=200,
+    bos_token_id=tokenizer.bos_token_id,
+    eos_token_id=llm_utils.get_eos_token_id(tokenizer, generation_config),
+    pad_token_id=tokenizer.pad_token_id,
+    decode_strategy="greedy_search",
+    temperature=1.0,
+    top_k=1,
+    top_p=1.0,
+    repetition_penalty=1.0,
+)
+
+
+def get_clean_entity(text):
+    ind1 = text.find("\n**回答结束**\n\n")
+    if ind1 != -1:
+        pred = text[:ind1]
+    else:
+        pred = text
+    return pred
+
+
+results = tokenizer.batch_decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
+results = [get_clean_entity(result) for result in results]
+
+for sentence, prompt, result in zip(sentences, prompts, results):
+    print("-" * 50)
+    print(f"Sentence: {sentence}")
+    print(f"Prompt: {prompt}")
+    print(f"Result: {result}")
+```
+
+<a name="训练定制"></a>
+
+## 3. 训练定制
+
+对于简单的抽取目标可以直接使用 ```paddlenlp.Taskflow```实现零样本（zero-shot）抽取，对于细分场景我们推荐使用轻定制功能（标注少量数据进行模型微调）以进一步提升效果。下面通过`报销工单信息抽取`的例子展示如何通过几十条训练数据进行 PP-UIE 模型微调。
+
+<a name="代码结构"></a>
+
+#### 3.1 代码结构
+
+```shell
+.
+├── utils.py          # 数据处理工具
+├── doccano.py        # 数据标注脚本
+├── doccano.md        # 数据标注文档
+└── README.md
+```
+
+<a name="数据标注"></a>
+
+#### 3.2 数据标注
+
+我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano) 进行数据标注，本示例也打通了从标注到训练的通道，即 doccano 导出数据后可通过[doccano.py](./doccano.py)脚本轻松将数据转换为输入模型时需要的形式，实现无缝衔接。标注方法的详细介绍请参考[doccano 数据标注指南](doccano.md)。
+
+原始数据示例：
+
+```text
+深大到双龙28块钱4月24号交通费
+```
+
+抽取的目标(schema)为：
+
+```python
+schema = ['出发地', '目的地', '费用', '时间']
+```
+
+标注步骤如下：
+
+- 在 doccano 平台上，创建一个类型为``序列标注``的标注项目。
+- 定义实体标签类别，上例中需要定义的实体标签有``出发地``、``目的地``、``费用``和``时间``。
+- 使用以上定义的标签开始标注数据，下面展示了一个 doccano 标注示例：
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167336891-afef1ad5-8777-456d-805b-9c65d9014b80.png height=100 hspace='10'/>
+</div>
+
+- 标注完成后，在 doccano 平台上导出文件，并将其重命名为``doccano_ext.json``后，放入``./data``目录下。
+
+- 这里我们提供预先标注好的文件[doccano_ext.json](https://bj.bcebos.com/paddlenlp/datasets/uie/doccano_ext.json)，可直接下载并放入`./data`目录。执行以下脚本进行数据转换，执行后会在`./data`目录下生成训练/验证/测试集文件。
+
+```shell
+python doccano.py \
+    --doccano_file ./data/doccano_ext.json \
+    --save_dir ./data \
+    --splits 0.8 0.2 0 \
+    --schema_lang ch
+```
+
+
+可配置参数说明：
+
+- ``doccano_file``: 从 doccano 导出的数据标注文件。
+- ``save_dir``: 训练数据的保存目录，默认存储在``data``目录下。
+- ``negative_ratio``: 最大负例比例，该参数只对抽取类型任务有效，适当构造负例可提升模型效果。负例数量和实际的标签数量有关，最大负例数量 = negative_ratio * 正例数量。
+- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。
+- ``task_type``: 选择任务类型，目前只有信息抽取这一种任务。
+- ``is_shuffle``: 是否对数据集进行随机打散，默认为 False。
+- ``seed``: 随机种子，默认为1000.
+- ``schema_lang``: 选择 schema 的语言，可选有`ch`和`en`。默认为`ch`，英文数据集请选择`en`。
+
+备注：
+- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分为 train/dev/test 数据集
+- 每次执行 [doccano.py](./doccano.py) 脚本，将会覆盖已有的同名数据文件
+- 在模型训练阶段我们推荐构造一些负例以提升模型效果，在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例；负样本数量 = negative_ratio * 正样本数量。
+- 对于从 doccano 导出的文件，默认文件中的每条数据都是经过人工正确标注的。
+
+
+<a name="模型微调"></a>
+
+#### 3.3 模型微调
+
+推荐使用 [大模型精调](../../docs/finetune.md) 对模型进行微调。只需输入模型、数据集等就可以高效快速地进行微调和模型压缩等任务，可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能，并且针对训练过程的通用训练配置做了封装，比如：优化器、学习率调度等。
+
+使用下面的命令，使用 `paddlenlp/PP-UIE-0.5B` 作为预训练模型进行模型微调，将微调后的模型保存至指定路径中。
+
+如果在 GPU 环境中使用，可以指定 gpus 参数进行多卡训练：
+
+```shell
+cd ../../
+# 返回llm目录
+python -u  -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/qwen/sft_argument.json
+```
+
+`sft_argument.json` 的参考配置如下：
+```shell
+{
+    "model_name_or_path": "paddlenlp/PP-UIE-0.5B",
+    "dataset_name_or_path": "./application/information_extraction/data",
+    "output_dir": "./checkpoints/ie_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 1,
+    "per_device_eval_batch_size": 1,
+    "eval_accumulation_steps":8,
+    "num_train_epochs": 3,
+    "learning_rate": 3e-05,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "eval_with_do_generation": false,
+    "metric_for_best_model": "accuracy",
+    "recompute": false,
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "sharding": "stage2",
+    "zero_padding": false,
+    "unified_checkpoint": true,
+    "use_flash_attention": false
+  }
+```
+更多 sft_argument.json 配置文件说明，请参考[大模型精调](../../docs/finetune.md)
+
+
+<a name="定制模型一键预测"></a>
+
+#### 3.4 定制模型一键预测
+
+1. 使用 PaddleNLP的高性能 predictor进行快速推理
+- 内置全环节融合算子策略
+- 支持 Weight Only INT8及 INT4推理，支持权重、激活、Cache KV 进行 INT8、FP8量化的推理
+- 支持动态图推理和静态图推理两种方式
+
+```shell
+# llm目录下
+python predict/predictor.py \
+    --model_name_or_path ./checkpoints/ie_ckpts \
+    --dtype float16 \
+    --data_file ./application/information_extraction/data/test.json \
+    --output_file ./output.json \
+    --src_length  512 \
+    --max_length  20 \
+    --batch_size  4 \
+```
+更多关于 `predictor.py` 的配置参数说明，请参考[大模型推理教程](../../docs/predict/inference.md)
+
+2. 使用 taskflow进行快速推理
+`paddlenlp.Taskflow`支持装载定制模型，通过`task_path`指定模型权重文件的路径，路径下需要包含训练好的模型权重文件
+
+```python
+>>> from pprint import pprint
+>>> from paddlenlp import Taskflow
+
+>>> schema = ['出发地', '目的地', '费用', '时间']
+# 设定抽取目标和定制化模型权重路径
+>>> my_ie = Taskflow("information_extraction", schema=schema, model='paddlenlp/PP-UIE-0.5B',precision = "float16", task_path='./checkpoints/ie_ckpts')
+>>> pprint(my_ie("城市内交通费7月5日金额114广州至佛山"))
+[{'出发地': [{'text': '广州'}],
+  '时间': [{'text': '7月5日'}],
+  '目的地': [{'text': '佛山'}],
+  '费用': [{'text': '114'}]}]
+```
+
+
+
+<a name="实验指标"></a>
+
+#### 3.5 实验指标
+
+我们在通用测试集和医疗、新闻、对话与金融等垂类测试集上进行了实验：
+
+<!-- <table>
+<tr><th row_span='2'><th colspan='2'>金融<th colspan='2'>医疗<th colspan='2'>互联网
+<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot
+<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86
+<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68
+<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38
+<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22
+<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35
+<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td><b>78.49</b><td><b>83.02</b>
+<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13
+</table> -->
+
+<table>
+<tr><td>模型名称</td><td>数据集名称</td><td>CMeEE-V2</td><td>Boson</td><td>CLUENER</td><td>CCIR2021-NER</td><td>任务对话2018-NER</td><td>银行借贷2021-NER</td><td>SKE2019</td><td>Avg</td></tr>
+<tr><td></td><td>数据集领域</td><td>医疗领域</td><td>通用领域</td><td>通用领域</td><td>新闻领域</td><td>对话领域</td><td>金融领域</td><td>金融领域</td><td></td></tr>
+<tr><td>PP-UIE-0.5B</td><td>F1(0-shot)</td><td>0.479</td><td>0.638</td><td>0.593</td><td>0.773</td><td>0.723</td><td>0.361</td><td>0.782</td><td>0.621</td></tr>
+<tr><td>PP-UIE-1.5B</td><td>F1(0-shot)</td><td>0.485</td><td>0.688</td><td>0.61</td><td>0.799</td><td>0.768</td><td>0.444</td><td>0.803</td><td>0.657</td></tr>
+<tr><td></td><td>F1(5-shot)</td><td>0.52</td><td>0.694</td><td>0.625</td><td>0.812</td><td>0.812</td><td>0.466</td><td>0.801</td><td>0.676</td></tr>
+<tr><td>PP-UIE-7B</td><td>F1(0-shot)</td><td>0.521</td><td>0.696</td><td>0.615</td><td>0.826</td><td>0.807</td><td>0.434</td><td>0.812</td><td>0.673</td></tr>
+<tr><td></td><td>F1(5-shot)</td><td>0.527</td><td>0.705</td><td>0.626</td><td>0.826</td><td>0.861</td><td>0.483</td><td>0.801</td><td>0.69</td></tr>
+<tr><td>PP-UIE-14B</td><td>F1(0-shot)</td><td>0.556</td><td>0.712</td><td>0.637</td><td>0.841</td><td>0.843</td><td>0.488</td><td>0.832</td><td>0.701</td></tr>
+<tr><td></td><td>F1(5-shot)</td><td>0.588</td><td>0.729</td><td>0.67</td><td>0.837</td><td>0.865</td><td>0.576</td><td>0.832</td><td>0.728</td></tr>
+</table>
+
+
+0-shot 表示无训练数据直接通过模型进行预测，5-shot 表示预测时使用五个数据样例作为提示。**实验表明 PP-UIE 在垂类场景可以通过少量数据（few-shot）进一步提升效果**。
\ No newline at end of file
diff --git a/llm/application/information_extraction/doccano.md b/llm/application/information_extraction/doccano.md
new file mode 100644
index 000000000000..eaa3f0a086ff
--- /dev/null
+++ b/llm/application/information_extraction/doccano.md
@@ -0,0 +1,260 @@
+# doccano
+
+ **目录**
+
+* [1. 安装](#安装)
+* [2. 项目创建](#项目创建)
+* [3. 数据上传](#数据上传)
+* [4. 标签构建](#标签构建)
+* [5. 任务标注](#任务标注)
+* [6. 数据导出](#数据导出)
+* [7. 数据转换](#数据转换)
+
+<a name="安装"></a>
+
+## 1. 安装
+
+参考[doccano 官方文档](https://github.com/doccano/doccano) 完成 doccano 的安装与初始配置。
+
+**以下标注示例用到的环境配置：**
+
+- doccano 1.6.2
+
+<a name="项目创建"></a>
+
+## 2. 项目创建
+
+PP-UIE 支持抽取类型的任务，根据实际需要创建一个新的项目：
+
+#### 2.1 抽取式任务项目创建
+
+创建项目时选择**序列标注**任务，并勾选**Allow overlapping entity**及**Use relation Labeling**。适配**命名实体识别、关系抽取、事件抽取、评价观点抽取**等任务。
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167249142-44885510-51dc-4359-8054-9c89c9633700.png height=230 hspace='15'/>
+</div>
+
+<a name="数据上传"></a>
+
+## 3. 数据上传
+
+上传的文件为 txt 格式，每一行为一条待标注文本，示例:
+
+```text
+2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌
+第十四届全运会在西安举办
+```
+
+上传数据类型**选择 TextLine**:
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167247061-d5795c26-7a6f-4cdb-88ad-107a3cae5446.png height=300 hspace='15'/>
+</div>
+
+**NOTE**：doccano 支持`TextFile`、`TextLine`、`JSONL`和`CoNLL`四种数据上传格式，PP-UIE 定制训练中**统一使用 TextLine**这一文件格式，即上传的文件需要为 txt 格式，且在数据标注时，该文件的每一行待标注文本显示为一页内容。
+
+<a name="标签构建"></a>
+
+## 4. 标签构建
+
+#### 4.1 构建抽取式任务标签
+
+抽取式任务包含**Span**与**Relation**两种标签类型，Span 指**原文本中的目标信息片段**，如实体识别中某个类型的实体，事件抽取中的触发词和论元；Relation 指**原文本中 Span 之间的关系**，如关系抽取中两个实体（Subject&Object）之间的关系，事件抽取中论元和触发词之间的关系。
+
+Span 类型标签构建示例:
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167248034-afa3f637-65c5-4038-ada0-344ffbd776a2.png height=300 hspace='15'/>
+</div>
+
+Relation 类型标签构建示例：
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167248307-916c77f6-bf80-4d6b-aa71-30c719f68257.png height=260 hspace='16'/>
+</div>
+
+
+## 5. 任务标注
+
+#### 5.1 命名实体识别
+
+命名实体识别（Named Entity Recognition，简称 NER），是指识别文本中具有特定意义的实体。在开放域信息抽取中，**抽取的类别没有限制，用户可以自己定义**。
+
+标注示例：
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167248557-f1da3694-1063-465a-be9a-1bb811949530.png height=200 hspace='20'/>
+</div>
+
+示例中定义了`时间`、`选手`、`赛事名称`和`得分`四种 Span 类型标签。
+
+```text
+schema = [
+    '时间',
+    '选手',
+    '赛事名称',
+    '得分'
+]
+```
+
+#### 5.2 关系抽取
+
+关系抽取（Relation Extraction，简称 RE），是指从文本中识别实体并抽取实体之间的语义关系，即抽取三元组（实体一，关系类型，实体二）。
+
+标注示例：
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167248502-16a87902-3878-4432-b5b8-9808bd8d4de5.png height=200 hspace='20'/>
+</div>
+
+示例中定义了`作品名`、`人物名`和`时间`三种 Span 类型标签，以及`歌手`、`发行时间`和`所属专辑`三种 Relation 标签。Relation 标签**由 Subject 对应实体指向 Object 对应实体**。
+
+该标注示例对应的 schema 为：
+
+```text
+schema = {
+    '作品名': [
+        '歌手',
+        '发行时间',
+        '所属专辑'
+    ]
+}
+```
+
+#### 5.3 事件抽取
+
+事件抽取 (Event Extraction, 简称 EE)，是指从自然语言文本中抽取事件并识别事件类型和事件论元的技术。UIE 所包含的事件抽取任务，是指根据已知事件类型，抽取该事件所包含的事件论元。
+
+标注示例：
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/167248793-138a1e37-43c9-4933-bf89-f3ac7228bf9c.png height=200 hspace='20'/>
+</div>
+
+示例中定义了`地震触发词`（触发词）、`等级`（事件论元）和`时间`（事件论元）三种 Span 标签，以及`时间`和`震级`两种 Relation 标签。触发词标签**统一格式为`XX 触发词`**，`XX`表示具体事件类型，上例中的事件类型是`地震`，则对应触发词为`地震触发词`。Relation 标签**由触发词指向对应的事件论元**。
+
+该标注示例对应的 schema 为：
+
+```text
+schema = {
+    '地震触发词': [
+        '时间',
+        '震级'
+    ]
+}
+```
+
+
+<a name="数据导出"></a>
+
+## 6. 数据导出
+
+#### 6.1 导出抽取式任务数据
+
+选择导出的文件类型为``JSONL(relation)``，导出数据示例：
+
+```text
+{
+    "id": 38,
+    "text": "百科名片你知道我要什么，是歌手高明骏演唱的一首歌曲，1989年发行，收录于个人专辑《丛林男孩》中",
+    "relations": [
+        {
+            "id": 20,
+            "from_id": 51,
+            "to_id": 53,
+            "type": "歌手"
+        },
+        {
+            "id": 21,
+            "from_id": 51,
+            "to_id": 55,
+            "type": "发行时间"
+        },
+        {
+            "id": 22,
+            "from_id": 51,
+            "to_id": 54,
+            "type": "所属专辑"
+        }
+    ],
+    "entities": [
+        {
+            "id": 51,
+            "start_offset": 4,
+            "end_offset": 11,
+            "label": "作品名"
+        },
+        {
+            "id": 53,
+            "start_offset": 15,
+            "end_offset": 18,
+            "label": "人物名"
+        },
+        {
+            "id": 54,
+            "start_offset": 42,
+            "end_offset": 46,
+            "label": "作品名"
+        },
+        {
+            "id": 55,
+            "start_offset": 26,
+            "end_offset": 31,
+            "label": "时间"
+        }
+    ]
+}
+```
+
+标注数据保存在同一个文本文件中，每条样例占一行且存储为``json``格式，其包含以下字段
+- ``id``: 样本在数据集中的唯一标识 ID。
+- ``text``: 原始文本数据。
+- ``entities``: 数据中包含的 Span 标签，每个 Span 标签包含四个字段：
+    - ``id``: Span 在数据集中的唯一标识 ID。
+    - ``start_offset``: Span 的起始 token 在文本中的下标。
+    - ``end_offset``: Span 的结束 token 在文本中下标的下一个位置。
+    - ``label``: Span 类型。
+- ``relations``: 数据中包含的 Relation 标签，每个 Relation 标签包含四个字段：
+    - ``id``: (Span1, Relation, Span2)三元组在数据集中的唯一标识 ID，不同样本中的相同三元组对应同一个 ID。
+    - ``from_id``: Span1对应的标识 ID。
+    - ``to_id``: Span2对应的标识 ID。
+    - ``type``: Relation 类型。
+
+
+<a name="数据转换"></a>
+
+## 7.数据转换
+
+该章节详细说明如何通过`doccano.py`脚本对 doccano 平台导出的标注数据进行转换，一键生成训练/验证/测试集。
+
+#### 7.1 抽取式任务数据转换
+
+- 当标注完成后，在 doccano 平台上导出 `JSONL(relation)` 形式的文件，并将其重命名为 `doccano_ext.json` 后，放入 `./data` 目录下。
+- 通过 [doccano.py](./doccano.py) 脚本进行数据形式转换，然后便可以开始进行相应模型训练。
+
+```shell
+python doccano.py \
+    --doccano_file ./data/doccano_ext.json \
+    --save_dir ./data \
+    --negative_ratio 5
+```
+
+可配置参数说明：
+
+- ``doccano_file``: 从 doccano 导出的数据标注文件。
+- ``save_dir``: 训练数据的保存目录，默认存储在``data``目录下。
+- ``negative_ratio``: 最大负例比例，该参数只对抽取类型任务有效，适当构造负例可提升模型效果。负例数量和实际的标签数量有关，最大负例数量 = negative_ratio * 正例数量。该参数只对训练集有效，默认为5。为了保证评估指标的准确性，验证集和测试集默认构造全正例。
+- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。
+- ``task_type``: 选择任务类型，目前只有信息抽取这一种任务。
+- ``is_shuffle``: 是否对数据集进行随机打散，默认为 True。
+- ``seed``: 随机种子，默认为1000.
+- ``schema_lang``: 选择 schema 的语言，可选有`ch`和`en`。默认为`ch`，英文数据集请选择`en`。
+
+备注：
+- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分为 train/dev/test 数据集
+- 每次执行 [doccano.py](./doccano.py) 脚本，将会覆盖已有的同名数据文件
+- 在模型训练阶段我们推荐构造一些负例以提升模型效果，在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例；负样本数量 = negative_ratio * 正样本数量。
+- 对于从 doccano 导出的文件，默认文件中的每条数据都是经过人工正确标注的。
+
+## References
+- **[doccano](https://github.com/doccano/doccano)**
diff --git a/llm/application/information_extraction/doccano.py b/llm/application/information_extraction/doccano.py
new file mode 100644
index 000000000000..8f0ff50988b6
--- /dev/null
+++ b/llm/application/information_extraction/doccano.py
@@ -0,0 +1,146 @@
+# coding=utf-8
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import json
+import os
+import time
+from decimal import Decimal
+
+import numpy as np
+from utils import convert_llm_examples, set_seed
+
+from paddlenlp.trainer.argparser import strtobool
+from paddlenlp.utils.log import logger
+
+
+def do_convert():
+    set_seed(args.seed)
+
+    tic_time = time.time()
+    if not os.path.exists(args.doccano_file):
+        raise ValueError("Please input the correct path of doccano file.")
+
+    if not os.path.exists(args.save_dir):
+        os.makedirs(args.save_dir)
+
+    if len(args.splits) != 0 and len(args.splits) != 3:
+        raise ValueError("Only []/ len(splits)==3 accepted for splits.")
+
+    def _check_sum(splits):
+        return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1")
+
+    if len(args.splits) == 3 and not _check_sum(args.splits):
+        raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.")
+
+    with open(args.doccano_file, "r", encoding="utf-8") as f:
+        raw_examples = f.readlines()
+
+    def _create_llm_examples(
+        examples,
+        negative_ratio,
+        shuffle=False,
+        is_train=True,
+        schema_lang="ch",
+    ):
+        entities, relations = convert_llm_examples(examples, negative_ratio, is_train, schema_lang)
+        examples = entities + relations
+        if shuffle:
+            indexes = np.random.permutation(len(examples))
+            examples = [examples[i] for i in indexes]
+        return examples
+
+    def _save_examples(save_dir, file_name, examples):
+        count = 0
+        save_path = os.path.join(save_dir, file_name)
+        with open(save_path, "w", encoding="utf-8") as f:
+            for example in examples:
+                f.write(json.dumps(example, ensure_ascii=False) + "\n")
+                count += 1
+        logger.info("Save %d examples to %s." % (count, save_path))
+
+    if len(args.splits) == 0:
+        examples = _create_llm_examples(
+            raw_examples,
+            args.negative_ratio,
+            args.is_shuffle,
+            schema_lang=args.schema_lang,
+        )
+
+        _save_examples(args.save_dir, "train.json", examples)
+
+    else:
+        if args.is_shuffle:
+            indexes = np.random.permutation(len(raw_examples))
+            index_list = indexes.tolist()
+            raw_examples = [raw_examples[i] for i in indexes]
+        else:
+            index_list = list(range(len(raw_examples)))
+
+        i1, i2, _ = args.splits
+        p1 = int(len(raw_examples) * i1)
+        p2 = int(len(raw_examples) * (i1 + i2))
+
+        train_ids = index_list[:p1]
+        dev_ids = index_list[p1:p2]
+        test_ids = index_list[p2:]
+
+        with open(os.path.join(args.save_dir, "sample_index.json"), "w") as fp:
+            maps = {"train_ids": train_ids, "dev_ids": dev_ids, "test_ids": test_ids}
+            fp.write(json.dumps(maps))
+
+        train_examples = _create_llm_examples(
+            raw_examples[:p1],
+            args.negative_ratio,
+            args.is_shuffle,
+            schema_lang=args.schema_lang,
+        )
+        dev_examples = _create_llm_examples(
+            raw_examples[p1:p2],
+            -1,
+            is_train=False,
+            schema_lang=args.schema_lang,
+        )
+        test_examples = _create_llm_examples(
+            raw_examples[p2:],
+            -1,
+            is_train=False,
+            schema_lang=args.schema_lang,
+        )
+
+        _save_examples(args.save_dir, "train.json", train_examples)
+        _save_examples(args.save_dir, "dev.json", dev_examples)
+        _save_examples(args.save_dir, "test.json", test_examples)
+
+    logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time))
+
+
+if __name__ == "__main__":
+    # yapf: disable
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("--doccano_file", default="./data/doccano_ext.json", type=str, help="The doccano file exported from doccano platform.")
+    parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.")
+    parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task, the ratio of positive and negative samples, number of negtive samples = negative_ratio * number of positive samples")
+    parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.")
+    parser.add_argument("--task_type", choices="ie", default="ie", type=str, help="Select task type, ie for the information extraction task used qwen2, defaults to ie.")
+    parser.add_argument("--is_shuffle", default="False", type=strtobool, help="Whether to shuffle the labeled dataset, defaults to True.")
+    parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization")
+    parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.")
+
+    args = parser.parse_args()
+    # yapf: enable
+
+    do_convert()
diff --git a/llm/application/information_extraction/utils.py b/llm/application/information_extraction/utils.py
new file mode 100644
index 000000000000..be4cde905a41
--- /dev/null
+++ b/llm/application/information_extraction/utils.py
@@ -0,0 +1,348 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import math
+import random
+
+import numpy as np
+import paddle
+from tqdm import tqdm
+
+from paddlenlp.utils.log import logger
+
+prompt_format = """你是一个阅读理解专家，请提取所给句子与问题，提取实体。请注意，如果存在实体，则一定在原句中逐字出现，请输出对应实体的原文，不要进行额外修改；如果无法提取，请输出“无相应实体”。
+**句子开始**
+{sentence}
+**句子结束**
+**问题开始**
+{prompt}
+**问题结束**
+**回答开始**
+"""
+
+
+def set_seed(seed):
+    paddle.seed(seed)
+    random.seed(seed)
+    np.random.seed(seed)
+
+
+def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None):
+    """
+    Create dataloader.
+    Args:
+        dataset(obj:`paddle.io.Dataset`): Dataset instance.
+        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
+        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
+        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
+    Returns:
+        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
+    """
+    if trans_fn:
+        dataset = dataset.map(trans_fn)
+
+    shuffle = True if mode == "train" else False
+    if mode == "train":
+        sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
+    else:
+        sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
+    dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True)
+    return dataloader
+
+
+def add_entity_negative_example(examples, texts, prompts, label_set, negative_ratio):
+    negative_examples = []
+    positive_examples = []
+    with tqdm(total=len(prompts)) as pbar:
+        for i, prompt in enumerate(prompts):
+            redundants = list(set(label_set) ^ set(prompt))
+            redundants.sort()
+
+            num_positive = len(examples[i])
+            if num_positive != 0:
+                actual_ratio = math.ceil(len(redundants) / num_positive)
+            else:
+                # Set num_positive to 1 for text without positive example
+                num_positive, actual_ratio = 1, 0
+
+            if actual_ratio <= negative_ratio or negative_ratio == -1:
+                idxs = [k for k in range(len(redundants))]
+            else:
+                idxs = random.sample(range(0, len(redundants)), negative_ratio * num_positive)
+
+            for idx in idxs:
+                src = prompt_format.format_map({"sentence": texts[i], "prompt": redundants[idx]})
+                negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"}
+                # negative_result = {"content": texts[i], "result_list": [], "prompt": redundants[idx]}
+                negative_examples.append(negative_result)
+            positive_examples.extend(examples[i])
+            pbar.update(1)
+    return positive_examples, negative_examples
+
+
+def add_relation_negative_example(redundants, text, num_positive, ratio):
+    added_example = []
+    rest_example = []
+
+    if num_positive != 0:
+        actual_ratio = math.ceil(len(redundants) / num_positive)
+    else:
+        # Set num_positive to 1 for text without positive example
+        num_positive, actual_ratio = 1, 0
+
+    all_idxs = [k for k in range(len(redundants))]
+    if actual_ratio <= ratio or ratio == -1:
+        idxs = all_idxs
+        rest_idxs = []
+    else:
+        idxs = random.sample(range(0, len(redundants)), ratio * num_positive)
+        rest_idxs = list(set(all_idxs) ^ set(idxs))
+
+    for idx in idxs:
+        src = prompt_format.format_map({"sentence": text, "prompt": redundants[idx]})
+        negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"}
+        added_example.append(negative_result)
+
+    for rest_idx in rest_idxs:
+        src = prompt_format.format_map({"sentence": text, "prompt": redundants[idx]})
+        negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"}
+        rest_example.append(negative_result)
+
+    return added_example, rest_example
+
+
+def add_full_negative_example(examples, texts, relation_prompts, predicate_set, subject_goldens, schema_lang="ch"):
+    with tqdm(total=len(relation_prompts)) as pbar:
+        for i, relation_prompt in enumerate(relation_prompts):
+            negative_sample = []
+            for subject in subject_goldens[i]:
+                for predicate in predicate_set:
+                    # The relation prompt is constructed as follows:
+                    # subject + "的" + predicate -> Chinese
+                    # predicate + " of " + subject -> English
+                    if schema_lang == "ch":
+                        prompt = subject + "的" + predicate
+                    else:
+                        prompt = predicate + " of " + subject
+                    if prompt not in relation_prompt:
+                        src = prompt_format.format_map({"sentence": texts[i], "prompt": prompt})
+                        negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"}
+                        negative_sample.append(negative_result)
+            examples[i].extend(negative_sample)
+            pbar.update(1)
+    return examples
+
+
+def convert_llm_examples(
+    raw_examples,
+    negative_ratio,
+    is_train=True,
+    schema_lang="ch",
+):
+    """
+    Convert labeled data export from doccano for extraction and aspect-level classification task.
+    """
+
+    texts = []
+    entity_examples = []
+    relation_examples = []
+    entity_prompts = []
+    relation_prompts = []
+    entity_label_set = []
+    entity_name_set = []
+    predicate_set = []
+    subject_goldens = []
+    inverse_relation_list = []
+    predicate_list = []
+
+    logger.info("Converting doccano data...")
+    with tqdm(total=len(raw_examples)) as pbar:
+        for line in raw_examples:
+            items = json.loads(line)
+            # Export file in JSONL format which doccano >= 1.7.0
+            # Export file in JSONL (relation) format
+            # e.g. {"text": "", "relations": [ {"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}, ... ], "entities": [ {"id": 0, "from_id": 0, "to_id": 1, "type": "foundedAt"}, ... ]}
+            text, relations, entities = items["text"], items["relations"], items["entities"]
+            texts.append(text)
+            entity_example = []
+            entity_prompt = []
+            entity_example_map = {}
+            entity_map = {}  # id to entity name
+            for entity in entities:
+                entity_name = text[entity["start_offset"] : entity["end_offset"]]
+                entity_label = entity["label"]
+                entity_map[entity["id"]] = {
+                    "name": entity_name,
+                    "start": entity["start_offset"],
+                    "end": entity["end_offset"],
+                }
+
+                src = prompt_format.format_map({"sentence": text, "prompt": entity_label})
+
+                if entity_label not in entity_example_map.keys():
+                    entity_example_map[entity_label] = {"src": src, "tgt": [entity_name]}
+                else:
+                    entity_example_map[entity_label]["tgt"].append(entity_name)
+
+                if entity_label not in entity_label_set:
+                    entity_label_set.append(entity_label)
+                if entity_name not in entity_name_set:
+                    entity_name_set.append(entity_name)
+                entity_prompt.append(entity_label)
+
+            for label, v in entity_example_map.items():
+                v["tgt"] = ",".join(v["tgt"]) + "\n**回答结束**\n\n"
+                entity_example.append(v)
+            entity_examples.append(entity_example)
+            entity_prompts.append(entity_prompt)
+
+            subject_golden = []  # Golden entity inputs
+            relation_example = []
+            relation_prompt = []
+            relation_example_map = {}
+            inverse_relation = []
+            predicates = []
+            for relation in relations:
+                predicate = relation["type"]
+                subject_id = relation["from_id"]
+                object_id = relation["to_id"]
+                # The relation prompt is constructed as follows:
+                # subject + "的" + predicate -> Chinese
+                # predicate + " of " + subject -> English
+                if schema_lang == "ch":
+                    prompt = entity_map[subject_id]["name"] + "的" + predicate
+                    inverse_negative = entity_map[object_id]["name"] + "的" + predicate
+                else:
+                    prompt = predicate + " of " + entity_map[subject_id]["name"]
+                    inverse_negative = predicate + " of " + entity_map[object_id]["name"]
+
+                if entity_map[subject_id]["name"] not in subject_golden:
+                    subject_golden.append(entity_map[subject_id]["name"])
+
+                src = prompt_format.format_map({"sentence": text, "prompt": prompt})
+
+                inverse_relation.append(inverse_negative)
+                predicates.append(predicate)
+
+                if prompt not in relation_example_map.keys():
+                    relation_example_map[prompt] = {"src": src, "tgt": [entity_map[object_id]["name"]]}
+                else:
+                    relation_example_map[prompt]["tgt"].append(entity_map[object_id]["name"])
+
+                if predicate not in predicate_set:
+                    predicate_set.append(predicate)
+                relation_prompt.append(prompt)
+
+            for v in relation_example_map.values():
+                v["tgt"] = ",".join(v["tgt"]) + "\n**回答结束**\n\n"
+                relation_example.append(v)
+
+            relation_examples.append(relation_example)
+            relation_prompts.append(relation_prompt)
+            subject_goldens.append(subject_golden)
+            inverse_relation_list.append(inverse_relation)
+            predicate_list.append(predicates)
+            pbar.update(1)
+
+    logger.info("Adding negative samples for first stage prompt...")
+    positive_examples, negative_examples = add_entity_negative_example(
+        entity_examples, texts, entity_prompts, entity_label_set, negative_ratio
+    )
+    if len(positive_examples) == 0:
+        all_entity_examples = []
+    else:
+        all_entity_examples = positive_examples + negative_examples
+
+    all_relation_examples = []
+    if len(predicate_set) != 0:
+        logger.info("Adding negative samples for second stage prompt...")
+        if is_train:
+
+            positive_examples = []
+            negative_examples = []
+            per_n_ratio = negative_ratio // 3
+
+            with tqdm(total=len(texts)) as pbar:
+                for i, text in enumerate(texts):
+                    negative_example = []
+                    collects = []
+                    num_positive = len(relation_examples[i])
+
+                    # 1. inverse_relation_list
+                    redundants1 = inverse_relation_list[i]
+                    # 2. entity_name_set ^ subject_goldens[i]
+                    redundants2 = []
+                    if len(predicate_list[i]) != 0:
+                        nonentity_list = list(set(entity_name_set) ^ set(subject_goldens[i]))
+                        nonentity_list.sort()
+
+                        if schema_lang == "ch":
+                            redundants2 = [
+                                nonentity + "的" + predicate_list[i][random.randrange(len(predicate_list[i]))]
+                                for nonentity in nonentity_list
+                            ]
+                        else:
+                            redundants2 = [
+                                predicate_list[i][random.randrange(len(predicate_list[i]))] + " of " + nonentity
+                                for nonentity in nonentity_list
+                            ]
+                    # 3. entity_label_set ^ entity_prompts[i]
+                    redundants3 = []
+                    if len(subject_goldens[i]) != 0:
+                        non_ent_label_list = list(set(entity_label_set) ^ set(entity_prompts[i]))
+                        non_ent_label_list.sort()
+
+                        if schema_lang == "ch":
+                            redundants3 = [
+                                subject_goldens[i][random.randrange(len(subject_goldens[i]))] + "的" + non_ent_label
+                                for non_ent_label in non_ent_label_list
+                            ]
+                        else:
+                            redundants3 = [
+                                non_ent_label + " of " + subject_goldens[i][random.randrange(len(subject_goldens[i]))]
+                                for non_ent_label in non_ent_label_list
+                            ]
+                    redundants_list = [redundants1, redundants2, redundants3]
+
+                    for redundants in redundants_list:
+                        added, rest = add_relation_negative_example(
+                            redundants,
+                            texts[i],
+                            num_positive,
+                            per_n_ratio,
+                        )
+                        negative_example.extend(added)
+                        collects.extend(rest)
+
+                    num_sup = num_positive * negative_ratio - len(negative_example)
+                    if num_sup > 0 and collects:
+                        if num_sup > len(collects):
+                            idxs = [k for k in range(len(collects))]
+                        else:
+                            idxs = random.sample(range(0, len(collects)), num_sup)
+                        for idx in idxs:
+                            negative_example.append(collects[idx])
+
+                    positive_examples.extend(relation_examples[i])
+                    negative_examples.extend(negative_example)
+                    pbar.update(1)
+            all_relation_examples = positive_examples + negative_examples
+        else:
+            relation_examples = add_full_negative_example(
+                relation_examples, texts, relation_prompts, predicate_set, subject_goldens, schema_lang=schema_lang
+            )
+            all_relation_examples = [r for relation_example in relation_examples for r in relation_example]
+
+    return all_entity_examples, all_relation_examples
diff --git a/llm/auto_parallel/deepseek-v3/run_pretrain_auto.py b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.py
new file mode 100644
index 000000000000..91381cb1e05a
--- /dev/null
+++ b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.py
@@ -0,0 +1,725 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+deepseek-v3 auto parallel pretraining scripts.
+"""
+import os
+import random
+import sys
+import types
+from collections import OrderedDict
+from dataclasses import dataclass, field
+from typing import List, Optional
+
+import numpy as np
+import paddle
+import paddle.distributed as dist
+from paddle.distributed import fleet
+
+from paddlenlp.ops import Topology
+from paddlenlp.trainer import (
+    AutoTrainingArguments,
+    PdArgumentParser,
+    get_last_checkpoint,
+)
+from paddlenlp.trainer.auto_trainer import AutoTrainer
+from paddlenlp.trainer.trainer_utils import IntervalStrategy, _get_distributed_seeds
+from paddlenlp.transformers import (
+    AutoTokenizer,
+    CosineAnnealingWithWarmupDecay,
+    DeepseekV2Config,
+    DeepseekV2PretrainingCriterion,
+    DeepseekV3ForCausalLMAuto,
+    LinearAnnealingWithWarmupDecay,
+)
+from paddlenlp.utils.log import logger
+
+MODEL_CLASSES = {
+    "deepseekv3_auto": (DeepseekV2Config, DeepseekV3ForCausalLMAuto, DeepseekV2PretrainingCriterion),
+}
+
+
+from paddlenlp.data.causal_dataset import (
+    build_train_valid_test_datasets,
+    check_data_split,
+    print_rank_0,
+)
+from paddlenlp.trainer.utils.doc import add_start_docstrings
+
+
+@dataclass
+@add_start_docstrings(AutoTrainingArguments.__doc__)
+class PreTrainingArguments(AutoTrainingArguments):
+    min_learning_rate: float = field(
+        default=1e-5,
+        metadata={"help": "Minimum learning rate deacyed to."},
+    )
+    decay_steps: float = field(
+        default=None,
+        metadata={
+            "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate."
+        },
+    )
+    enable_linear_fused_grad_add: bool = field(
+        default=False,
+        metadata={
+            "help": "Enable fused linear grad add strategy, which will reduce elementwise add for grad accumulation in the backward of nn.Linear ."
+        },
+    )
+    job_schedule_profiler_start: int = field(
+        default=-1,
+        metadata={"help": "The step to start job_schedule_profiler."},
+    )
+    job_schedule_profiler_end: int = field(
+        default=-1,
+        metadata={"help": "The step to end job_schedule_profiler."},
+    )
+    pipeline_schedule_mode: str = field(
+        default="1F1B", metadata={"help": "The pipeline schedule mode, support FThenB, 1F1B, VPP and Eager-1F1B."}
+    )
+    sr: Optional[int] = field(default=0, metadata={"help": "The count of chunks without recompute."})
+    virtual_pipeline_seg_method: str = field(
+        default="DeepseekV2DecoderLayerAuto",
+        metadata={"help": "The seg method of spliting pp layer for virtual pipeline."},
+    )
+    # NOTE(gongenlei): new add autotuner_benchmark
+    autotuner_benchmark: bool = field(
+        default=False,
+        metadata={"help": "Weather to run benchmark by autotuner. True for from_scratch and pad_max_length."},
+    )
+
+    def __post_init__(self):
+        super().__post_init__()
+        assert self.enable_auto_parallel
+
+        # NOTE(gongenlei): new add autotuner_benchmark
+        if self.autotuner_benchmark:
+            self.max_steps = 5
+            self.do_train = True
+            self.do_export = False
+            self.do_predict = False
+            self.do_eval = False
+            self.overwrite_output_dir = True
+            self.load_best_model_at_end = False
+            self.report_to = []
+            self.save_strategy = IntervalStrategy.NO
+            self.evaluation_strategy = IntervalStrategy.NO
+
+        logger.info(self.strategy)
+
+
+@dataclass
+class DataArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and evaluating.
+    Using `PdArgumentParser` we can turn this class into argparse arguments to be able to
+    specify them on the command line.
+    """
+
+    input_dir: str = field(
+        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
+    )
+    split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."})
+
+    max_seq_length: int = field(
+        default=1024,
+        metadata={
+            "help": "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    share_folder: bool = field(
+        default=False,
+        metadata={"help": "Use share folder for data dir and output dir on multi machine."},
+    )
+
+    data_impl: str = field(default="mmap", metadata={"help": "The format of the preprocessed data."})
+    skip_warmup: bool = field(
+        default=True,
+        metadata={"help": "Whether to skip the warmup process of mmap files."},
+    )
+    data_cache: str = field(default=None, metadata={"help": "The path of the cached dataset."})
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to pre-train from.
+    """
+
+    model_type: Optional[str] = field(
+        default="deepseekv3", metadata={"help": "Only support for llama pre-training for now."}
+    )
+    model_name_or_path: str = field(
+        default="deepseek-ai/DeepSeek-V3",
+        metadata={
+            "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html"
+        },
+    )
+    tokenizer_name_or_path: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    vocab_size: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": ".Vocabulary size of the deeepseekv2 model. Defines the number of different tokens that can be represented by the `inputs_ids`"
+        },
+    )
+    hidden_size: Optional[int] = field(default=None, metadata={"help": "Dimension of the hidden representations."})
+    intermediate_size: Optional[int] = field(default=None, metadata={"help": "Dimension of the MLP representations."})
+    num_hidden_layers: Optional[int] = field(
+        default=None, metadata={"help": "Number of hidden layers in the Transformer encoder."}
+    )
+    num_attention_heads: Optional[int] = field(
+        default=None,
+        metadata={"help": "Number of attention heads for each attention layer in the Transformer encoder."},
+    )
+    use_flash_attention: bool = field(
+        default=False,
+        metadata={"help": "use_flash_attention"},
+    )
+    use_fused_rms_norm: bool = field(
+        default=False,
+        metadata={"help": "deepseekv3, use_fused_rms_norm"},
+    )
+    fuse_attention_qkv: bool = field(
+        default=False,
+        metadata={"help": "whether to fuse attention qkv"},
+    )
+    fuse_attention_ffn: bool = field(
+        default=False,
+        metadata={"help": "whether to fuse first up and gate proj in mlp block"},
+    )
+    recompute_granularity: str = field(
+        default="full",
+        metadata={"help": "Choose among ['full', 'core_attn', 'full_attn']"},
+    )
+    virtual_pp_degree: int = field(
+        default=1,
+        metadata={"help": "virtual_pp_degree"},
+    )
+    continue_training: bool = field(
+        default=False,
+        metadata={
+            "help": "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models."
+        },
+    )
+    use_fused_rope: Optional[bool] = field(
+        default=False,
+        metadata={"help": "Enable rope fusion or not."},
+    )
+    no_recompute_layers: Optional[List[int]] = field(
+        default=None,
+        metadata={"help": "Specify the full transformer layers that should not be recomputed."},
+    )
+    pp_recompute_interval: int = field(
+        default=1,
+        metadata={
+            "help": "The interval for the number of layers at which recomputation occurs. A value of 0 indicates no recomputation. Default is 0."
+        },
+    )
+    recompute_use_reentrant: bool = field(
+        default=False,
+        metadata={"help": "recompute_use_reentrant"},
+    )
+
+
+def create_pretrained_dataset(
+    data_args,
+    training_args,
+    data_file,
+    tokenizer,
+    need_data=True,
+):
+
+    check_data_split(data_args.split, training_args.do_train, training_args.do_eval, training_args.do_predict)
+
+    train_val_test_num_samples = [
+        training_args.per_device_train_batch_size
+        * training_args.dataset_world_size
+        * training_args.max_steps
+        * training_args.gradient_accumulation_steps,
+        training_args.per_device_eval_batch_size
+        * training_args.dataset_world_size
+        * training_args.eval_iters
+        * (training_args.max_steps // training_args.eval_steps + 1),
+        training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters,
+    ]
+
+    print_rank_0(" > datasets target sizes (minimum size):")
+    if training_args.do_train:
+        print_rank_0("    train:      {}".format(train_val_test_num_samples[0]))
+    if training_args.do_eval:
+        print_rank_0("    validation: {}".format(train_val_test_num_samples[1]))
+    if training_args.do_predict:
+        print_rank_0("    test:       {}".format(train_val_test_num_samples[2]))
+
+    # Build the datasets.
+    train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets(
+        data_prefix=data_file,
+        data_impl=data_args.data_impl,
+        splits_string=data_args.split,
+        train_val_test_num_samples=train_val_test_num_samples,
+        seq_length=data_args.max_seq_length,
+        seed=training_args.seed,
+        skip_warmup=data_args.skip_warmup,
+        share_folder=data_args.share_folder,
+        data_cache_path=data_args.data_cache,
+        need_data=need_data,
+    )
+
+    def print_dataset(data, mode="train"):
+        logger.info(f"Sample data for {mode} mode.")
+        # input_ids, loss_mask, attention_mask, position_ids, labels = data
+        input_ids = data["text"]
+
+        logger.info(tokenizer._decode(input_ids))
+
+    from paddlenlp.data import Stack
+
+    def _collate_data(data, stack_fn=Stack()):
+        tokens_ = stack_fn([x["text"] for x in data])
+
+        labels = tokens_[:, 1:]
+        tokens = tokens_[:, :-1]
+
+        return {
+            "input_ids": tokens,
+            "labels": labels,
+        }
+
+    if need_data:
+        if training_args.do_train:
+            print_dataset(train_dataset[0], "train")
+        if training_args.do_eval:
+            print_dataset(valid_dataset[0], "valid")
+        if training_args.do_predict:
+            print_dataset(test_dataset[0], "test")
+
+    return train_dataset, valid_dataset, test_dataset, _collate_data
+
+
+def get_train_data_file(args):
+    if len(args.input_dir.split()) > 1:
+        # weight-1 data-prefix-1 weight-2 data-prefix-2 ...
+        return args.input_dir.split()
+    else:
+        files = [
+            os.path.join(args.input_dir, f)
+            for f in os.listdir(args.input_dir)
+            if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f)))
+        ]
+        files = [x.replace("_idx.npz", "") for x in files]
+        files = [x.replace(".idx", "") for x in files]  # add
+
+        if len(files) > 1:
+            ret = []
+            logger.info("You are using multi-dataset:")
+            for x in files:
+                ret.append(1.0)
+                ret.append(x)
+                logger.info("    > set weight of %s dataset to 1.0" % x)
+            return ret
+
+    return files
+
+
+class PretrainingTrainer(AutoTrainer):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.is_pretraining = True
+
+    def _wrap_for_dist_loader(self, train_dataloader):
+        dist_loader = super()._wrap_for_dist_loader(train_dataloader)
+        dist_loader._input_keys = ["input_ids", "labels"]
+        return dist_loader
+
+    def _get_train_sampler(self) -> Optional[paddle.io.Sampler]:
+        if self.train_dataset is None:
+            return None
+
+        total_batch_size_per_acc_step = self.args.per_device_train_batch_size * self.args.dataset_world_size
+        total_batch_size = total_batch_size_per_acc_step
+
+        # In llm/llama/run_pretrain.py, it uses paddlenlp.utils.batch_sampler.DistributedBatchSampler,
+        # which does no shuffle when shuffle is set True.
+        sampler = paddle.io.BatchSampler(
+            dataset=self.train_dataset,
+            shuffle=False,
+            batch_size=total_batch_size,
+            drop_last=self.args.dataloader_drop_last,
+        )
+        sampler._acc_steps = self.args.gradient_accumulation_steps
+        return sampler
+
+
+def print_config(args, key=""):
+    """
+    print config values
+    """
+    logger.info("=" * 60)
+    if args is None:
+        args = args
+        key = "Training"
+    import paddlenlp
+
+    logger.info("{:^40}".format("{} Configuration Arguments".format(key)))
+    logger.info("{:30}: {}".format("paddle commit id", paddle.version.commit))
+    logger.info("{:30}: {}".format("paddlenlp commit id", paddlenlp.version.commit))
+
+    for a in dir(args):
+        if a[:2] != "__":  # don't print double underscore methods
+            v = getattr(args, a)
+            if not isinstance(v, types.MethodType):
+                logger.info("{:30}: {}".format(a, v))
+
+    logger.info("")
+
+
+def init_seed(seed: int = 1234, args=None):
+    if args is None:
+        random.seed(seed)
+        np.random.seed(seed)
+        paddle.seed(seed)
+    else:
+        assert not args.use_hybrid_parallel and args.enable_auto_parallel
+        if dist.get_world_size() > 1:
+            if args.hybrid_parallel_topo_order is None or args.hybrid_parallel_topo_order == "pp_first":
+                order = ["pp", "dp", "sharding", "mp", "sep"]
+            elif args.hybrid_parallel_topo_order == "sharding_first":
+                order = ["dp", "sharding", "pp", "mp", "sep"]
+            topo = Topology(
+                dist.get_rank(),
+                dist.get_world_size(),
+                dp_degree=args.dataset_world_size,
+                pp_degree=args.pipeline_parallel_degree,
+                mp_degree=args.tensor_parallel_degree,
+                sharding_degree=1,  # auto_parallel's sharding is not orthogonal with dp, mp and pp
+                order=order,
+            )
+
+            global_seed, local_seed, random_seed = _get_distributed_seeds(args.seed, topo)
+
+            paddle.seed(local_seed)
+            random.seed(random_seed)
+            np.random.seed(random_seed)
+
+            logger.info(
+                "The global seed is set to {}, local seed is set to {} and "
+                "random seed is set to {}.".format(global_seed, local_seed, random_seed)
+            )
+        else:
+            random.seed(args.seed)
+            np.random.seed(args.seed)
+            paddle.seed(args.seed)
+
+
+def get_mesh(pp_idx=0):
+    mesh = fleet.auto.get_mesh()
+    if "pp" in mesh.dim_names:
+        mesh = mesh.get_mesh_with_dim("pp")[pp_idx]
+    return mesh
+
+
+def shard_fn(layer, mesh_idx, placements):
+    paran_name = layer.weight.name
+    layer.weight = dist.shard_tensor(layer.weight, get_mesh(mesh_idx), placements)
+    layer.weight.name = paran_name
+
+
+def main():
+    parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments))
+    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
+        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
+    else:
+        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    if training_args.enable_linear_fused_grad_add:
+        from fused_layers import mock_layers
+
+        mock_layers()
+
+    if model_args.tokenizer_name_or_path is None:
+        model_args.tokenizer_name_or_path = model_args.model_name_or_path
+
+    if data_args.data_cache is not None:
+        os.makedirs(data_args.data_cache, exist_ok=True)
+
+    init_seed(args=training_args)
+    paddle.set_device(training_args.device)
+    if paddle.distributed.get_world_size() > 1:
+        paddle.distributed.init_parallel_env()
+
+    training_args.eval_iters = 10
+    training_args.test_iters = training_args.eval_iters * 10
+
+    # Log model and data config
+    training_args.print_config(model_args, "Model")
+    training_args.print_config(data_args, "Data")
+
+    # Log on each process the small summary:
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, "
+        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}"
+    )
+
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )
+
+    config_class, model_class, criterion_class = MODEL_CLASSES[model_args.model_type]
+
+    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path)
+
+    config = config_class.from_pretrained(model_args.model_name_or_path)
+
+    config.seq_length = data_args.max_seq_length
+    # There are some technique extend RotaryEmbedding context. so don't change max_position_embeddings
+    if not model_args.continue_training:
+        config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length)
+
+    if not model_args.continue_training:
+        config.vocab_size = max(config.vocab_size, ((tokenizer.vocab_size - 1) // 128 + 1) * 128)
+        logger.info(f"Reset vocab size to {config.vocab_size} for batter amp peformance.")
+
+    if model_args.no_recompute_layers is not None:
+        model_args.no_recompute_layers.sort()
+
+    config.vocab_size = model_args.vocab_size if model_args.vocab_size is not None else config.vocab_size
+    config.hidden_size = model_args.hidden_size if model_args.hidden_size is not None else config.hidden_size
+    config.intermediate_size = (
+        model_args.intermediate_size if model_args.intermediate_size is not None else config.intermediate_size
+    )
+    config.num_hidden_layers = (
+        model_args.num_hidden_layers if model_args.num_hidden_layers is not None else config.num_hidden_layers
+    )
+    config.num_attention_heads = (
+        model_args.num_attention_heads if model_args.num_attention_heads is not None else config.num_attention_heads
+    )
+
+    config.use_flash_attention = model_args.use_flash_attention
+    config.use_fused_rms_norm = model_args.use_fused_rms_norm
+    config.fuse_attention_qkv = model_args.fuse_attention_qkv
+    config.fuse_attention_ffn = model_args.fuse_attention_ffn
+    config.recompute_granularity = model_args.recompute_granularity
+    config.virtual_pp_degree = model_args.virtual_pp_degree
+    config.sequence_parallel = training_args.sequence_parallel
+
+    config.fuse_sequence_parallel_allreduce = training_args.fuse_sequence_parallel_allreduce
+
+    config.use_fused_rope = model_args.use_fused_rope
+    config.no_recompute_layers = model_args.no_recompute_layers
+    config.pp_recompute_interval = model_args.pp_recompute_interval
+    config.recompute_use_reentrant = model_args.recompute_use_reentrant
+
+    config.use_recompute = training_args.recompute
+    config.tensor_parallel_degree = training_args.tensor_parallel_degree
+    config.tensor_parallel_rank = training_args.tensor_parallel_rank
+    config.sharding_parallel_degree = training_args.sharding_parallel_degree
+
+    if training_args.strategy.pipeline.enable and config.virtual_pp_degree > 1:
+        pipeline = training_args.strategy.pipeline
+        pipeline.vpp_degree = config.virtual_pp_degree
+        pipeline.vpp_seg_method = training_args.virtual_pipeline_seg_method
+
+    print("Final pre-training config:", config)
+
+    # # Set the dtype for loading model
+    # dtype = "float32"
+    # if training_args.fp16_opt_level == "O2":
+    #     if training_args.fp16:
+    #         dtype = "float16"
+    #     if training_args.bf16:
+    #         dtype = "bfloat16"
+
+    with paddle.LazyGuard():
+        model = model_class.from_config(config, dtype="float32")
+        criterion = criterion_class(config)
+
+    if training_args.recompute:
+
+        def fn(layer):
+            if hasattr(layer, "enable_recompute") and (layer.enable_recompute is False or layer.enable_recompute == 0):
+                layer.enable_recompute = True
+
+        model.apply(fn)
+
+    # Create the learning_rate sheduler and optimizer
+    if training_args.decay_steps is None:
+        training_args.decay_steps = training_args.max_steps
+
+    if training_args.warmup_steps > 0:
+        warmup_steps = training_args.warmup_steps
+    else:
+        warmup_steps = training_args.warmup_ratio * training_args.max_steps
+
+    lr_scheduler = None
+    if training_args.lr_scheduler_type.value == "cosine":
+        lr_scheduler = CosineAnnealingWithWarmupDecay(
+            max_lr=training_args.learning_rate,
+            min_lr=training_args.min_learning_rate,
+            warmup_step=warmup_steps,
+            decay_step=training_args.decay_steps,
+            last_epoch=0,
+        )
+    elif training_args.lr_scheduler_type.value == "linear":
+        lr_scheduler = LinearAnnealingWithWarmupDecay(
+            max_lr=training_args.learning_rate,
+            min_lr=training_args.min_learning_rate,
+            warmup_step=warmup_steps,
+            decay_step=training_args.decay_steps,
+            last_epoch=0,
+        )
+
+    data_file = get_train_data_file(data_args)
+    train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset(
+        data_args,
+        training_args,
+        data_file,
+        tokenizer,
+        need_data=training_args.should_load_dataset,
+    )
+    trainer = PretrainingTrainer(
+        model=model,
+        criterion=criterion,
+        args=training_args,
+        data_collator=data_collator,
+        train_dataset=train_dataset if training_args.do_train else None,
+        eval_dataset=eval_dataset if training_args.do_eval else None,
+        optimizers=(None, lr_scheduler),
+        tokenizer=tokenizer,
+    )
+
+    checkpoint = None
+    if training_args.resume_from_checkpoint is not None:
+        checkpoint = training_args.resume_from_checkpoint
+    elif last_checkpoint is not None:
+        checkpoint = last_checkpoint
+
+    # Training
+    if training_args.do_train:
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
+
+        # NOTE(gongenlei): new add
+        if not training_args.autotuner_benchmark:
+            metrics = train_result.metrics
+            if not int(os.getenv("test_ci_no_save_model", 0)):
+                trainer.save_model()
+            trainer.log_metrics("train", metrics)
+            trainer.save_metrics("train", metrics)
+            trainer.save_state()
+
+    if training_args.do_predict:
+        test_ret = trainer.predict(test_dataset)
+        trainer.log_metrics("test", test_ret.metrics)
+
+    # if training_args.should_load_dataset:
+    #     effective_tokens_per_second = total_effective_tokens / train_result.metrics["train_runtime"]
+    #     print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}")
+    #     print(f"ips: {effective_tokens_per_second:.2f} tokens/s")
+
+
+def shard_model(model):
+    pp_stage = 0
+    for name, layer in model.named_sublayers(include_self=False):
+        if hasattr(layer, "ipp"):
+            pp_stage = layer.ipp
+        # print(f"name {name},pp_stage {pp_stage}==>", type(layer))
+        if "embed_tokens" in name:
+            # embedding only support column split now. it will update in the future
+            shard_fn(layer, 0, [dist.Replicate(), dist.Shard(1)])
+        for n in [
+            "self_attn.q_proj",
+            "self_attn.k_proj",
+            "self_attn.v_proj",
+            "self_attn.qkv_proj",
+            "gate_proj",
+            "up_proj",
+            "gate_up_fused_proj",
+        ]:
+            if n in name:
+                shard_fn(layer, pp_stage, [dist.Replicate(), dist.Shard(1)])
+                break
+        for n in ["self_attn.o_proj", "down_proj"]:
+            if n in name:
+                shard_fn(layer, pp_stage, [dist.Replicate(), dist.Shard(0)])
+                break
+        if "lm_head" in name:
+            shard_fn(layer, -1, [dist.Replicate(), dist.Shard(1)])
+
+
+def load_model(model):
+    model_state_dict = model.state_dict()
+    state_dict = paddle.load("hand/all.pdparams")
+    tmp = OrderedDict()
+    (tmp, state_dict) = (state_dict, tmp)
+    for (k, v) in tmp.items():
+        k = map_structure_name(k)
+        state_dict[k] = v
+    model.set_state_dict(state_dict)
+    assert len(model_state_dict) == len(state_dict), f"{len(model_state_dict)} vs {len(state_dict)}"
+    """
+    print("=======model_state_dict=======")
+    for (k,v) in model_state_dict.items():
+        print(f"{k}=>{v.shape}")
+    """
+    print("=======state_dict=======")
+    for (k, v) in state_dict.items():
+        assert k in model_state_dict
+        print(f"{k}=>{v.shape}")
+
+
+def print_grad(model):
+    model_state_dict = model.state_dict()
+    name_mapping = {v.name: k for (k, v) in model_state_dict.items()}
+    for p in model.parameters():
+        assert p.name in name_mapping
+        if p.grad is not None:
+            print(f"{name_mapping[p.name]} {p.name}_grad shape: {p.grad.shape} md5sum: {p.grad._md5sum()}")
+
+
+def print_param(model):
+    model_state_dict = model.state_dict()
+    name_mapping = {v.name: k for (k, v) in model_state_dict.items()}
+    for p in model.parameters():
+        assert p.name in name_mapping
+        if p.grad is not None:
+            print(f"{name_mapping[p.name]} {p.name} shape: {p.shape} md5sum: {p._md5sum()}")
+
+
+def map_structure_name(k):
+    fs = k.split(".")
+    idx = int(fs[1])
+    if idx == 0:
+        return "deepseek_v2.embed_tokens.weight"
+    if idx == 28:
+        return "deepseek_v2.norm.weight"
+    if idx == 29:
+        return "lm_head.weight"
+    else:
+        return f"deepseek_v2.layers.{idx-1}." + ".".join(fs[2:])
+
+
+if __name__ == "__main__":
+    main()
diff --git a/llm/auto_parallel/deepseek-v3/run_pretrain_auto.sh b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.sh
new file mode 100644
index 000000000000..15dd24f8a0c5
--- /dev/null
+++ b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.sh
@@ -0,0 +1,80 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#!/bin/bash
+set -x
+unset CUDA_VISIBLE_DEVICES
+
+task_name="deepseekv3"
+rm -rf output/$task_name/
+rm -rf "output/$task_name""_log"
+
+export SOT_LOG_LEVEL=4
+export PYTHONPATH=../../../:$PYTHONPATH
+#ulimit -c unlimited
+# export GLOG_v=3
+
+# export FLAGS_call_stack_level=3
+# export FLAGS_use_cuda_managed_memory=true
+
+# export FLAGS_embedding_deterministic=1        
+# export FLAGS_cudnn_deterministic=1
+# export NVIDIA_TF32_OVERRIDE=0
+
+to_static=0  # 是否开启动转静训练
+
+python -u  -m paddle.distributed.launch \
+    --gpus "0,1,2,3" \
+    --log_dir  "output/$task_name""_log" \
+    run_pretrain_auto.py \
+    --model_type "deepseekv3_auto" \
+    --model_name_or_path "deepseek-ai/DeepSeek-V3" \
+    --tokenizer_name_or_path "deepseek-ai/DeepSeek-V3" \
+    --input_dir "./data" \
+    --output_dir "output/$task_name" \
+    --split 949,50,1 \
+    --max_seq_length 2048 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 2 \
+    --gradient_accumulation_steps 2 \
+    --use_flash_attention 0 \
+    --use_fused_rms_norm 1 \
+    --fp16 0 \
+    --fp16_opt_level "O2"  \
+    --scale_loss 1024 \
+    --pipeline_parallel_degree 1 \
+    --tensor_parallel_degree 2 \
+    --sharding_parallel_degree 2 \
+    --learning_rate 0.0001 \
+    --min_learning_rate 0.00001 \
+    --max_steps 2 \
+    --save_steps 5000000 \
+    --weight_decay 0.01 \
+    --warmup_ratio 0.01 \
+    --logging_steps 1\
+    --dataloader_num_workers 1 \
+    --sharding "stage1" \
+    --eval_steps 1000000 \
+    --disable_tqdm true \
+    --continue_training 0\
+    --recompute 0 \
+    --do_train \
+    --do_eval \
+    --device "gpu" \
+    --data_impl "mmap" \
+    --enable_auto_parallel 1 \
+    --max_grad_norm 1.0 \
+    --num_hidden_layers 1 \
+    --use_intermediate_api true \
+    --to_static $to_static \
diff --git a/llm/auto_parallel/llama/README.md b/llm/auto_parallel/llama/README.md
index 529f7ba2a8f3..c6aa0324b708 100644
--- a/llm/auto_parallel/llama/README.md
+++ b/llm/auto_parallel/llama/README.md
@@ -45,19 +45,30 @@ import paddle.distributed as dist
 
 ckpt_path='/path/for/dist_ckpt'
 # offload=1, 参数 offload 到 CPU，减少显存占用
-merged_state_dict = dist.checkpoint.load_state_dict.load_merged_state_dict(ckpt_path, offload=1)
+# prefix="model" 参数可用于过滤掉非模型参数，例如 optimizer 状态等
+merged_state_dict = dist.checkpoint.load_state_dict.load_merged_state_dict(ckpt_path, offload=1, prefix="model")
 paddle.save(merged_state_dict, 'model_state.pdparams')
 
-# 上述合并的模型参数格式为Paddle原生格式，如需转换为unified_param格式(safetensors)，可继续执行如下代码：
-python PaddleNLP/llm/auto_parallel/utils/convert_to_safetensors.py --input_path input_path  [--output_path output_path] [--split_num split_num] [--offload offload]
+# 上述合并的模型参数格式为Paddle原生格式，如需转换为unified checkpoint格式(safetensors)，或需获取模型参数的index文件，继续执行如下代码：
+python PaddleNLP/llm/auto_parallel/utils/convert_to_safetensors.py --input_path input_path  [--output_path output_path] [--split_num split_num] [--offload] [--as_safetensors]
 
 # 参数介绍
 --input_path: 输入的单卡模型参数路径
 --output_path: 可选，输出模型参数路径，默认为'./temp'
 --split_num: 可选，输出的模型参数分片数，默认为 1
---offload: 可选，是否将参数 offload 到 CPU，默认为 false
+--offload: 可选，选项用于控制是否将参数 offload 到 CPU
+--as_safetensors: 可选，选项用于控制是否将模型参数转换为 safetensors 格式
 ```
 
 - 动态图推理
 
     [大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md)
+
+## 5.PPO 训练
+自动并行当前尚未支持 PPO 训练，后续会持续支持。但您可以将自动并行训练得到的模型参数转换后用于 PPO 训练。自动并行 ckpt 转手动并行 ckpt 流程参考**推理**部分。
+
+- PPO 训练
+
+    [PPO 训练教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/rlhf.md)
+
+- 注：PPO 训练教程中 PKU-Alignment/alpaca-7b-reproduced 模型是一个类 llama 模型，但与原生 llama 模型结构存在一定差异，具体为 embedding 层和 lm_head 层 shape 不同，原生 llama 的 shape 为 [4096, 32000]，但 PKU-Alignment/alpaca-7b-reproduced 的 shape 为 [4096, 32001]。
diff --git a/llm/auto_parallel/llama/run_llama2_13b_xpu.sh b/llm/auto_parallel/llama/run_llama2_13b_xpu.sh
new file mode 100755
index 000000000000..301d19a38bb1
--- /dev/null
+++ b/llm/auto_parallel/llama/run_llama2_13b_xpu.sh
@@ -0,0 +1,106 @@
+#!/bin/bash
+
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+task_name_or_path="llama2-13b-auto"
+
+#export XPUAPI_DEBUG=0x1
+#export XPURT_DISPATCH_MODE=PROFILING
+export XBLAS_FC_HBM_VERSION=40
+
+# PaddlePaddle
+export FLAGS_use_stride_kernel="0"
+export XPU_PADDLE_L3_SIZE=98566144 # 94 MB
+export XPU_CDNN_CLUSTER_PARALLEL=1
+export XPU_CDNN_CLUSTER_PARALLEL_STREAM_NUMBER=2
+
+# PDC
+unset PADDLE_ELASTIC_JOB_ID
+unset PADDLE_TRAINER_ENDPOINTS
+unset DISTRIBUTED_TRAINER_ENDPOINTS
+unset FLAGS_START_PORT
+unset PADDLE_ELASTIC_TIMEOUT
+unset PADDLE_TRAINERS_NUM
+
+# BKCL
+# export BKCL_DEBUG=1
+# Multi-computer RDMA
+#export BKCL_ENABLE_XDR=1
+#export BKCL_RDMA_FORCE_TREE=1
+#export BKCL_TREE_THRESHOLD=0
+#export BKCL_RDMA_NICS=xgbe1,xgbe1,xgbe2,xgbe2,xgbe3,xgbe3,xgbe4,xgbe4
+#export BKCL_SOCKET_IFNAME=xgbe0
+#export BKCL_FORCE_L3_RDMA=0
+export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib64
+echo "bkcl version:"
+strings ${bkcl_location}/libbkcl.so | grep COM
+
+export CUDA_DEVICE_MAX_CONNECTIONS=8
+
+#PYTHONPATH
+export PYTHONPATH=../../../:$PYTHONPATH
+
+# for debug
+#export GLOG_v=10
+export FLAGS_call_stack_level=2
+
+rm -rf output/$task_name_or_path
+PYTHONPATH=../:$PYTHONPATH  \
+python -u  -m paddle.distributed.launch \
+    --xpus "0,1,2,3,4,5,6,7" \
+    --log_dir "output/$task_name_or_path/" \
+    run_pretrain_auto.py \
+    --model_name_or_path "meta-llama/Llama-2-13b" \
+    --tokenizer_name_or_path "meta-llama/Llama-2-13b" \
+    --input_dir "./data" \
+    --output_dir "output/$task_name_or_path" \
+    --split 949,50,1 \
+    --max_seq_length 4096 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --use_flash_attention 1 \
+    --use_fused_rope 1 \
+    --fuse_attention_ffn 1 \
+    --fuse_attention_qkv 1 \
+    --use_fused_rms_norm 0 \
+    --num_hidden_layers 4 \
+    --bf16 \
+    --fp16_opt_level "O2"  \
+    --amp_master_grad true \
+    --scale_loss 1024 \
+    --learning_rate 0.00003 \
+    --min_learning_rate 0.000005 \
+    --lr_scheduler_type "cosine" \
+    --max_steps 10 \
+    --save_steps 100000 \
+    --weight_decay 0.01 \
+    --warmup_ratio 0.01 \
+    --max_grad_norm 1.0 \
+    --logging_steps 1 \
+    --sequence_parallel 0 \
+    --dataloader_num_workers 4 \
+    --pipeline_parallel_degree 1 \
+    --tensor_parallel_degree 1 \
+    --gradient_accumulation_steps 1 \
+    --eval_steps 1000 \
+    --report_to "visualdl" \
+    --disable_tqdm true \
+    --continue_training 0 \
+    --recompute 0 \
+    --do_train \
+    --seed 1026 \
+    --device "xpu" \
+    --enable_auto_parallel 1 \
+    --to_static 1
diff --git a/llm/auto_parallel/llama/run_pretrain_auto.py b/llm/auto_parallel/llama/run_pretrain_auto.py
index fa3d8855afb5..24e737de544b 100644
--- a/llm/auto_parallel/llama/run_pretrain_auto.py
+++ b/llm/auto_parallel/llama/run_pretrain_auto.py
@@ -59,6 +59,7 @@
     print_rank_0,
 )
 from paddlenlp.trainer.utils.doc import add_start_docstrings
+from paddlenlp.utils.tools import get_env_device
 
 
 @dataclass
@@ -173,6 +174,11 @@ class ModelArguments:
         default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
     )
 
+    use_fast_layer_norm: bool = field(
+        default=False,
+        metadata={"help": "GPT3 model, use fast layernorm"},
+    )
+
     config_name: Optional[str] = field(
         default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
     )
@@ -496,6 +502,8 @@ def main():
 
     config = config_class.from_pretrained(model_args.model_name_or_path)
 
+    config.use_fast_layer_norm = model_args.use_fast_layer_norm
+
     config.seq_length = data_args.max_seq_length
     # There are some technique extend RotaryEmbedding context. so don't change max_position_embeddings
     if not model_args.continue_training:
@@ -544,6 +552,15 @@ def main():
         pipeline = training_args.strategy.pipeline
         pipeline.vpp_degree = config.virtual_pp_degree
         pipeline.vpp_seg_method = training_args.virtual_pipeline_seg_method
+    if get_env_device() == "xpu" and training_args.gradient_accumulation_steps > 1:
+        try:
+            from paddle_xpu.layers.nn.linear import LinearConfig  # noqa: F401
+
+            LinearConfig.enable_accumulate_steps_opt()
+            LinearConfig.set_accumulate_steps(training_args.gradient_accumulation_steps)
+        except ImportError:
+            # It's OK, not use accumulate_steps optimization
+            pass
 
     print("Final pre-training config:", config)
 
diff --git a/llm/auto_parallel/utils/convert_to_safetensors.py b/llm/auto_parallel/utils/convert_to_safetensors.py
index 6f000e1e8955..c07e91af18b2 100644
--- a/llm/auto_parallel/utils/convert_to_safetensors.py
+++ b/llm/auto_parallel/utils/convert_to_safetensors.py
@@ -19,10 +19,12 @@
 from safetensors.numpy import save_file as safe_save_file
 
 from paddlenlp.transformers.utils import dtype_byte_size
-from paddlenlp.utils.env import SAFE_WEIGHTS_INDEX_NAME
+from paddlenlp.utils.env import PADDLE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME
 
 
-def convert_to_unified_ckpt(path: str, output_dir: str = "./tmp", split_num: int = 1, offload: bool = False):
+def convert_to_unified_ckpt(
+    path: str, output_dir: str = "./tmp", split_num: int = 1, offload: bool = False, as_safetensors: bool = False
+):
     """
     Convert a single card checkpoint to the unified format.
 
@@ -31,9 +33,10 @@ def convert_to_unified_ckpt(path: str, output_dir: str = "./tmp", split_num: int
         output_dir (str, optional): The directory where the converted files will be saved. Defaults to ".".
         split_num (int, optional): The number of shards to split the weights into output_dir. Defaults to 1.
         offload (bool, optional): Whether to offload the weights to CPU memory before saving them. Defaults to False.
+        as_safetensors (bool, optional): Whether to save the weights as safetensors. Defaults to False.
     """
 
-    def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file, total_size):
+    def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file, total_size, as_safetensors):
         """
         Get the sub-state dict and update the index weight file and total size.
         Args:
@@ -42,8 +45,12 @@ def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file,
             weight_filename (str): The filename of the corresponding weight file.
             index_weight_file (dict): The dictionary containing the mapping from keys to their corresponding weight filenames.
             total_size (int): The total size of the model so far.
+            as_safetensors (bool): Whether to save the weights as safetensors.
         """
-        sub_state_dict = {key: state_dict[key].numpy() for key in sub_keys}
+        if as_safetensors:
+            sub_state_dict = {key: state_dict[key].numpy() for key in sub_keys}
+        else:
+            sub_state_dict = {key: state_dict[key] for key in sub_keys}
         for key in sub_keys:
             index_weight_file[key] = weight_filename
             total_size += state_dict[key].numel().item() * dtype_byte_size(state_dict[key].dtype)
@@ -65,12 +72,21 @@ def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file,
         current_size = split_size + (1 if rank < extra_keys else 0)
         sub_keys = all_keys[index : index + current_size]
         index += current_size
-        weight_filename = f"model-{rank+1:04d}-of-{split_num:04d}.safetensors"
+        if as_safetensors:
+            weight_filename = f"model-{rank+1:04d}-of-{split_num:04d}.safetensors"
+        else:
+            weight_filename = f"model_state-{rank+1:04d}-of-{split_num:04d}.pdparams"
         sub_state_dict, total_size = get_sub_state_dict(
-            sub_keys, state_dict, weight_filename, index_weight_file, total_size
+            sub_keys, state_dict, weight_filename, index_weight_file, total_size, as_safetensors
         )
-        safe_save_file(sub_state_dict, os.path.join(output_dir, weight_filename))
-    with open(os.path.join(output_dir, SAFE_WEIGHTS_INDEX_NAME), "w") as f:
+        if as_safetensors:
+            safe_save_file(sub_state_dict, os.path.join(output_dir, weight_filename), metadata={"format": "np"})
+            index_file_name = SAFE_WEIGHTS_INDEX_NAME
+        else:
+            paddle.save(sub_state_dict, os.path.join(output_dir, weight_filename))
+            index_file_name = PADDLE_WEIGHTS_INDEX_NAME
+
+    with open(os.path.join(output_dir, index_file_name), "w") as f:
         json.dump({"metadata": {"total_size": total_size}, "weight_map": index_weight_file}, f, indent=4)
 
 
@@ -86,7 +102,10 @@ def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file,
         "--split_num", type=int, default=1, help="The number of shards to split the weights into output_dir."
     )
     parser.add_argument(
-        "--offload", type=bool, help="Whether to offload the weights to CPU memory before saving them."
+        "--offload", action="store_true", help="Whether to offload the weights to CPU memory before saving them."
+    )
+    parser.add_argument(
+        "--as_safetensors", action="store_true", help="Save the weights as safetensors instead of pdparams."
     )
     args = parser.parse_args()
-    convert_to_unified_ckpt(args.input_path, args.output_dir, args.split_num, args.offload)
+    convert_to_unified_ckpt(args.input_path, args.output_dir, args.split_num, args.offload, args.as_safetensors)
diff --git a/llm/config/deepseek-v2/pretrain_argument.json b/llm/config/deepseek-v2/pretrain_argument.json
index 9bc889e13f85..8ab15be1f5d9 100644
--- a/llm/config/deepseek-v2/pretrain_argument.json
+++ b/llm/config/deepseek-v2/pretrain_argument.json
@@ -4,10 +4,10 @@
     "input_dir": "./data",
     "output_dir": "./checkpoints/pretrain_ckpts",
     "per_device_train_batch_size": 1,
-    "gradient_accumulation_steps": 1,
+    "gradient_accumulation_steps": 32,
     "per_device_eval_batch_size": 1,
     "tensor_parallel_degree": 1,
-    "pipeline_parallel_degree": 1,
+    "pipeline_parallel_degree": 8,
     "sharding_parallel_degree": 1,
     "sharding": "stage2",
     "virtual_pp_degree": 1,
diff --git a/llm/config/llama/dpo_argument.json b/llm/config/llama/dpo_argument.json
index 60065fbb7a1d..510cd2d475d2 100644
--- a/llm/config/llama/dpo_argument.json
+++ b/llm/config/llama/dpo_argument.json
@@ -1,5 +1,5 @@
 {
-    "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
+    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
     "train_dataset_path": "./data/train.jsonl",
     "dev_dataset_path": "./data/dev.jsonl",
     "output_dir": "./checkpoints/dpo_ckpts",
diff --git a/llm/config/llama/pretrain_argument.json b/llm/config/llama/pretrain_argument.json
index dff5b322337e..304b6d7822a2 100644
--- a/llm/config/llama/pretrain_argument.json
+++ b/llm/config/llama/pretrain_argument.json
@@ -28,7 +28,7 @@
     "warmup_ratio": 0.01,
     "max_grad_norm": 1.0,
     "dataloader_num_workers": 1,
-    "continue_training": 1,
+    "continue_training": 0,
     "do_train": true,
     "do_eval": true,
     "do_predict": true,
diff --git a/llm/config/qwen/dpo_argument_0p5b.json b/llm/config/qwen/dpo_argument_0p5b.json
new file mode 100644
index 000000000000..4799d83bda6e
--- /dev/null
+++ b/llm/config/qwen/dpo_argument_0p5b.json
@@ -0,0 +1,39 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "train_dataset_path": "./data/train.jsonl",
+    "dev_dataset_path": "./data/dev.jsonl",
+    "output_dir": "./checkpoints/dpo_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 1,
+    "num_train_epochs": 1,
+    "max_steps": 100,
+    "learning_rate": 1e-06,
+    "warmup_steps": 10,
+    "logging_steps": 1,
+    "evaluation_strategy": "steps",
+    "save_strategy": "steps",
+    "eval_steps": 100,
+    "save_steps": 500,
+    "max_seq_len": 2048,
+    "max_prompt_len": 1024,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "tensor_parallel_degree": 1,
+    "sharding": "stage1",
+    "use_flash_attention": false,
+    "flash_mask": false,
+    "recompute": true,
+    "recompute_granularity": "full",
+    "benchmark": false,
+    "unified_checkpoint": true,
+    "autotuner_benchmark":false,
+    "beta": 0.1,
+    "loss_type": "sigmoid",
+    "greedy_zero_padding": false,
+    "label_smoothing": 0.0
+  }
diff --git a/llm/config/qwen/lora_argument.json b/llm/config/qwen/lora_argument.json
index aeb0d5d61f92..a00845c2263a 100644
--- a/llm/config/qwen/lora_argument.json
+++ b/llm/config/qwen/lora_argument.json
@@ -4,7 +4,7 @@
     "output_dir": "./checkpoints/lora_ckpts",
     "per_device_train_batch_size": 4,
     "gradient_accumulation_steps": 4,
-    "per_device_eval_batch_size": 8,
+    "per_device_eval_batch_size": 4,
     "eval_accumulation_steps":16,
     "num_train_epochs": 3,
     "learning_rate": 3e-04,
diff --git a/llm/config/qwen/lora_argument_0p5b.json b/llm/config/qwen/lora_argument_0p5b.json
new file mode 100644
index 000000000000..88014ac90268
--- /dev/null
+++ b/llm/config/qwen/lora_argument_0p5b.json
@@ -0,0 +1,34 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/lora_ckpts",
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 2,
+    "eval_accumulation_steps": 32,
+    "num_train_epochs": 3,
+    "learning_rate": 3e-04,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "eval_with_do_generation": false,
+    "metric_for_best_model": "accuracy",
+    "recompute": true,
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "lora": true,
+    "unified_checkpoint": true,
+    "zero_padding": false,
+    "use_flash_attention": false,
+    "pissa": false
+  }
diff --git a/llm/config/qwen/pretrain_argument_0p5b.json b/llm/config/qwen/pretrain_argument_0p5b.json
new file mode 100644
index 000000000000..a0e2ff37c3d2
--- /dev/null
+++ b/llm/config/qwen/pretrain_argument_0p5b.json
@@ -0,0 +1,40 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B",
+    "tokenizer_name_or_path": "Qwen/Qwen2.5-0.5B",
+    "input_dir": "./data",
+    "output_dir": "./checkpoints/pretrain_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 1,
+    "per_device_eval_batch_size": 2,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "sharding": "stage2",
+    "virtual_pp_degree": 1,
+    "sequence_parallel": 0,   
+    "use_flash_attention": false,
+    "use_fused_rms_norm": false,
+    "max_seq_length": 1024,
+    "learning_rate": 3e-05,
+    "min_learning_rate": 3e-06,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "max_steps": 10000,
+    "save_steps": 5000,
+    "eval_steps": 1000,
+    "weight_decay": 0.01,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "warmup_ratio": 0.01,
+    "max_grad_norm": 1.0,
+    "dataloader_num_workers": 1,
+    "continue_training": 0,
+    "do_train": true,
+    "do_eval": true,
+    "do_predict": true,
+    "disable_tqdm": true,
+    "recompute": false,
+    "distributed_dataloader": 1,
+    "recompute_granularity": "full",
+    "unified_checkpoint": true,
+    "save_total_limit": 2
+  }
diff --git a/llm/config/qwen/pt_argument.json b/llm/config/qwen/pt_argument.json
index b70e4a144c75..85ecd8ab004c 100644
--- a/llm/config/qwen/pt_argument.json
+++ b/llm/config/qwen/pt_argument.json
@@ -4,8 +4,8 @@
     "output_dir": "./checkpoints/pt_ckpts",
     "per_device_train_batch_size": 4,
     "gradient_accumulation_steps": 4,
-    "per_device_eval_batch_size": 8,
-    "eval_accumulation_steps":16,
+    "per_device_eval_batch_size": 4,
+    "eval_accumulation_steps": 32,
     "num_train_epochs": 3,
     "learning_rate": 3e-02,
     "warmup_steps": 30,
diff --git a/llm/config/qwen/pt_argument_0p5b.json b/llm/config/qwen/pt_argument_0p5b.json
new file mode 100644
index 000000000000..4ebb18ace09c
--- /dev/null
+++ b/llm/config/qwen/pt_argument_0p5b.json
@@ -0,0 +1,31 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/pt_ckpts",
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 4,
+    "eval_accumulation_steps": 32,
+    "num_train_epochs": 3,
+    "learning_rate": 3e-02,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "eval_with_do_generation": false,
+    "metric_for_best_model": "accuracy",
+    "recompute": true,
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "prefix_tuning": true,
+    "use_flash_attention": false
+  }
diff --git a/llm/config/qwen/sft_argument_0p5b.json b/llm/config/qwen/sft_argument_0p5b.json
new file mode 100644
index 000000000000..e5f05bc5e2cd
--- /dev/null
+++ b/llm/config/qwen/sft_argument_0p5b.json
@@ -0,0 +1,33 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/sft_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 4,
+    "per_device_eval_batch_size": 8,
+    "eval_accumulation_steps":16,
+    "num_train_epochs": 3,
+    "learning_rate": 3e-05,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "eval_with_do_generation": false,
+    "metric_for_best_model": "accuracy",
+    "recompute": true,
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "sharding": "stage2",
+    "zero_padding": false,
+    "unified_checkpoint": true,
+    "use_flash_attention": false
+  }
diff --git a/llm/config/qwen/sft_argument_0p5b_best.json b/llm/config/qwen/sft_argument_0p5b_best.json
new file mode 100644
index 000000000000..5ad6b466aab0
--- /dev/null
+++ b/llm/config/qwen/sft_argument_0p5b_best.json
@@ -0,0 +1,37 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/sft_ckpts",
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 2,
+    "per_device_eval_batch_size": 8,
+    "eval_accumulation_steps":16,
+    "num_train_epochs": 3,
+    "learning_rate": 3e-05,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "eval_with_do_generation": false,
+    "metric_for_best_model": "accuracy",
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "sharding": "stage2",
+    "zero_padding": true,
+    "unified_checkpoint": true,
+    "fuse_attention_qkv": true,
+    "fuse_attention_ffn": true,
+    "use_fused_rms_norm": true,
+    "use_fused_rope": true,
+    "use_fused_linear_cross_entropy": true,
+    "use_flash_attention": true
+  }
diff --git a/llm/docs/finetune.md b/llm/docs/finetune.md
index d5e326fa1197..47cde5ab8219 100644
--- a/llm/docs/finetune.md
+++ b/llm/docs/finetune.md
@@ -19,7 +19,7 @@
 </div>
 <div align="center">
     <font size ="1">
-    飞桨与 Huggingface Transformers 微调性能比对
+    大模型精调原理介绍
      </font>
 </div>
 
diff --git a/llm/docs/predict/best_practices.md b/llm/docs/predict/best_practices.md
index 77b29fcb5ebe..31c9382a7c4c 100644
--- a/llm/docs/predict/best_practices.md
+++ b/llm/docs/predict/best_practices.md
@@ -1,4 +1,4 @@
-# 最佳实践
+# 高性能推理最佳实践
 
 PaddleNLP 提供了多种环境变量，用于优化推理性能和资源使用。下面提供一些调整 PaddleNLP 推理性能的最佳实践。
 
@@ -29,6 +29,6 @@ PaddleNLP 提供了多种环境变量，用于优化推理性能和资源使用
 
 **Append Attention 优化**
 
-- `FLAGS_cascade_attention_max_partition_size`：Append Attention decoder计算时对cache_kv进行分chunk的chunk大小，默认值根据batchsize设置，batchsize=1时设置为128，batchsize>1时设置为512。显式设置时不再区分batchsize。
-- `FLAGS_dec_block_shape_q`：Append Attention decoder计算时对q进行分块的分块大小，默认值为16。
-- `FLAGS_enc_block_shape_q`：Append Attention encoder计算时对q进行分块的分块大小，默认值为64。
+- `FLAGS_cascade_attention_max_partition_size`：Append Attention decoder 计算时对 cache_kv 进行分 chunk 的 chunk 大小，默认值根据 batchsize 设置，batchsize=1时设置为128，batchsize>1时设置为512。显式设置时不再区分 batchsize。
+- `FLAGS_dec_block_shape_q`：Append Attention decoder 计算时对 q 进行分块的分块大小，默认值为16。
+- `FLAGS_enc_block_shape_q`：Append Attention encoder 计算时对 q 进行分块的分块大小，默认值为64。
diff --git a/llm/docs/predict/inference.md b/llm/docs/predict/inference.md
index 9c3439682573..2c1dcecd35a1 100644
--- a/llm/docs/predict/inference.md
+++ b/llm/docs/predict/inference.md
@@ -25,6 +25,7 @@ PaddleNLP 大模型推理提供压缩、推理、服务全流程体验 ：
 ## 1. 模型支持
 
 PaddleNLP 中已经添加高性能推理模型相关实现，已验证过的模型如下：
+
 | Models | Example Models |
 |--------|----------------|
 |Llama 3.x, Llama 2|`meta-llama/Llama-3.2-3B-Instruct`, `meta-llama/Meta-Llama-3.1-8B`, `meta-llama/Meta-Llama-3.1-8B-Instruct`, `meta-llama/Meta-Llama-3.1-405B`, `meta-llama/Meta-Llama-3.1-405B-Instruct`,`meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-Guard-3-8B`, `Llama-2-7b, meta-llama/Llama-2-7b-chat`, `meta-llama/Llama-2-13b`, `meta-llama/Llama-2-13b-chat`, `meta-llama/Llama-2-70b`, `meta-llama/Llama-2-70b-chat`|
@@ -170,6 +171,77 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --
 2. `a8w8`与`a8w8_fp8`需要额外的 act 和 weight 的 scale 校准表，推理传入的 `model_name_or_path` 为 PTQ 校准产出的量化模型。量化模型导出参考[大模型量化教程](../quantization.md)。
 3. `cachekv_int8_type`可选`dynamic`（已不再维护，不建议使用）和`static`两种，`static`需要额外的 cache kv 的 scale 校准表，传入的 `model_name_or_path` 为 PTQ 校准产出的量化模型。量化模型导出参考[大模型量化教程](../quantization.md)。
 
+
+## 5. 服务化部署
+
+**高性能服务化部署请参考**：[静态图服务化部署教程](../../server/docs/deploy_usage_tutorial.md)。
+
+如果您想简单体验模型，我们提供了**简易的 Flash Server 动态图部署**方式，我们提供了一套基于动态图推理的简单易用 UI 服务化部署方法，用户可以快速部署服务化推理。
+
+环境准备
+
+- python >= 3.9
+- gradio
+- flask
+
+服务化部署脚本
+
+```shell
+# 单卡，可以使用 paddle.distributed.launch 启动多卡推理
+python  ./predict/flask_server.py \
+    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
+    --port 8010 \
+    --flask_port 8011 \
+    --dtype "float16"
+```
+
+- `port`: Gradio UI 服务端口号，默认8010。
+- `flask_port`: Flask 服务端口号，默认8011。
+
+图形化界面: 打开 `http://127.0.0.1:8010` 即可使用 gradio 图形化界面，即可开启对话。
+API 访问: 您也可用通过 flask 服务化 API 的形式.
+
+1. 可参考：`./predict/request_flask_server.py` 文件。
+```shell
+python predict/request_flask_server.py
+```
+
+2. 或者直接使用 curl,调用开始对话
+```shell
+curl 127.0.0.1:8011/v1/chat/completions \
+-H 'Content-Type: application/json' \
+-d '{"message": [{"role": "user", "content": "你好"}]}'
+```
+3.使用 OpenAI 客户端调用：
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="EMPTY",
+    base_url="http://localhost:8011/v1/",
+)
+
+# Completion API
+stream = True
+completion = client.chat.completions.create(
+    model="paddlenlp",
+    messages=[
+        {"role": "user", "content": "PaddleNLP好厉害！这句话的感情色彩是？"}
+    ],
+    max_tokens=1024,
+    stream=stream,
+)
+
+if stream:
+    for c in completion:
+        print(c.choices[0].delta.content, end="")
+else:
+    print(completion.choices[0].message.content)
+```
+该方式部署，性能一般，高性能服务化部署请参考：[静态图服务化部署教程](../../server/docs/deploy_usage_tutorial.md)。
+
+
+
 更多大模型推理教程：
 
 -  [llama](./llama.md)
@@ -188,7 +260,7 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --
 更多压缩、服务化推理体验：
 
 - [大模型量化教程](../quantization.md)
-- [服务化部署教程](https://github.com/PaddlePaddle/FastDeploy/blob/develop/README_CN.md)
+- [静态图服务化部署教程](../../server/docs/deploy_usage_tutorial.md)
 
 更多硬件大模型推理教程：
 
diff --git a/llm/docs/predict/installation.md b/llm/docs/predict/installation.md
index 4d077c1c9ed6..c1a57f6adf78 100644
--- a/llm/docs/predict/installation.md
+++ b/llm/docs/predict/installation.md
@@ -1,4 +1,4 @@
-# 安装
+# 高性能推理算子安装
 
 git clone 代码到本地：
 
@@ -7,17 +7,17 @@ git clone https://github.com/PaddlePaddle/PaddleNLP.git
 export PYTHONPATH=/path/to/PaddleNLP:$PYTHONPATH
 ```
 
-PaddleNLP 针对于Transformer 系列编写了高性能自定义算子，提升模型在推理和解码过程中的性能，使用之前需要预先安装自定义算子库：
+PaddleNLP 针对于 Transformer 系列编写了高性能自定义算子，提升模型在推理和解码过程中的性能，使用之前需要预先安装自定义算子库：
 
 ```shell
 #GPU设备安装自定义算子
 cd PaddleNLP/csrc && python setup_cuda.py install
 #XPU设备安装自定义算子
-cd PaddleNLP/csrc/xpu/src && sh cmake_build.sh
+# cd PaddleNLP/csrc/xpu/src && sh cmake_build.sh
 #DCU设备安装自定义算子
-cd PaddleNLP/csrc && python setup_hip.py install
+# cd PaddleNLP/csrc && python setup_hip.py install
 #SDAA设备安装自定义算子
-cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install
+# cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install
 ```
 
 到达运行目录，即可开始：
diff --git a/llm/predict/flask_server.py b/llm/predict/flask_server.py
index d467d6dac688..6a845b8e7ebd 100644
--- a/llm/predict/flask_server.py
+++ b/llm/predict/flask_server.py
@@ -11,11 +11,13 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+
 from __future__ import annotations
 
 import json
 import os
 import socket
+import time
 from contextlib import closing
 from dataclasses import asdict, dataclass, field
 from time import sleep
@@ -44,14 +46,13 @@ def __free_port(port):
             try:
                 s.bind(("", port))
                 return port
-            except:
+            except Exception:
                 return -1
 
     for port in range(port_l, port_u):
-        port = __free_port(port)
-        if port != -1:
-            return port
-
+        free = __free_port(port)
+        if free != -1:
+            return free
     return -1
 
 
@@ -66,17 +67,15 @@ class ServerArgument:
 
 class PredictorServer:
     def __init__(self, args: ServerArgument, predictor: BasePredictor):
-
         self.predictor = predictor
         self.args = args
         scan_l, scan_u = (
             self.args.flask_port + port_interval * predictor.tensor_parallel_rank,
             self.args.flask_port + port_interval * (predictor.tensor_parallel_rank + 1),
         )
-        self.total_max_length = predictor.config.src_length + predictor.config.max_length
+        self.total_max_length = predictor.config.total_max_length
 
         if self.predictor.tensor_parallel_rank == 0:
-            # fetch port info
             self.port = find_free_ports(scan_l, scan_u)
             self.peer_ports = {}
             while True and self.predictor.tensor_parallel_degree > 1:
@@ -84,120 +83,205 @@ def __init__(self, args: ServerArgument, predictor: BasePredictor):
                     with FileLock(FILE_LOCK), open(PORT_FILE, "r") as f:
                         cnt = 1
                         for line in f:
-                            data = json.loads(line)
-                            self.peer_ports[data["rank"]] = data["port"]
+                            port_data = json.loads(line)
+                            self.peer_ports[port_data["rank"]] = port_data["port"]
                             cnt += 1
-
                     if cnt == predictor.tensor_parallel_degree:
                         break
                     else:
                         print("waiting for port reach", cnt)
                 sleep(1)
         else:
-            # save port info
             self.port = find_free_ports(scan_l, scan_u)
             data = {"rank": predictor.tensor_parallel_rank, "port": self.port}
             with FileLock(FILE_LOCK), open(PORT_FILE, "a") as f:
                 f.write(json.dumps(data) + "\n")
-            print("rank: ", predictor.tensor_parallel_rank, " port info saving done.")
+            print("rank:", predictor.tensor_parallel_rank, " port info saving done.")
+
+    def stream_predict(self, input_texts: str | list[str]):
+        if hasattr(self.predictor, "stream_predict"):
+            return self.predictor.stream_predict(input_texts)
+        else:
+            return self.predictor.predict(input_texts)
 
     def predict(self, input_texts: str | list[str]):
-        return self.predictor.stream_predict(input_texts)
+        return self.predictor.predict(input_texts)
 
     def broadcast_msg(self, data):
+        import threading
+
+        def send_request(peer_port, data):
+            try:
+                url = f"http://0.0.0.0:{peer_port}/v1/chat/completions"
+                requests.post(url, json=data)
+            except Exception:
+                pass
+
         for _, peer_port in self.peer_ports.items():
             if peer_port != self.port:
-                _ = requests.post(f"http://0.0.0.0:{peer_port}/api/chat", json=data)
+                logger.info(f"broadcast_msg to {peer_port}")
+                # Here we need async call send_request to other card.
+                thread = threading.Thread(target=send_request, args=(peer_port, data))
+                thread.start()
 
     def start_flask_server(self):
         from flask import Flask, request, stream_with_context
 
         app = Flask(__name__)
 
-        @app.post("/api/chat")
+        @app.post("/v1/chat/completions")
         def _server():
             data = request.get_json()
-            logger.info(f"Request: {json.dumps(data, indent=2, ensure_ascii=False)}")
 
             if self.predictor.tensor_parallel_rank == 0:
                 self.broadcast_msg(data)
+            logger.info(f"Request: {json.dumps(data, indent=2, ensure_ascii=False)}")
 
-            def streaming(data):
-                query = data.pop("context", "")
-                history = data.pop("history", "")
-                data.pop("extra_info", None)
-
-                # build chat template
-                if self.predictor.tokenizer.chat_template is not None:
-                    if not history:
-                        history = []
-                    # also support history data
-                    elif isinstance(history, str):
+            # 处理 OpenAI 格式消息（支持 messages 字段）以及兼容原有格式
+            if "messages" in data:
+                messages = data["messages"]
+                if not messages:
+                    return json.dumps({"error": "Empty messages"}), 400
+                if messages[-1].get("role") == "user":
+                    query = messages[-1].get("content", "")
+                    history = []
+                    if len(messages) > 1:
+                        temp = []
+                        for msg in messages[:-1]:
+                            if msg.get("role") in ["user", "assistant"]:
+                                temp.append(msg.get("content", ""))
+                        if len(temp) % 2 != 0:
+                            temp = temp[1:]
+                        history = temp
+                else:
+                    query = ""
+                    history = [msg.get("content", "") for msg in messages if msg.get("role") in ["user", "assistant"]]
+                data["context"] = query
+                data["history"] = history
+            else:
+                data["context"] = data.get("context", "")
+                data["history"] = data.get("history", "")
+
+            # 判断是否采用流式返回，默认为非流式（可根据需求调整默认值）
+            is_stream = data.get("stream", False)
+
+            # 统一对 context/history 做处理，兼容 chat_template 格式
+            def process_input(query, history):
+                if isinstance(history, str):
+                    try:
                         history = json.loads(history)
-
-                    assert len(history) % 2 == 0
-                    chat_query = []
+                    except Exception:
+                        history = [history]
+                # 如果模型支持 chat_template，则转换为消息格式处理
+                if self.predictor.tokenizer.chat_template is not None:
+                    messages = []
                     for idx in range(0, len(history), 2):
-                        if isinstance(history[idx], str):
-                            chat_query.append([history[idx], history[idx + 1]])
-                        elif isinstance(history[idx], dict):
-                            chat_query.append([history[idx]["utterance"], history[idx + 1]["utterance"]])
-                        else:
-                            raise ValueError(
-                                "history data should be list[str] or list[dict], eg: ['sentence-1', 'sentece-2', ...], or "
-                                "[{'utterance': 'sentence-1'}, {'utterance': 'sentence-2'}, ...]"
+                        user_msg = history[idx] if isinstance(history[idx], str) else history[idx].get("utterance", "")
+                        messages.append({"role": "user", "content": user_msg})
+                        if idx + 1 < len(history):
+                            assistant_msg = (
+                                history[idx + 1]
+                                if isinstance(history[idx + 1], str)
+                                else history[idx + 1].get("utterance", "")
                             )
+                            messages.append({"role": "assistant", "content": assistant_msg})
+                    messages.append({"role": "user", "content": query})
+                    return messages
+                return query
+
+            # 提取生成参数
+            generation_args = data.copy()
+            query = generation_args.pop("context", "")
+            history = generation_args.pop("history", [])
+            query = process_input(query, history)
+
+            # 更新生成相关配置参数
+            self.predictor.config.max_length = generation_args.get(
+                "max_tokens", generation_args.get("max_length", self.predictor.config.max_length)
+            )
+            if "src_length" in generation_args:
+                self.predictor.config.src_length = generation_args["src_length"]
+
+            if self.predictor.config.src_length + self.predictor.config.max_length > self.total_max_length:
+                output = {
+                    "error_code": 1,
+                    "error_msg": (
+                        f"The sum of src_length<{self.predictor.config.src_length}> and max_length<{self.predictor.config.max_length}> "
+                        f"should be smaller than or equal to the max-total-length<{self.total_max_length}>"
+                    ),
+                }
+                return json.dumps(output, ensure_ascii=False), 400
+
+            self.predictor.config.top_p = generation_args.get("top_p", self.predictor.config.top_p)
+            self.predictor.config.temperature = generation_args.get("temperature", self.predictor.config.temperature)
+            self.predictor.config.top_k = generation_args.get("top_k", self.predictor.config.top_k)
+            self.predictor.config.repetition_penalty = generation_args.get(
+                "repetition_penalty", self.predictor.config.repetition_penalty
+            )
+
+            for key, value in generation_args.items():
+                setattr(self.args, key, value)
+
+            # 根据是否流式返回选择不同处理方式
+            if is_stream:
+                # 流式返回生成结果
+                def streaming(data):
+                    streamer = self.stream_predict(query)
+                    if self.predictor.tensor_parallel_rank != 0:
+                        return "done"
 
-                    # the input of predictor should be batched.
-                    # batched query: [ [[user, bot], [user, bot], ..., [user]]  ]
-                    query = [chat_query + [[query]]]
-
-                generation_args = data
-                self.predictor.config.max_length = generation_args["max_length"]
-                if "src_length" in generation_args:
-                    self.predictor.config.src_length = generation_args["src_length"]
-
-                if self.predictor.config.src_length + self.predictor.config.max_length > self.total_max_length:
-                    output = {
-                        "error_code": 1,
-                        "error_msg": f"The sum of src_length<{self.predictor.config.src_length}> and "
-                        f"max_length<{self.predictor.config.max_length}> should be smaller than or equal to "
-                        f"the max-total-length<{self.total_max_length}>",
-                    }
-                    yield json.dumps(output, ensure_ascii=False) + "\n"
-                    return
-
-                self.predictor.config.top_p = generation_args["top_p"]
-                self.predictor.config.temperature = generation_args["temperature"]
-                self.predictor.config.top_k = generation_args["top_k"]
-                self.predictor.config.repetition_penalty = generation_args["repetition_penalty"]
-
-                for key, value in generation_args.items():
-                    setattr(self.args, key, value)
-
-                streamer = self.predict(query)
-                if self.predictor.tensor_parallel_rank == 0:
                     for new_text in streamer:
                         if not new_text:
                             continue
-
-                        output = {
-                            "error_code": 0,
-                            "error_msg": "Success",
-                            "result": {"response": {"role": "bot", "utterance": new_text}},
+                        response_body = {
+                            "id": "YouID",
+                            "object": "chat.completion",
+                            "created": int(time.time()),
+                            "model": self.args.model_name_or_path,
+                            "choices": [
+                                {
+                                    "index": 0,
+                                    "delta": {
+                                        "role": "assistant",
+                                        "content": new_text,
+                                    },
+                                    "finish_reason": "stop",
+                                }
+                            ],
                         }
-                        yield json.dumps(output, ensure_ascii=False) + "\n"
-                else:
-                    return "done"
+                        yield f"data: {json.dumps(response_body, ensure_ascii=False)}\n\n"
+                    yield "data: [DONE]\n\n"
 
-            return app.response_class(stream_with_context(streaming(data)))
+                return app.response_class(stream_with_context(streaming(data)), mimetype="text/event-stream")
+
+            else:
+                # 非流式：一次性返回完整结果
+                result = self.predict(query)
+                if self.predictor.tensor_parallel_rank == 0:
+                    if type(result) is list and len(result) == 1:
+                        result = result[0]
+                    response_body = {
+                        "id": "YouID",
+                        "object": "chat.completion",
+                        "created": int(time.time()),
+                        "model": self.args.model_name_or_path,
+                        "choices": [
+                            {
+                                "index": 0,
+                                "message": {"role": "assistant", "content": result},
+                                "finish_reason": "stop",
+                            }
+                        ],
+                    }
+                    data = f"{json.dumps(response_body, ensure_ascii=False)}"
+                    return app.response_class(data, mimetype="application/json")
+                else:
+                    return app.response_class("done")
 
-        # set single thread to do prediction
-        # refer to: https://github.com/pallets/flask/blob/main/src/flask/app.py#L605
+        # 启动 Flask 服务（单线程预测）
         app.run(host="0.0.0.0", port=self.port, threaded=False)
 
     def start_ui_service(self, args, predictor_args):
-        # do not support start ui service in one command
         from multiprocessing import Process
 
         from gradio_ui import main
@@ -208,17 +292,16 @@ def start_ui_service(self, args, predictor_args):
 
 
 if __name__ == "__main__":
-
     parser = PdArgumentParser((PredictorArgument, ModelArgument, ServerArgument))
     predictor_args, model_args, server_args = parser.parse_args_into_dataclasses()
-    # check port
+    server_args.model_name_or_path = predictor_args.model_name_or_path
+
     if server_args.base_port is not None:
         logger.warning("`--base_port` is deprecated, please use `--flask_port` instead after 2023.12.30.")
-
         if server_args.flask_port is None:
             server_args.flask_port = server_args.base_port
         else:
-            logger.warning("`--base_port` and `--flask_port` are both set, `--base_port` will be ignored.")
+            logger.warning("Both `--base_port` and `--flask_port` are set; `--base_port` will be ignored.")
 
     log_dir = os.getenv("PADDLE_LOG_DIR", "./")
     PORT_FILE = os.path.join(log_dir, PORT_FILE)
@@ -226,10 +309,10 @@ def start_ui_service(self, args, predictor_args):
         os.remove(PORT_FILE)
 
     predictor = create_predictor(predictor_args, model_args)
-
-    server = PredictorServer(server_args, predictor)
-
+    server = PredictorServer(
+        server_args,
+        predictor,
+    )
     if server.predictor.tensor_parallel_rank == 0:
         server.start_ui_service(server_args, asdict(predictor.config))
-
     server.start_flask_server()
diff --git a/llm/predict/gradio_ui.py b/llm/predict/gradio_ui.py
index 5e43ec8ae12b..9dceb705710d 100644
--- a/llm/predict/gradio_ui.py
+++ b/llm/predict/gradio_ui.py
@@ -1,3 +1,4 @@
+#!/usr/bin/env python
 # Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,17 +16,31 @@
 from __future__ import annotations
 
 import argparse
-import copy
 import json
+import logging
+import re
 
 import gradio as gr
 import requests
 
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)
+console_handler = logging.StreamHandler()
+console_handler.setLevel(logging.INFO)
+formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
+console_handler.setFormatter(formatter)
+logger.addHandler(console_handler)
+
 
 def setup_args():
     """Setup arguments."""
     parser = argparse.ArgumentParser()
     parser.add_argument("--port", type=int, default=8073)
+    parser.add_argument("--api_key", type=str, default=None, help="Your API key")
+    parser.add_argument("--model", type=str, default="", help="Model name")
+    parser.add_argument("--title", type=str, default="PaddleNLP Chat", help="UI Title")
+    parser.add_argument("--sub_title", type=str, default="powered by paddlenlp team.", help="UI Sub Title")
+    parser.add_argument("--flask_port", type=int, default=None, help="The port of flask service")
     args = parser.parse_args()
     return args
 
@@ -52,137 +67,147 @@ def create_max_slider(value, maximum):
     )
 
 
+def remove_think_tags(text):
+    """
+    清除文本中 <think> 和 </think> 标签之间的所有字符。
+
+    Args:
+        text: 要处理的文本字符串。
+
+    Returns:
+        清除 <think> 和 </think> 标签之间内容的文本字符串。
+    """
+    pattern = re.compile(r"\\<think\\>.*?\\<\\\/think\\>", re.DOTALL)
+    # 将匹配到的部分替换为空字符串
+    cleaned_text = pattern.sub("", text).strip()
+    return cleaned_text
+
+
 def launch(args, default_params: dict = {}):
-    """Launch characters dialogue demo."""
+    """Launch chat UI with OpenAI API."""
 
     def rollback(state):
         """Rollback context."""
         context = state.setdefault("context", [])
-        utterance = context[-2]["utterance"]
-        context = context[:-2]
-        state["context"] = context
-        shown_context = get_shown_context(context)
-        return utterance, shown_context, context, state
+        # 回退时移除最后一次对话
+        if len(context) >= 2:
+            content = context[-2]["content"]
+            context = context[:-2]
+            state["context"] = context
+            shown_context = get_shown_context(context)
+            return content, shown_context, context, state
+        else:
+            gr.Warning("没有可撤回的对话历史")
+            return None, get_shown_context(context), context, state
 
-    def regen(state, top_k, top_p, temperature, repetition_penalty, max_length, src_length):
+    def regen(state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length):
         """Regenerate response."""
         context = state.setdefault("context", [])
         if len(context) < 2:
-            gr.Warning("don't have chat history")
+            gr.Warning("No chat history!")
             shown_context = get_shown_context(context)
             return None, shown_context, context, state
 
+        # 删除上一次回复，重新生成
         context.pop()
         user_turn = context.pop()
-        context.append({"role": "user", "utterance": user_turn["utterance"]})
-        context.append({"role": "bot", "utterance": ""})
+        context.append({"role": "user", "content": user_turn["content"]})
+        context.append({"role": "assistant", "content": ""})
         shown_context = get_shown_context(context)
-        return user_turn["utterance"], shown_context, context, state
+        return user_turn["content"], shown_context, context, state
 
-    def begin(utterance, state):
-        """Model inference."""
-        utterance = utterance.strip().replace("<br>", "\n")
+    def begin(content, state):
+        """记录用户输入，并初始化 bot 回复为空。"""
         context = state.setdefault("context", [])
 
-        if not utterance:
-            gr.Warning("invalid inputs")
-            # gr.Warning("请输入有效问题")
+        if not content:
+            gr.Warning("Invalid inputs")
             shown_context = get_shown_context(context)
             return None, shown_context, context, state
 
-        context.append({"role": "user", "utterance": utterance})
-        context.append({"role": "bot", "utterance": ""})
-
+        context.append({"role": "user", "content": content})
+        context.append({"role": "assistant", "content": ""})
         shown_context = get_shown_context(context)
-        return utterance, shown_context, context, state
+        return content, shown_context, context, state
 
-    def infer(utterance, state, top_k, top_p, temperature, repetition_penalty, max_length, src_length):
-        """Model inference."""
-        utterance = utterance.strip().replace("<br>", "\n")
+    def infer(content, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length):
+        """调用 OpenAI 接口生成回答，并以流式返回部分结果。"""
         context = state.setdefault("context", [])
-
-        if not utterance:
-            gr.Warning("invalid inputs")
-            # gr.Warning("请输入有效问题")
+        if not content:
+            gr.Warning("Invalid inputs")
             shown_context = get_shown_context(context)
             return None, shown_context, context, state
 
-        data = {
-            "context": utterance,
-            "top_k": top_k,
-            "top_p": top_p,
+        # 构造 OpenAI API 要求的 messages 格式
+        messages = []
+        for turn in context[:-1]:
+            messages.append({"role": turn["role"], "content": remove_think_tags(turn["content"])})
+
+        # 默认模型名称从参数中获取
+        model = getattr(args, "model", default_params.get("model", ""))
+        payload = {
+            "model": model,
+            "messages": messages,
             "temperature": temperature,
             "repetition_penalty": repetition_penalty,
-            "max_length": max_length,
+            "max_tokens": max_tokens,
             "src_length": src_length,
-            "min_length": 1,
+            "top_p": top_p,
+            "top_k": top_k,
+            "stream": True,
         }
-        if len(context) > 2:
-            data["history"] = json.dumps(context[:-2])
-
-        res = requests.post(f"http://0.0.0.0:{args.flask_port}/api/chat", json=data, stream=True)
-        for index, line in enumerate(res.iter_lines()):
-            result = json.loads(line)
-            if result["error_code"] != 0:
-                gr.Warning(result["error_msg"])
-                shown_context = get_shown_context(context)
-                return None, shown_context, context, state
-
-            bot_response = result["result"]["response"]
-
-            # replace \n with br: https://github.com/gradio-app/gradio/issues/4344
-            bot_response["utterance"] = bot_response["utterance"].replace("\n", "<br>")
-
-            if bot_response["utterance"].endswith("[END]"):
-                bot_response["utterance"] = bot_response["utterance"][:-5]
-
-            # the first character of gradio can not be "<br>" or "<br/>"
-            if bot_response["utterance"] in ["<br>", "<br/>"] and index == 0:
-                continue
-
-            context[-1]["utterance"] += bot_response["utterance"]
+        headers = {
+            # "Authorization": "Bearer " + args.api_key,
+            "Content-Type": "application/json"
+        }
+        url = f"http://0.0.0.0:{args.flask_port}/v1/chat/completions"
+        try:
+            res = requests.post(url, json=payload, headers=headers, stream=True)
+        except Exception as e:
+            gr.Warning(f"请求异常: {e}")
             shown_context = get_shown_context(context)
-
             yield None, shown_context, context, state
-
-    def clean_context(context):
-        """Clean context for EB input."""
-        cleaned_context = copy.deepcopy(context)
-        for turn in cleaned_context:
-            if turn["role"] == "bot":
-                bot_resp = turn["utterance"]
-                if bot_resp.startswith("<img src") or bot_resp.startswith("<audio controls>"):
-                    bot_resp = "\n".join(bot_resp.split("\n")[1:])
-                turn["utterance"] = bot_resp
-        return cleaned_context
-
-    def extract_eda(eb_debug_info):
-        """Extract EDA result from EB dispatch info."""
-        eda_res = None
-        for item in eb_debug_info:
-            if item["sys"] == "EDA":
-                eda_output = json.loads(item["output"])
-                eda_res = eda_output["result"]
-                break
-        return eda_res
-
-    def extract_eb_input(eb_debug_info, convert_for_ar=True):
-        """Extract EB raw input from EB dispatch info."""
-        eb_raw_input = None
-        for item in eb_debug_info:
-            if item["sys"] == "EB":
-                eb_output = json.loads(item["output"])
-                eb_raw_input = eb_output["text_after_process"]
-                if convert_for_ar:
-                    eb_raw_input = eb_raw_input.replace("[CLS]", "<cls>").replace("[SEP]", "<sep>")
-                break
-        return eb_raw_input
+            return
+
+        # 流式处理返回结果，实时更新最后一个对话记录（即 bot 回复）
+        for line in res.iter_lines():
+            if line:
+                try:
+                    decoded_line = line.decode("utf-8").strip()
+                    # OpenAI 流返回每行以 "data:" 开头
+                    if decoded_line.startswith("data:"):
+                        data_str = decoded_line[len("data:") :].strip()
+                        if data_str == "[DONE]":
+                            logger.info("Conversation round over.")
+                            break
+                        data_json = json.loads(data_str)
+
+                        # delta 中可能包含部分回复内容
+                        delta = data_json["choices"][0]["delta"].get("content", "")
+                        if delta:
+                            # Reformat <think> tags to show in chatbot
+                            delta = delta.replace("<think>", r"\<think\>")
+                            delta = delta.replace("</think>", r"\<\/think\>")
+                            context[-1]["content"] += delta
+                            shown_context = get_shown_context(context)
+                            yield None, shown_context, context, state
+                    else:
+                        logger.error(f"{decoded_line}")
+                        gr.Warning(f"{decoded_line}")
+
+                except Exception as e:
+                    logger.error(f"解析返回结果异常: {e}")
+                    gr.Warning(f"解析返回结果异常: {e}")
+                    continue
 
     def get_shown_context(context):
-        """Get gradio chatbot."""
+        """将对话上下文转换为 gr.Chatbot 显示格式，每一对 [用户, 助手]"""
         shown_context = []
+        # 每两项组成一对
         for turn_idx in range(0, len(context), 2):
-            shown_context.append([context[turn_idx]["utterance"], context[turn_idx + 1]["utterance"]])
+            user_text = context[turn_idx]["content"]
+            bot_text = context[turn_idx + 1]["content"] if turn_idx + 1 < len(context) else ""
+            shown_context.append([user_text, bot_text])
         return shown_context
 
     with gr.Blocks(title="LLM", theme=gr.themes.Soft()) as block:
@@ -195,7 +220,7 @@ def get_shown_context(context):
                     value=0,
                     step=1,
                     label="Top-k",
-                    info="该参数越大，模型生成结果更加随机，反之生成结果更加确定。",
+                    info="控制采样token个数。(不建议设置)",
                 )
                 top_p = gr.Slider(
                     minimum=0,
@@ -203,7 +228,7 @@ def get_shown_context(context):
                     value=default_params.get("top_p", 0.7),
                     step=0.05,
                     label="Top-p",
-                    info="该参数越大，模型生成结果更加随机，反之生成结果更加确定。",
+                    info="控制采样范围。",
                 )
                 temperature = gr.Slider(
                     minimum=0.05,
@@ -211,7 +236,7 @@ def get_shown_context(context):
                     value=default_params.get("temperature", 0.95),
                     step=0.05,
                     label="Temperature",
-                    info="该参数越小，模型生成结果更加随机，反之生成结果更加确定。",
+                    info="温度，控制生成随机性。",
                 )
                 repetition_penalty = gr.Slider(
                     minimum=0.1,
@@ -219,32 +244,40 @@ def get_shown_context(context):
                     value=default_params.get("repetition_penalty", 1.2),
                     step=0.05,
                     label="Repetition Penalty",
-                    info="该参数越大，生成结果重复的概率越低。设置 1 则不开启。",
+                    info="生成结果重复惩罚。(不建议设置)",
                 )
-                default_src_length = default_params["src_length"]
-                total_length = default_params["src_length"] + default_params["max_length"]
+                default_src_length = default_params.get("src_length", 128)
+                total_length = default_src_length + default_params.get("max_tokens", 50)
                 src_length = create_src_slider(default_src_length, total_length)
-                max_length = create_max_slider(min(total_length - default_src_length, 50), total_length)
+                max_tokens = create_max_slider(max(total_length - default_src_length, 50), total_length)
 
-                def src_length_change_event(src_length_value, max_length_value):
+                def src_length_change_event(src_length_value, max_tokens_value):
                     return create_max_slider(
-                        min(total_length - src_length_value, max_length_value),
+                        min(total_length - src_length_value, max_tokens_value),
                         total_length - src_length_value,
                     )
 
-                def max_length_change_event(src_length_value, max_length_value):
+                def max_tokens_change_event(src_length_value, max_tokens_value):
                     return create_src_slider(
-                        min(total_length - max_length_value, src_length_value),
-                        total_length - max_length_value,
+                        min(total_length - max_tokens_value, src_length_value),
+                        total_length - max_tokens_value,
                     )
 
-                src_length.change(src_length_change_event, inputs=[src_length, max_length], outputs=max_length)
-                max_length.change(max_length_change_event, inputs=[src_length, max_length], outputs=src_length)
-
+                src_length.change(src_length_change_event, inputs=[src_length, max_tokens], outputs=max_tokens)
+                max_tokens.change(max_tokens_change_event, inputs=[src_length, max_tokens], outputs=src_length)
             with gr.Column(scale=4):
                 state = gr.State({})
-                context_chatbot = gr.Chatbot(label="Context")
-                utt_text = gr.Textbox(placeholder="请输入...", label="Utterance")
+                # 这里修改 gr.Chatbot 组件，启用 Markdown 渲染并支持 LaTeX 展示
+                context_chatbot = gr.Chatbot(
+                    label="Context",
+                    render_markdown=True,
+                    latex_delimiters=[
+                        {"left": "$$", "right": "$$", "display": True},
+                        {"left": "\\[", "right": "\\]", "display": True},
+                        {"left": "$", "right": "$", "display": True},
+                    ],
+                )
+                utt_text = gr.Textbox(placeholder="请输入...", label="Content")
                 with gr.Row():
                     clear_btn = gr.Button("清空")
                     rollback_btn = gr.Button("撤回")
@@ -261,7 +294,7 @@ def max_length_change_event(src_length_value, max_length_value):
                 api_name="chat",
             ).then(
                 infer,
-                inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length, src_length],
+                inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length],
                 outputs=[utt_text, context_chatbot, raw_context_json, state],
             )
 
@@ -280,13 +313,13 @@ def max_length_change_event(src_length_value, max_length_value):
             )
             regen_btn.click(
                 regen,
-                inputs=[state, top_k, top_p, temperature, repetition_penalty, max_length, src_length],
+                inputs=[state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length],
                 outputs=[utt_text, context_chatbot, raw_context_json, state],
                 queue=False,
                 api_name="chat",
             ).then(
                 infer,
-                inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length, src_length],
+                inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length],
                 outputs=[utt_text, context_chatbot, raw_context_json, state],
             )
 
@@ -298,7 +331,7 @@ def max_length_change_event(src_length_value, max_length_value):
                 api_name="chat",
             ).then(
                 infer,
-                inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length, src_length],
+                inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length],
                 outputs=[utt_text, context_chatbot, raw_context_json, state],
             )
 
@@ -310,5 +343,12 @@ def main(args, default_params: dict = {}):
 
 
 if __name__ == "__main__":
+    # 可以在 default_params 中设置默认参数，如 src_length, max_tokens, temperature, top_p 等
+    default_params = {
+        "src_length": 1024,
+        "max_tokens": 1024,
+        "temperature": 0.95,
+        "top_p": 0.7,
+    }
     args = setup_args()
-    main(args)
+    main(args, default_params)
diff --git a/llm/predict/predictor.py b/llm/predict/predictor.py
index aa987f4acdf2..8a7711cffd01 100644
--- a/llm/predict/predictor.py
+++ b/llm/predict/predictor.py
@@ -24,7 +24,7 @@
 import numpy as np
 import paddle
 import paddle.incubate.multiprocessing as mp
-from paddle.base.framework import in_cinn_mode, in_pir_executor_mode, use_pir_api
+from paddle.base.framework import in_cinn_mode, in_pir_executor_mode
 from paddle.distributed import fleet
 
 try:
@@ -51,7 +51,12 @@
     PretrainedTokenizer,
 )
 from paddlenlp.trl import llm_utils
-from paddlenlp.utils.env import MAX_BSZ, MAX_DRAFT_TOKENS
+from paddlenlp.utils.env import (
+    MAX_BSZ,
+    MAX_DRAFT_TOKENS,
+    PADDLE_INFERENCE_MODEL_SUFFIX,
+    PADDLE_INFERENCE_WEIGHTS_SUFFIX,
+)
 from paddlenlp.utils.import_utils import is_paddlenlp_ops_available
 from paddlenlp.utils.log import logger
 
@@ -145,7 +150,9 @@ class PredictorArgument:
     )
     speculate_method: str = field(
         default=None,
-        metadata={"help": "speculate method, it should be one of ['None', 'inference_with_reference', 'eagle']"},
+        metadata={
+            "help": "speculate method, it should be one of ['None', 'inference_with_reference', 'eagle', 'mtp']"
+        },
     )
     speculate_max_draft_token_num: int = field(
         default=1,
@@ -217,21 +224,25 @@ def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = N
             self.generation_config = None
 
     def _preprocess(self, source):
+
         if self.tokenizer.chat_template is not None:
-            source = [source] if isinstance(source, str) else source
+            # for str -> List[str] eg. "hello"
+            # for List[str] -> List[str]  eg. ["hello", "hello new"]
+            # for List[List[str]] -> List[List[List[str]]]  eg. 历史对话形式,一轮
+            #             [ [ "Hello, how are you?", "I'm doing great. How can I help you today?"],
+            #                ["I'd like to show off how chat templating works!"], ]
+            # for List[Dict] -> List[List[Dict]]  [{'role': 'user', 'content': 'hello'}, {'role': 'assistant', 'content': 'nice'}]
+            #                                 ->  [[{'role': 'user', 'content': 'hello'}, {'role': 'assistant', 'content': 'nice'}]]
+            if not isinstance(source, list) or not isinstance(source[0], str):
+                source = [source]
             source = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in source]
 
-        return_position_ids = False
-        return_attention_mask = False
-        if len(source) > 1:
-            return_position_ids = True
-            return_attention_mask = True
         tokenized_source = self.tokenizer(
             source,
             max_length=self.config.src_length,
             truncation=True,
-            return_position_ids=True if not isinstance(self.tokenizer, ChatGLMTokenizer) else return_position_ids,
-            return_attention_mask=return_attention_mask,
+            return_position_ids=True if not isinstance(self.tokenizer, ChatGLMTokenizer) else False,
+            return_attention_mask=True,
             truncation_side="left",
             return_tensors=self.return_tensors,
             padding=True,
@@ -486,7 +497,8 @@ def _preprocess(self, source):
         pre_caches_length = 0 if not self.config.export_precache else self.pre_caches[0].shape[-2]
 
         if self.tokenizer.chat_template is not None:
-            source = [source] if isinstance(source, str) else source
+            if not isinstance(source, list) or not isinstance(source[0], str):
+                source = [source]
             source = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in source]
 
         inputs = llm_utils.dybatch_preprocess(
@@ -663,10 +675,11 @@ def _create_predictor(self, predictor_args: PredictorArgument):
         infer_model_path = llm_utils.get_infer_model_path(
             predictor_args.model_name_or_path, predictor_args.model_prefix
         )
-        if use_pir_api():
-            config = paddle.inference.Config(infer_model_path + ".json", infer_model_path + ".pdiparams")
-        else:
-            config = paddle.inference.Config(infer_model_path + ".pdmodel", infer_model_path + ".pdiparams")
+
+        config = paddle.inference.Config(
+            infer_model_path + PADDLE_INFERENCE_MODEL_SUFFIX,
+            infer_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX,
+        )
 
         config.switch_ir_optim(True)
         # remove `gpu_cpu_map_matmul_v2_to_matmul_pass` to avoid mapping matmul_v2 -> matmul op
@@ -929,7 +942,8 @@ def _preprocess(self, input_text: list[str]):
             assert len(input_text) == self.batch_size
 
         if self.tokenizer.chat_template is not None:
-            input_text = [input_text] if isinstance(input_text, str) else input_text
+            if not isinstance(input_text, list) or not isinstance(input_text[0], str):
+                input_text = [input_text]
             input_text = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in input_text]
 
         input_ids = []
@@ -1000,6 +1014,8 @@ def _preprocess(self, input_text: list[str]):
                 self.model_inputs["pre_ids"][bid, 0] = self.model_inputs["input_ids"][bid][
                     self.seq_lens[bid] - 1
                 ]  # get the last token before padding of this batch
+                if self.config.speculate_method == "inference_with_reference":
+                    self.proposer.input_ids_len[bid, 0] = self.seq_lens[bid]
 
         if self.config.mode == "static":
             for k, v in self.model_inputs.items():
@@ -1010,6 +1026,7 @@ class DygraphBlockInferencePredictor(BlockInferencePredictorMixin):
     def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = None, **kwargs):
         model = kwargs.get("model", None)
         self.return_full_hidden_states = config.return_full_hidden_states
+        self.full_hidden_states = None
         if model is None:
             raise ValueError("model should be provided for DygraphBlockInferencePredictor")
         self.cache_kvs_shape = model.get_cache_kvs_shape(model.config, config.batch_size)
@@ -1039,7 +1056,7 @@ def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = N
                 config.batch_size,
                 config.max_length,
             )
-        elif config.speculate_method == "eagle":
+        elif config.speculate_method in ["eagle", "mtp"]:
             self.proposer = EagleProposer(args=config)
         else:
             self.proposer = None
@@ -1053,7 +1070,10 @@ def _infer(self, inputs: dict[str, paddle.Tensor]):
     @paddle.no_grad()
     def predict(self, input_texts: list[str], return_tokens=False):
         self._preprocess(input_texts)
-
+        if self.proposer is not None:
+            self.proposer.insert_query(
+                base_model_inputs=self.model_inputs, real_bs=len(input_texts), seq_lens=self.seq_lens
+            )
         result_queue = mp.Queue()
         tensor_queue = mp.Queue()
         done_event = mp.Event()
@@ -1085,12 +1105,16 @@ def predict(self, input_texts: list[str], return_tokens=False):
                     self.model_inputs,
                     real_batch_size=self.batch_size,
                     seq_lens_this_time=self.model_inputs["seq_lens_this_time"],
+                    base_model_full_hidden_states=self.full_hidden_states,
                 )
             if self.return_full_hidden_states:
                 self.full_hidden_states = self._infer(self.model_inputs)
             else:
                 self._infer(self.model_inputs)
-        logger.info(f"running spend {time.time()  -  s_time}")
+        logger.info(f"running spend {time.time() - s_time}")
+
+        if self.proposer is not None:
+            self.proposer.postprocess(base_model_inputs=self.model_inputs)
 
         if self.tensor_parallel_rank == 0:
             outputs = []
@@ -1155,7 +1179,7 @@ def __init__(
                 config.batch_size,
                 config.max_length,
             )
-        elif config.speculate_method == "eagle":
+        elif config.speculate_method in ["eagle", "mtp"]:
             self.proposer = EagleProposer(
                 args=config,
                 model_args=self.model_args,
@@ -1174,10 +1198,10 @@ def _create_predictor(self, predictor_args: PredictorArgument):
             predictor_args.model_name_or_path, predictor_args.model_prefix
         )
 
-        if use_pir_api():
-            config = paddle.inference.Config(infer_model_path + ".json", infer_model_path + ".pdiparams")
-        else:
-            config = paddle.inference.Config(infer_model_path + ".pdmodel", infer_model_path + ".pdiparams")
+        config = paddle.inference.Config(
+            infer_model_path + PADDLE_INFERENCE_MODEL_SUFFIX,
+            infer_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX,
+        )
 
         config.switch_ir_optim(False)
         if predictor_args.device in paddle.device.get_all_custom_device_type():
@@ -1214,7 +1238,7 @@ def predict(self, input_texts: list[str], return_tokens=False):
             self.proposer.insert_query(
                 base_model_inputs=self.model_inputs, real_bs=len(input_texts), seq_lens=self.seq_lens
             )
-        logger.info(f"preprocess spend {time.time()  -  s_time}")
+        logger.info(f"preprocess spend {time.time() - s_time}")
 
         result_queue = mp.Queue()
         tensor_queue = mp.Queue()
@@ -1253,7 +1277,7 @@ def predict(self, input_texts: list[str], return_tokens=False):
                 self.full_hidden_states = self.predictor.run(list(self.model_inputs.values()))[0]
             else:
                 self.predictor.run(list(self.model_inputs.values()))
-        logger.info(f"running spend {time.time()  -  s_time}")
+        logger.info(f"running spend {time.time() - s_time}")
 
         if self.proposer is not None:
             self.proposer.postprocess(base_model_inputs=self.model_inputs)
@@ -1287,7 +1311,7 @@ def create_predictor(
         config: PretrainedConfig,
         model_args: ModelArgument,
         tokenizer: PretrainedTokenizer = None,
-        **kwargs
+        **kwargs,
     ):
         """
         Create a predictor
@@ -1341,7 +1365,16 @@ def create_predictor(
     predictor_args: PredictorArgument,
     model_args: ModelArgument,
 ):
-    tokenizer = AutoTokenizer.from_pretrained(predictor_args.model_name_or_path, padding_side="left")
+
+    paddle.set_device(predictor_args.device)
+    paddle.set_default_dtype(predictor_args.dtype)
+
+    from paddlenlp.utils.env import USE_FAST_TOKENIZER
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        predictor_args.model_name_or_path, padding_side="left", use_fast=USE_FAST_TOKENIZER
+    )
+
     # init chat_template for tokenizer
     llm_utils.init_chat_template(tokenizer, predictor_args.model_name_or_path, predictor_args.chat_template)
 
@@ -1431,9 +1464,6 @@ def predict():
     parser = PdArgumentParser((PredictorArgument, ModelArgument))
     predictor_args, model_args = parser.parse_args_into_dataclasses()
 
-    paddle.set_device(predictor_args.device)
-    paddle.set_default_dtype(predictor_args.dtype)
-
     tensor_parallel_degree = paddle.distributed.get_world_size()
     if tensor_parallel_degree > 1:
         strategy = fleet.DistributedStrategy()
@@ -1466,7 +1496,9 @@ def predict():
                     target_texts.append("")
 
     else:
-        source_texts = ["解释一下温故而知新"] * predictor_args.batch_size
+        source_texts = [
+            "2014年3月，大范围雾霾天气长时间影响我国东部地区，严重危害人体健康。造成雾霾天气的人为原因有____\r\n①工业生产中使用矿物作为燃料，大量排放污染物     ②汽车尾气的大量排放     \r\n③风力小，空气流动不畅     ④冬季取暖排放粉尘\nA. ①②③\nB. ②③④\nC. ①③④\nD. ①②④"
+        ] * predictor_args.batch_size
         target_texts = [""] * predictor_args.batch_size
 
     batch_source_texts = batchfy_text(source_texts, predictor_args.batch_size)
diff --git a/llm/predict/request_flask_server.py b/llm/predict/request_flask_server.py
index b7cef31eda9a..11b3e88404f1 100644
--- a/llm/predict/request_flask_server.py
+++ b/llm/predict/request_flask_server.py
@@ -17,38 +17,99 @@
 import requests
 
 
-def send_request(query, history=None):
-    data = {
-        "context": query,
-        "history": history,
-        "top_k": 0,
-        "top_p": 0.7,  # 0.0 为 greedy_search
-        "temperature": 0.95,
-        "repetition_penalty": 1.3,
-        "max_length": 100,
-        "src_length": 100,
+def build_messages(query, history=None):
+    """
+    根据传入的 query 和 history 构造符合 OpenAI 格式的消息列表。
+    如果 history 为 list 且每项为 dict，则直接使用；如果为 list 且每项为字符串，
+    则依次按用户（user）与助手（assistant）交替添加；否则直接只添加当前用户消息。
+    """
+    messages = []
+    if history:
+        if isinstance(history, list):
+            if all(isinstance(item, dict) for item in history):
+                messages.extend(history)
+            else:
+                # 假设 history 按顺序依次为用户、助手、用户、助手……
+                for idx, item in enumerate(history):
+                    role = "user" if idx % 2 == 0 else "assistant"
+                    messages.append({"role": role, "content": str(item)})
+        else:
+            messages.append({"role": "user", "content": str(history)})
+    # 当前请求作为最新的用户消息
+    messages.append({"role": "user", "content": query})
+    return messages
+
+
+def send_request(query, history=None, stream=True):
+    # 构造 OpenAI 格式的请求体
+    payload = {
+        "messages": build_messages(query, history),
+        # 以下生成参数可根据需要调整
+        # "top_k": 0,
+        # "top_p": 0.7,
+        # "temperature": 0.8,
+        # "repetition_penalty": 1.3,
+        "max_length": 1024,
+        "src_length": 1024,
         "min_length": 1,
+        "stream": stream,
     }
-    res = requests.post("http://127.0.0.1:8010/api/chat", json=data, stream=True)
-    text = ""
+    res = requests.post("http://localhost:8011/v1/chat/completions", json=payload, stream=True)
+    result_text = ""
+    printed_reasoning_content = False
+    printed_content = False
     for line in res.iter_lines():
-        result = json.loads(line)
-
-        if result["error_code"] != 0:
-            text = "error-response"
-            break
+        # https://github.com/vllm-project/vllm/blob/433c4a49230a470f13657f06e7612cde86e4fb40/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py#L67-L69
+        if not line:
+            continue
 
-        result = json.loads(line)
-        bot_response = result["result"]["response"]
+        decoded_line = line.decode("utf-8").strip()
+        # OpenAI 流返回每行以 "data:" 开头
+        if decoded_line.startswith("data:"):
+            data = decoded_line[5:].strip()  # Remove "data:" prefix
+            if data == "[DONE]":  # End of stream
+                print("\nclient: Stream completed.\n")
+                break
+            try:
+                # Parse the JSON data
+                chunk = json.loads(data)
+                reasoning_content = chunk["choices"][0]["delta"].get("reasoning_content", "")
+                content = chunk["choices"][0]["delta"].get("content", "")
 
-        if bot_response["utterance"].endswith("[END]"):
-            bot_response["utterance"] = bot_response["utterance"][:-5]
-        text += bot_response["utterance"]
+                if reasoning_content:
+                    if not printed_reasoning_content:
+                        printed_reasoning_content = True
+                        print("reasoning_content:", end="", flush=True)
+                    print(reasoning_content, end="", flush=True)
+                elif content:
+                    if not printed_content:
+                        printed_content = True
+                        print("\ncontent:", end="", flush=True)
+                    # Extract and print the content
+                    print(content, end="", flush=True)
+                    result_text += content
+            except Exception as e:
+                print("解析响应出错:", e)
+                continue
+        else:
+            try:
+                data = json.loads(decoded_line)
+                content = data["choices"][0]["message"].get("content", "")
+                print(content, end="", flush=True)
+                result_text += content
+            except Exception as e:
+                print("解析响应出错:", e)
+                continue
 
-    print("result -> ", text)
-    return text
+    print()
+    return result_text
 
 
-send_request("你好啊")
-send_request("再加一等于多少", ["一加一等于多少", "一加一等于二"])
-send_request("再加一等于多少", [{"utterance": "一加一等于多少"}, {"utterance": "一加一等于二"}])
+if __name__ == "__main__":
+    # 示例调用：仅发送当前用户消息
+    send_request("你好啊")
+    send_request("你好啊", stream=False)
+    # 示例调用：使用 history 为字符串列表（交替为用户与助手的对话）
+    send_request("再加一等于多少", ["一加一等于多少", "一加一等于二"])
+    # 示例调用：history 为字典格式，明确指定对话角色
+    send_request("再加一等于多少", [{"role": "user", "content": "一加一等于多少"}, {"role": "assistant", "content": "一加一等于二"}])
diff --git a/llm/run_finetune.py b/llm/run_finetune.py
index 9df35e4860bb..616eb8a8e17a 100644
--- a/llm/run_finetune.py
+++ b/llm/run_finetune.py
@@ -52,12 +52,18 @@
     AutoModelForCausalLM,
     AutoModelForCausalLMPipe,
     AutoTokenizer,
+    DeepseekV2ForCausalLM,
+    DeepseekV2ForCausalLMPipe,
+    DeepseekV3ForCausalLM,
+    DeepseekV3ForCausalLMPipe,
     Llama3Tokenizer,
     LlamaForCausalLM,
     LlamaForCausalLMPipe,
     LlamaTokenizer,
     Qwen2ForCausalLM,
     Qwen2ForCausalLMPipe,
+    Qwen2MoeForCausalLM,
+    Qwen2MoeForCausalLMPipe,
 )
 from paddlenlp.transformers.configuration_utils import LlmMetaConfig
 from paddlenlp.trl import DataConfig, ModelConfig, SFTConfig, SFTTrainer
@@ -74,7 +80,18 @@
 # Fine-tune Environment Variables to support sharding stage1 overlap optimization.
 os.environ["USE_CASUAL_MASK"] = "False"
 
-flash_mask_support_list = [LlamaForCausalLM, LlamaForCausalLMPipe, Qwen2ForCausalLM, Qwen2ForCausalLMPipe]
+flash_mask_support_list = [
+    DeepseekV2ForCausalLM,
+    DeepseekV2ForCausalLMPipe,
+    DeepseekV3ForCausalLM,
+    DeepseekV3ForCausalLMPipe,
+    LlamaForCausalLM,
+    LlamaForCausalLMPipe,
+    Qwen2ForCausalLM,
+    Qwen2ForCausalLMPipe,
+    Qwen2MoeForCausalLM,
+    Qwen2MoeForCausalLMPipe,
+]
 
 
 def paddlenlp_verison_check():
@@ -151,7 +168,11 @@ def main():
         quantization_config=quantization_config,
     )
 
-    if "Qwen2Moe" in str(model_config.architectures) and training_args.data_parallel_degree > 1:
+    architectures_to_check = {"Qwen2Moe", "DeepseekV2", "DeepseekV3"}
+    if (
+        any(architecture in str(model_config.architectures) for architecture in architectures_to_check)
+        and training_args.data_parallel_degree > 1
+    ):
         training_args.use_expert_parallel = True
 
     LlmMetaConfig.set_llm_config(model_config, training_args)
@@ -190,6 +211,8 @@ def main():
 
     logger.info(f"Final model config: {model_config}")
 
+    logger.info("Creating model")
+
     model_class = AutoModelForCausalLM
     if training_args.pipeline_parallel_degree > 1:
         if data_args.eval_with_do_generation and training_args.do_eval:
@@ -257,7 +280,6 @@ def neft_post_hook(module, input, output):
         tokenizer.pad_token_id = tokenizer.eos_token_id
 
     train_ds, dev_ds, test_ds = create_dataset(data_args, training_args)
-
     # TODO(ZHUI & sijunhe): Temporary implementation. Generalize this logic and move to Trainer later.
     if training_args.resume_from_checkpoint is not None and data_args.lazy:
         logger.info(
@@ -301,6 +323,7 @@ def neft_post_hook(module, input, output):
         )
         eval_zero_padding = False
 
+    logger.info("Trans the dataset text into token ids, please wait for a moment.")
     train_ds, dev_ds, test_ds = trans_dataset_to_ids(
         train_ds, dev_ds, test_ds, model_args, data_args, trans_func, eval_zero_padding
     )
@@ -585,7 +608,12 @@ def create_peft_model(model_args, reft_args, training_args, dtype, model_config,
 def trans_dataset_to_ids(train_ds, dev_ds, test_ds, model_args, data_args, trans_func, eval_zero_padding):
     if train_ds is not None:
         train_ds = train_ds.map(
-            partial(trans_func, is_test=False, zero_padding=data_args.zero_padding, flash_mask=model_args.flash_mask)
+            partial(
+                trans_func,
+                is_test=False,
+                zero_padding=data_args.zero_padding,
+                flash_mask=model_args.flash_mask,
+            )
         )
     if dev_ds is not None:
         dev_ds = dev_ds.map(
@@ -612,18 +640,21 @@ def create_dataset(data_args, training_args):
     if os.path.exists(os.path.join(data_args.dataset_name_or_path, "train.json")) or os.path.exists(
         os.path.join(data_args.dataset_name_or_path, "dev.json")
     ):
+        logger.info("load train")
         if training_args.do_train:
             train_ds = load_dataset(
                 "json",
                 data_files=os.path.join(data_args.dataset_name_or_path, "train.json"),
                 lazy=data_args.lazy,
             )[0]
+        logger.info("load eval")
         if training_args.do_eval:
             dev_ds = load_dataset(
                 "json",
                 data_files=os.path.join(data_args.dataset_name_or_path, "dev.json"),
                 lazy=data_args.lazy,
             )[0]
+        logger.info("load test")
         if training_args.do_predict:
             test_ds = load_dataset(
                 "json",
diff --git a/llm/run_pretrain.py b/llm/run_pretrain.py
index a487c8601289..25be3832156e 100644
--- a/llm/run_pretrain.py
+++ b/llm/run_pretrain.py
@@ -478,7 +478,11 @@ def main():
             except:
                 print("Not register llama pp reshard information.")
 
-    if "Qwen2Moe" in str(config.architectures) and training_args.data_parallel_degree > 1:
+    architectures_to_check = {"Qwen2Moe", "DeepseekV2", "DeepseekV3"}
+    if (
+        any(architecture in str(config.architectures) for architecture in architectures_to_check)
+        and training_args.data_parallel_degree > 1
+    ):
         training_args.use_expert_parallel = True
 
     if model_args.continue_training:
diff --git a/llm/server/README.md b/llm/server/README.md
index b521644e5769..535b40497d05 100644
--- a/llm/server/README.md
+++ b/llm/server/README.md
@@ -1,11 +1,10 @@
+# 大模型服务化部署-快速开始教程
 
-<h1 align="center"><b><em>大模型服务化部署</em></b></h1>
+*该部署工具是基于英伟达 Triton 框架专为服务器场景的大模型服务化部署而设计。它提供了支持 gRPC、HTTP 协议的服务接口，以及流式 Token 输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化（PTQ）等加速优化策略，为用户带来易用且高性能的部署体验。*
 
-*该部署工具是基于英伟达Triton框架专为服务器场景的大模型服务化部署而设计。它提供了支持gRPC、HTTP协议的服务接口，以及流式Token输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化（PTQ）等加速优化策略，为用户带来易用且高性能的部署体验。*
+## 快速开始
 
-# 快速开始
-
-  基于预编译镜像部署，本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例，更多模型请参考[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md), 更细致的模型推理、量化教程可以参考[大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md)：
+  基于预编译镜像部署，**使用飞桨静态图模型部署**。本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例。其他模型需按照要求导出为**静态图模型格式**。更多模型请参考[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md), 更细致的模型推理、量化教程可以参考[大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md)：
 
   ```
     # 下载模型
@@ -34,6 +33,6 @@ Note:
 
 更多关于该部署工具的使用方法，请查看[服务化部署流程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/server/docs/deploy_usage_tutorial.md)
 
-# License
+## License
 
 遵循 [Apache-2.0开源协议](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/LICENSE) 。
diff --git a/llm/server/docs/deploy_usage_tutorial.md b/llm/server/docs/deploy_usage_tutorial.md
index d5771bd0efbf..5ad704f22d4f 100644
--- a/llm/server/docs/deploy_usage_tutorial.md
+++ b/llm/server/docs/deploy_usage_tutorial.md
@@ -1,6 +1,8 @@
+# 静态图高性能部署全流程
 
 ## 目录
 
+- [快速开始](#快速开始)
 - [部署环境准备](#部署环境准备)
   - [基础环境](#基础环境)
   - [准备部署镜像](#准备部署镜像)
@@ -12,13 +14,46 @@
   - [服务状态查询](#服务状态查询)
 - [服务测试](#服务测试)
   - [Python 客户端](#Python-客户端)
-  - [HTTP调用](#HTTP调用)
+  - [HTTP 调用](#HTTP-调用)
   - [OpenAI 客户端](#OpenAI-客户端)
   - [返回示例](#返回示例)
-- [基于dockerfile创建自己的镜像](#基于dockerfile创建自己的镜像)
+- [基于 dockerfile 创建自己的镜像](#基于 dockerfile 创建自己的镜像)
 - [模型配置参数介绍](#模型配置参数介绍)
 - [请求参数介绍](#请求参数介绍)
 
+
+
+*该部署工具是基于英伟达 Triton 框架专为服务器场景的大模型服务化部署而设计。它提供了支持 gRPC、HTTP 协议的服务接口，以及流式 Token 输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化（PTQ）等加速优化策略，为用户带来易用且高性能的部署体验。*
+
+## 快速开始
+
+基于预编译镜像部署，**使用飞桨静态图模型部署**。本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例。其他模型需按照要求导出为**静态图模型格式**。
+具体流程如下，仅供示例参考，用户需要根据自己的需求导出所需**静态图模型**，然后开始部署流程。
+
+```shell
+  # 下载模型
+  wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar
+  mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8
+
+  # 挂载模型文件
+  export MODEL_PATH=${PWD}/Llama-3-8B-A8W8C8
+
+  docker run --gpus all --shm-size 5G --network=host --privileged --cap-add=SYS_PTRACE \
+  -v ${MODEL_PATH}:/models/ \
+  -dit registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.2 \
+  bash -c 'export USE_CACHE_KV_INT8=1 && cd /opt/output/Serving && bash start_server.sh; exec bash'
+```
+等待服务启动成功（服务初次启动大概需要40s），可以通过以下命令测试：
+
+```shell
+  curl 127.0.0.1:9965/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"text": "hello, llm"}'
+```
+
+Note:
+1. 请保证 shm-size >= 5，不然可能会导致服务启动失败
+
 ## 部署环境准备
 
 ### 基础环境
@@ -34,7 +69,7 @@
 
 ### 准备部署镜像
 
-为了方便部署，我们提供了 cuda12.3 的镜像，可以直接拉取镜像，或者使用我们提供的 `Dockerfile` [构建自定义镜像](#基于dockerfile创建自己的镜像)
+为了方便部署，我们提供了 cuda12.3 的镜像，可以直接拉取镜像，或者使用我们提供的 `Dockerfile` [构建自定义镜像](#基于 dockerfile 创建自己的镜像)
 ```
 docker pull registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.2
 ```
@@ -43,6 +78,13 @@ docker pull registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cu
 
 该部署工具为 PaddleNLP 静态图模型提供了高效的部署方案，模型静态图导出方案请参考：[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md) ...
 
+或者下载样例模型:
+```shell
+# 下载模型
+wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar
+mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8
+```
+
 导出后的模型放在任意文件夹下，以 `/home/workspace/models_dir` 为例
 
 ```
@@ -57,7 +99,7 @@ cd /home/workspace/models_dir
 # ├── rank_mapping.csv           # 多卡模型会有此文件，如为单卡模型，则无此文件（可选，仅在多卡部署模式下需要）
 # └── rank_0                     # 保存模型结构和权重文件的目录
 #     ├── model.pdiparams
-#     └── model.pdmodel
+#     └── model.pdmodel 或者 model.json # Paddle 3.0 版本模型为model.json，Paddle 2.x 版本模型为model.pdmodel
 ```
 
 ### 创建容器
@@ -87,7 +129,7 @@ ls /models/
 
 根据需求和硬件信息，配置以下环境变量
 
-```
+```shell
 # 单/多卡推理配置。自行修改。
 ## 如果是单卡推理，使用0卡，设置如下环境变量。
 export MP_NUM=1
@@ -128,7 +170,7 @@ export PUSH_MODE_HTTP_WORKERS="1" # HTTP服务进程数，在 PUSH_MODE_HTTP_POR
 
 ### 启动服务
 
-```
+```shell
 cd /opt/output/Serving
 bash start_server.sh
 
@@ -149,11 +191,11 @@ health接口：（模型是否准备好推理）
 
 ## 服务测试
 
-### HTTP调用
+### HTTP 调用
 
-提示：HTTP调用接口使用变量 PUSH_MODE_HTTP_PORT 配置！HTTP_PORT 仅用于探活接口使用！
+提示：HTTP 调用接口使用变量 PUSH_MODE_HTTP_PORT 配置！HTTP_PORT 仅用于探活接口使用！
 
-```
+```python
 import uuid
 import json
 import requests
@@ -193,7 +235,7 @@ for line in res.iter_lines():
 
 ### 返回示例
 
-```
+```python
 如果stream为True，流式返回
     如果正常，返回{'token': xxx, 'is_end': xxx, 'send_idx': xxx, ..., 'error_msg': '', 'error_code': 0}
     如果异常，返回{'error_msg': xxx, 'error_code': xxx}，error_msg字段不为空，error_code字段不为0
@@ -209,7 +251,7 @@ for line in res.iter_lines():
 
 提示：使用 OpenAI 客户端需要配置 `PUSH_MODE_HTTP_PORT`！
 
-```
+```python
 import openai
 
 push_mode_http_port = "9965"    # 服务配置的PUSH_MODE_HTTP_PORT
@@ -217,8 +259,8 @@ client = openai.Client(base_url=f"http://127.0.0.1:{push_mode_http_port}/v1/chat
 
 # 非流式返回
 response = client.completions.create(
-	model="default",
-	prompt="Hello, how are you?",
+    model="default",
+    prompt="Hello, how are you?",
   max_tokens=50,
   stream=False,
 )
@@ -228,8 +270,8 @@ print("\n")
 
 # 流式返回
 response = client.completions.create(
-	model="default",
-	prompt="Hello, how are you?",
+    model="default",
+    prompt="Hello, how are you?",
   max_tokens=100,
   stream=True,
 )
@@ -275,10 +317,10 @@ for chunk in response:
 print("\n")
 ```
 
-## 基于dockerfile创建自己的镜像
+## 基于 dockerfile 创建自己的镜像
 
-为了方便用户构建自定义服务，我们提供了基于dockerfile创建自己的镜像的脚本。
-```
+为了方便用户构建自定义服务，我们提供了基于 dockerfile 创建自己的镜像的脚本。
+```shell
 git clone https://github.com/PaddlePaddle/PaddleNLP.git
 cd PaddleNLP/llm/server
 
@@ -292,35 +334,35 @@ docker build --network=host -f ./dockerfiles/Dockerfile_serving_cuda123_cudnn9 -
 | :---: | :-----: | :---: | :---: | :-----: | :----: |
 | MP_NUM |  int  | 模型并行度 | 否 | 8 | CUDA_VISIBLE_DEVICES 需配置对应卡数 |
 | CUDA_VISIBLE_DEVICES | str | 使用 GPU 编号 | 否 | 0,1,2,3,4,5,6,7 |  |
-| HTTP_PORT | int | 探活服务的http端口 | 是 | 无 | 当前仅用于健康检查、探活 |
-| GRPC_PORT | int | 模型推服务的grpc端口 | 是 | 无 |   |
+| HTTP_PORT | int | 探活服务的 http 端口 | 是 | 无 | 当前仅用于健康检查、探活 |
+| GRPC_PORT | int | 模型推服务的 grpc 端口 | 是 | 无 |   |
 | METRICS_PORT | int | 模型服务中监督指标的端口 | 是 | 无 |   |
 | INFER_QUEUE_PORT | int | 模型服务内部使用的端口 | 否 | 56666 |   |
-| PUSH_MODE_HTTP_PORT | int | 服务请求HTTP端口号 | 否 | -1 | 如不配置，服务只支持GRPC协议 |
+| PUSH_MODE_HTTP_PORT | int | 服务请求 HTTP 端口号 | 否 | -1 | 如不配置，服务只支持 GRPC 协议 |
 | DISABLE_STREAMING | int | 是否使用流式返回 | 否 | 0 |  |
-| MAX_SEQ_LEN | int | 最大输入序列长度 | 否 | 8192 | 服务会拒绝input token数量超过MAX_SEQ_LEN的请求，并返回错误提示 |
-| MAX_DEC_LEN | int | 最大decoer序列长度 | 否 | 1024 | 服务会拒绝请求中max_dec_len/min_dec_len超过此参数的请求，并返回错误提示 |
-| BATCH_SIZE | int | 最大Batch Size | 否 | 50 | 模型可同时并发处理的最大输入数量，不能高于128 |
-| BLOCK_BS | int | 缓存Block支持的最大Query Batch Size | 否 | 50 | 如果出现out of memeory 错误，尝试减少该数值 |
-| BLOCK_RATIO | float |  | 否 | 0.75 | 建议配置 输入平均Token数/（输入+输出平均Token数) |
+| MAX_SEQ_LEN | int | 最大输入序列长度 | 否 | 8192 | 服务会拒绝 input token 数量超过 MAX_SEQ_LEN 的请求，并返回错误提示 |
+| MAX_DEC_LEN | int | 最大 decoer 序列长度 | 否 | 1024 | 服务会拒绝请求中 max_dec_len/min_dec_len 超过此参数的请求，并返回错误提示 |
+| BATCH_SIZE | int | 最大 Batch Size | 否 | 50 | 模型可同时并发处理的最大输入数量，不能高于128 |
+| BLOCK_BS | int | 缓存 Block 支持的最大 Query Batch Size | 否 | 50 | 如果出现 out of memeory 错误，尝试减少该数值 |
+| BLOCK_RATIO | float |  | 否 | 0.75 | 建议配置 输入平均 Token 数/（输入+输出平均 Token 数) |
 | MAX_CACHED_TASK_NUM | int | 服务缓存队列最大长度 | 否 | 128 | 队列达到上限后，会拒绝新的请求 |
-| PUSH_MODE_HTTP_WORKERS | int | HTTP服务进程数 | 否 | 1 | 在 PUSH_MODE_HTTP_PORT 配置的情况下有效，高并发下提高该数值，建议最高配置为8 |
+| PUSH_MODE_HTTP_WORKERS | int | HTTP 服务进程数 | 否 | 1 | 在 PUSH_MODE_HTTP_PORT 配置的情况下有效，高并发下提高该数值，建议最高配置为8 |
 | USE_WARMUP | int | 是否进行 warmup | 否 | 0 |  |
-| USE_HF_TOKENIZER | int | 是否进行使用huggingface的词表 | 否 | 0 |   |
-| USE_CACHE_KV_INT8 | int | 是否将INT8配置为KV Cache的类型 | 否 | 0 | c8量化模型需要配置为1 |
+| USE_HF_TOKENIZER | int | 是否进行使用 huggingface 的词表 | 否 | 0 |   |
+| USE_CACHE_KV_INT8 | int | 是否将 INT8配置为 KV Cache 的类型 | 否 | 0 | c8量化模型需要配置为1 |
 | MODEL_DIR | str | 模型文件路径 | 否 | /models/ |  |
-| FD_MODEL_CONFIG_PATH | str | 模型config文件路径 | 否 | ${model_dir}/config.json |  |
+| FD_MODEL_CONFIG_PATH | str | 模型 config 文件路径 | 否 | ${model_dir}/config.json |  |
 | DISTRIBUTED_CONFIG | str | 模型分布式配置文件路径 | 否 | ${model_dir}/rank_mapping.csv |  |
 
 ## 请求参数介绍
 
 | 字段名 | 字段类型 | 说明 | 是否必填 | 默认值 | 备注 |
 | :---: | :-----: | :---: | :---: | :-----: | :----: |
-| req_id |  str  | 请求ID，用于标识一个请求。建议设置req_id，保证其唯一性   | 否 | 随机id | 如果推理服务中同时有两个相同req_id的请求，会返回req_id重复的错误信息 |
+| req_id |  str  | 请求 ID，用于标识一个请求。建议设置 req_id，保证其唯一性   | 否 | 随机 id | 如果推理服务中同时有两个相同 req_id 的请求，会返回 req_id 重复的错误信息 |
 | text   | str  | 请求的文本 | 否 | 无 | text 和 messages 必须有一个 |
-| messages | str | 多轮对话文本 | 否 | 无 | 多轮对话以list方式存储 |
-| max_dec_len | int  | 最大生成token的长度，如果请求的文本token长度加上max_dec_len大于模型的max_seq_len，会返回长度超限的错误信息 | 否 | max_seq_len减去文本token长度 |  |
-| min_dec_len | int | 最小生成token的长度，最小是1 | 否 | 1 |  |
+| messages | str | 多轮对话文本 | 否 | 无 | 多轮对话以 list 方式存储 |
+| max_dec_len | int  | 最大生成 token 的长度，如果请求的文本 token 长度加上 max_dec_len 大于模型的 max_seq_len，会返回长度超限的错误信息 | 否 | max_seq_len 减去文本 token 长度 |  |
+| min_dec_len | int | 最小生成 token 的长度，最小是1 | 否 | 1 |  |
 | topp | float | 控制随机性参数，数值越大则随机性越大，范围是0~1 | 否 | 0.7 |  |
 | temperature | float | 控制随机性参数，数值越小随机性越大，需要大于 0 | 否 | 0.95 |  |
 | frequency_score | float | 频率分数 | 否 | 0 |  |
@@ -330,5 +372,5 @@ docker build --network=host -f ./dockerfiles/Dockerfile_serving_cuda123_cudnn9 -
 | timeout | int | 请求等待的超时时间，单位是秒 | 否 | 300 |  |
 | return_usage | bool | 是否返回输入、输出 token 数量 | 否 | False |  |
 
-* 在正确配置PUSH_MODE_HTTP_PORT字段下，服务支持 GRPC 和 HTTP 两种请求服务
+* 在正确配置 PUSH_MODE_HTTP_PORT 字段下，服务支持 GRPC 和 HTTP 两种请求服务
   * stream 参数仅对 HTTP 请求生效
diff --git a/llm/server/server/scripts/start_server.sh b/llm/server/server/scripts/start_server.sh
index e7975b3e838f..d4956d3065ac 100644
--- a/llm/server/server/scripts/start_server.sh
+++ b/llm/server/server/scripts/start_server.sh
@@ -6,14 +6,12 @@ export PYTHONIOENCODING=utf8
 export LC_ALL=C.UTF-8
 
 # PaddlePaddle environment variables
-export FLAGS_allocator_strategy=auto_growth
-export FLAGS_dynamic_static_unified_comm=0
-export FLAGS_use_xqa_optim=1
 export FLAGS_gemm_use_half_precision_compute_type=0
 export NVIDIA_TF32_OVERRIDE=0
 
 # Model hyperparameters
-export MP_NUM=${MP_NUM:-"1"}                                # Number of GPUs
+export MP_NUM=${MP_NUM:-"1"}                                # number of model parallelism
+export MP_NNODES=${MP_NNODES:-"1"}                            # number of nodes
 export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-"0"}    # GPU ids
 export MAX_SEQ_LEN=${MAX_SEQ_LEN:-"8192"}
 export MAX_DEC_LEN=${MAX_DEC_LEN:-"2048"}
@@ -43,7 +41,26 @@ mkdir -p log
 rm -rf console.log log/*
 rm -rf /dev/shm/*
 
-echo "start serving ..."
+FED_POD_IP=$(hostname -i)
+if [ "$MP_NNODE" -gt 1 ]; then
+    POD_0_IP=$POD_0_IP
+    HOST_IP=$FED_POD_IP
+else
+    POD_0_IP="127.0.0.1"
+    HOST_IP="127.0.0.1"
+fi
+
+echo "POD_0_IP: $POD_0_IP HOST_IP: $HOST_IP"
+
+if [ "$POD_0_IP" == "$HOST_IP" ]; then
+    echo "Master node, start serving ..."
+else
+    echo "Slave node, start push mode"
+    # waiting for master node to start serving ...
+    sleep ${SERVER_WAITTING_TIME:-"25"}
+fi
+
+
 
 tritonserver --exit-timeout-secs 100 --cuda-memory-pool-byte-size 0:0 --cuda-memory-pool-byte-size 1:0 \
                  --cuda-memory-pool-byte-size 2:0 --cuda-memory-pool-byte-size 3:0 --cuda-memory-pool-byte-size 4:0 \
diff --git a/llm/server/server/server/data/processor.py b/llm/server/server/server/data/processor.py
index b27ae1bf2ca0..6e7873d98bd8 100644
--- a/llm/server/server/server/data/processor.py
+++ b/llm/server/server/server/data/processor.py
@@ -19,6 +19,7 @@
 from paddlenlp.trl.llm_utils import get_eos_token_id
 from server.engine.config import Config
 from server.utils import data_processor_logger
+from paddlenlp.utils.env import USE_FAST_TOKENIZER
 
 
 class BaseDataProcessor(ABC):
@@ -121,7 +122,8 @@ class DataProcessor(BaseDataProcessor):
     def __init__(self):
         self.config = Config()
         max_length = self.config.get_model_config().get('max_length', 1024)
-        self.src_length = max_length - self.config.seq_len_limit
+        self.src_length = self.config.seq_len_limit - max_length
+
 
         self.decode_status = dict()
         self.tokenizer = self._load_tokenizer()
@@ -288,7 +290,7 @@ def _load_tokenizer(self):
             return AutoTokenizer.from_pretrained(self.config.model_dir, use_fast=False)
         else:
             from paddlenlp.transformers import AutoTokenizer
-            return AutoTokenizer.from_pretrained(self.config.model_dir)
+            return AutoTokenizer.from_pretrained(self.config.model_dir, use_fast=USE_FAST_TOKENIZER)
 
     def clear_request_status(self, task_id):
         """
diff --git a/llm/server/server/server/engine/config.py b/llm/server/server/server/engine/config.py
index 3b9a88f0c94b..fe25da48fb3d 100644
--- a/llm/server/server/server/engine/config.py
+++ b/llm/server/server/server/engine/config.py
@@ -59,6 +59,12 @@ def read_from_env(self):
         else:
             raise Exception(f"unsupported device type: {self.device}")
 
+        # multi-node config
+        self.nnode = int(env.get("MP_NNODE", "1"))
+        assert self.mp_num % self.nnode == 0 ,f"mp_num: {self.mp_num} should be divisible by nnode: {self.nnode}"
+        self.mp_num_per_node = self.mp_num // self.nnode
+        self.host_ip = os.getenv("HOST_IP", "127.0.0.1")
+        
         # Triton config
         self.max_prefill_batch = int(os.getenv("MAX_PREFILL_BATCH", 1))
         if self.max_prefill_batch <= 0:
@@ -93,6 +99,7 @@ def read_from_env(self):
         self.use_cache_kv_int8 = int(os.getenv("USE_CACHE_KV_INT8", 0))
         self.use_cache_kv_int4 = int(os.getenv("USE_CACHE_KV_INT4", 0))
 
+
         # infer config
         self.max_batch_size = int(env.get("BATCH_SIZE", 50))
         self.max_seq_len = int(env.get("MAX_SEQ_LEN", 8192))
@@ -168,6 +175,20 @@ def check(self):
             f"which means the exported MAX_DEC_LEN should less than "
             f"{self.max_seq_len}, but now it's {self.dec_len_limit}."
         )
+        if os.getenv("DISABLE_CAPACITY_CHECKER", "0") == 1:
+            # max_output_token_num
+            max_output_token_num = (self.total_block_num - self.max_block_num) * self.block_size + self.enc_dec_block_num * self.block_size
+            assert max_output_token_num >= self.dec_len_limit, (
+                f"The available output token number of the service is {max_output_token_num}, "
+                f"which is less than the setting MAX_DEC_LEN:{self.dec_len_limit}. "
+            )
+    
+            # Maximum input length of a single query that the service can handle
+            max_input_token_num = int(math.floor(self.max_block_num * self.block_size - self.dec_token_num))
+            assert max_input_token_num >= self.seq_len_limit, (
+                f"The available input token number of the service is {max_input_token_num}, "
+                f"which is less than the setting MAX_SEQ_LEN:{self.seq_len_limit}. "
+            )
 
     def print(self, file=None):
         """
diff --git a/llm/server/server/server/engine/engine.py b/llm/server/server/server/engine/engine.py
index 932404d9c094..4bf0fcfeffb8 100644
--- a/llm/server/server/server/engine/engine.py
+++ b/llm/server/server/server/engine/engine.py
@@ -50,7 +50,9 @@ def start(self):
         """
         assert not self.is_started, "The engine is already started.!"
         start_time = time.time()
-        self.queue_service = self._start_tasks_queue_service()
+        # Master node only
+        if self.cfg.nnode == 1 or self.cfg.host_ip == os.getenv('POD_0_IP', '127.0.0.1'):
+            self.queue_service = self._start_tasks_queue_service()
         self.tasks_queue = TaskQueueManager(mp_num=self.cfg.mp_num, port=self.cfg.infer_port)
 
         self.token_processor.tasks_queue = self.tasks_queue
@@ -258,7 +260,7 @@ def _infer_processes_ready(self):
         Returns:
             return: True if all ready, False otherwise
         """
-        if np.sum(self.flag_ready_array) == self.cfg.mp_num:
+        if np.sum(self.flag_ready_array) == self.cfg.mp_num_per_node:
             return True
         return False
 
@@ -378,7 +380,8 @@ def _start_gpu_infer_service(self):
         pd_cmd = "python3 -m paddle.distributed.launch "
         py_script = os.path.join(current_dir_path, "infer.py")
 
-        arguments = (f" --devices {self.cfg.device_ids} {py_script} --model_dir {self.cfg.model_dir}"
+        arguments = (f" --nnodes {str(self.cfg.nnode)}"
+                    f" --devices {self.cfg.device_ids} {py_script} --model_dir {self.cfg.model_dir}"
                     f" --max_batch_size {self.cfg.max_batch_size} --max_seq_len {self.cfg.max_seq_len}"
                     f" --max_dec_len {self.cfg.max_dec_len}"
                     f" --max_block_num {self.cfg.total_block_num} --block_size {self.cfg.block_size}"
diff --git a/llm/server/server/server/engine/infer.py b/llm/server/server/server/engine/infer.py
index b2c021b8b1af..030657cbb0f6 100644
--- a/llm/server/server/server/engine/infer.py
+++ b/llm/server/server/server/engine/infer.py
@@ -26,14 +26,19 @@
 import paddle.distributed as dist
 import paddle.distributed.fleet as fleet
 from paddle.base.framework import use_pir_api
-from paddlenlp.trl.llm_utils import get_rotary_position_embedding
 from paddlenlp_ops import step_paddle
 from server.data.processor import DataProcessor
 from server.engine.config import Config
-from paddlenlp.experimental.transformers import InferenceWithReferenceProposer
 from server.utils import get_logger
 from task_queue_manager import TaskQueueManager
 
+from paddlenlp.experimental.transformers import InferenceWithReferenceProposer
+from paddlenlp.trl.llm_utils import get_rotary_position_embedding
+from paddlenlp.utils.env import (
+    PADDLE_INFERENCE_MODEL_SUFFIX,
+    PADDLE_INFERENCE_WEIGHTS_SUFFIX,
+)
+
 File_Path = os.path.realpath(sys.argv[0])
 Dir_Path = os.path.dirname(File_Path)
 logger = get_logger("infer_server", "infer.log")
@@ -55,8 +60,11 @@ def __init__(self, args):
         self.args.num_layers = self.get_value(self.model_cfg, ["num_hidden_layers", "num_layers"])
         self.args.num_attention_heads = self.get_value(self.model_cfg, ["num_attention_heads", "n_head"])
         self.args.hidden_size = self.model_cfg["hidden_size"]
+        if "deepseek" in self.model_cfg["model_type"]:
+            self.qk_nope_head_dim = int(self.model_cfg["qk_nope_head_dim"])
+            self.qk_rope_head_dim = int(self.model_cfg["qk_rope_head_dim"])
+            self.v_head_dim = int(self.model_cfg["v_head_dim"])
 
-        self.reduce_dialogue_repetition = int(os.environ.get("REDUCE_DIALOGUE_REPETITION", 0))
 
         self.max_stop_seqs_num = int(os.getenv("MAX_STOP_SEQS_NUM", 5))
         self.stop_seqs_max_len = int(os.getenv("STOP_SEQS_MAX_LEN", 8))
@@ -72,13 +80,14 @@ def __init__(self, args):
         self.init_inputs()
 
         if self.is_speculate_decoding:
-            logger.info(f'Using speculate decoding, method: {self.speculate_config.speculate_method}.')
+            logger.info(f"Using speculate decoding, method: {self.speculate_config.speculate_method}.")
             if self.speculate_config.speculate_method == "inference_with_reference":
                 self.proposer = InferenceWithReferenceProposer(
                     self.speculate_config.speculate_max_draft_token_num,
                     self.speculate_config.speculate_max_ngram_size,
                     self.args.max_batch_size,
-                    self.args.max_seq_len)
+                    self.args.max_seq_len,
+                )
         else:
             self.proposer = None
 
@@ -88,12 +97,13 @@ def __init__(self, args):
         if not os.path.exists(model_rank_path):
             model_rank_path = self.args.model_dir
 
-        self.infer_engine = InferenceEngine(model_dir=model_rank_path,
-                                            share_inputs=self.share_inputs,
-                                            cache_kvs=self.cache_kvs,
-                                            config=self.config,
-                                            mp_degree=self.nranks
-                                        )
+        self.infer_engine = InferenceEngine(
+            model_dir=model_rank_path,
+            share_inputs=self.share_inputs,
+            cache_kvs=self.cache_kvs,
+            config=self.config,
+            mp_degree=self.nranks,
+        )
 
     def read_model_config(self):
         """
@@ -102,7 +112,7 @@ def read_model_config(self):
         Returns:
             model_config_json: dict, model config file
         """
-        model_config_json = json.load(open(self.config_file, 'r', encoding='utf-8'))
+        model_config_json = json.load(open(self.config_file, "r", encoding="utf-8"))
         return model_config_json
 
     def get_value(self, cfg, names):
@@ -115,9 +125,7 @@ def get_value(self, cfg, names):
             if name in cfg:
                 return cfg[name]
             break
-        raise Exception(
-            "Cannot find any one of key in {} in configuration file.".format(
-                names))
+        raise Exception("Cannot find any one of key in {} in configuration file.".format(names))
 
     def format_print_configuration(self):
         """
@@ -137,13 +145,13 @@ def load_model_init_val(self):
         """
         self.top_p = self.model_cfg.get("top_p", 0.0)
         self.temperature = self.model_cfg.get("temperature", 1.0)
-        self.rope_theta = self.model_cfg.get('rope_theta', 10000.0)
-        self.rope_scaling = self.model_cfg.get('rope_scaling', None)
-        self.penalty_score = self.model_cfg.get('penalty_score', 1.0)
-        self.frequency_score = self.model_cfg.get('frequency_score', 0.0)
-        self.presence_score = self.model_cfg.get('presence_score', 0.0)
-        self.min_length = self.model_cfg.get('min_length', 1)
-        self.max_length = self.model_cfg.get('max_length', 1024)
+        self.rope_theta = self.model_cfg.get("rope_theta", 10000.0)
+        self.rope_scaling = self.model_cfg.get("rope_scaling", None)
+        self.penalty_score = self.model_cfg.get("penalty_score", 1.0)
+        self.frequency_score = self.model_cfg.get("frequency_score", 0.0)
+        self.presence_score = self.model_cfg.get("presence_score", 0.0)
+        self.min_length = self.model_cfg.get("min_length", 1)
+        self.max_length = self.model_cfg.get("max_length", 1024)
 
         data_processor = DataProcessor()
         # reserve an eos token for request
@@ -169,9 +177,11 @@ def init_dist_env(self, seed=20):
 
     def init_inputs(self):
         # init all inputs
-        if "num_key_value_heads" in self.model_cfg and \
-                self.model_cfg["num_key_value_heads"] is not None and \
-                int(self.model_cfg["num_key_value_heads"]) > 0:
+        if (
+            "num_key_value_heads" in self.model_cfg
+            and self.model_cfg["num_key_value_heads"] is not None
+            and int(self.model_cfg["num_key_value_heads"]) > 0
+        ):
             kv_num_head = int(self.model_cfg["num_key_value_heads"]) // self.nranks
         else:
             kv_num_head = self.args.num_attention_heads // self.nranks
@@ -181,114 +191,154 @@ def init_inputs(self):
                 cache_type = self.args.dtype
             else:
                 cache_type = "uint8"
-
-            self.cache_kvs["key_caches_{}".format(i)] = paddle.full(shape=[
-                self.args.max_block_num, kv_num_head,
-                self.args.block_size, self.args.hidden_size // self.args.num_attention_heads
-            ], fill_value=0, dtype=cache_type)
-            self.cache_kvs["value_caches_{}".format(i)] = paddle.full(shape=[
-                self.args.max_block_num, kv_num_head,
-                self.args.block_size, self.args.hidden_size // self.args.num_attention_heads
-            ], fill_value=0, dtype=cache_type)
-
-        pre_max_block_num = (self.args.max_seq_len + self.args.block_size - 1) // self.args.block_size + self.args.enc_dec_block_num
+            
+            if "deepseek" in self.model_cfg["model_type"]:
+                self.cache_kvs["key_caches_{}".format(i)] = paddle.full(shape=[
+                    self.args.max_block_num, kv_num_head,
+                    self.args.block_size,
+                    self.qk_nope_head_dim + self.qk_rope_head_dim
+                ], fill_value=0, dtype=cache_type)
+                self.cache_kvs["value_caches_{}".format(i)] = paddle.full(shape=[
+                    self.args.max_block_num, kv_num_head,
+                    self.args.block_size, self.v_head_dim
+                ], fill_value=0, dtype=cache_type)
+            else:
+                self.cache_kvs["key_caches_{}".format(i)] = paddle.full(shape=[
+                    self.args.max_block_num, kv_num_head,
+                    self.args.block_size, self.args.hidden_size // self.args.num_attention_heads
+                ], fill_value=0, dtype=cache_type)
+                self.cache_kvs["value_caches_{}".format(i)] = paddle.full(shape=[
+                    self.args.max_block_num, kv_num_head,
+                    self.args.block_size, self.args.hidden_size // self.args.num_attention_heads
+                ], fill_value=0, dtype=cache_type)
+
+        pre_max_block_num = (
+            self.args.max_seq_len + self.args.block_size - 1
+        ) // self.args.block_size + self.args.enc_dec_block_num
         self.share_inputs["block_tables"] = paddle.full(
-                        shape=[self.args.max_batch_size, pre_max_block_num], fill_value=-1, dtype="int32")
+            shape=[self.args.max_batch_size, pre_max_block_num], fill_value=-1, dtype="int32"
+        )
 
-        self.share_inputs['pre_ids'] = paddle.to_tensor(
-                        np.full((self.args.max_batch_size, self.args.max_dec_len), -1, dtype='int64'))
+        self.share_inputs["pre_ids"] = paddle.to_tensor(
+            np.full((self.args.max_batch_size, self.args.max_dec_len), -1, dtype="int64")
+        )
 
         tmp_position_ids = paddle.arange(self.args.max_seq_len).reshape((1, -1))
-        self.share_inputs['rope_emb'] = get_rotary_position_embedding(tmp_position_ids,
-                        self.args.hidden_size // self.args.num_attention_heads,
-                        self.rope_theta, self.rope_scaling)
-        self.share_inputs['input_ids'] = paddle.full(
-                        shape=[self.args.max_batch_size, self.args.max_seq_len],
-                        fill_value=self.pad_token_id, dtype='int64')
-        self.share_inputs['top_p'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=self.top_p, dtype="float32")
-        self.share_inputs['temperature'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=self.temperature, dtype="float32")
-        self.share_inputs['eos_token_id'] = paddle.to_tensor(
-                            np.zeros((self.eos_tokens_lens, 1)).reshape(-1, 1).astype("int64"))
-        self.share_inputs['penalty_score'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=self.penalty_score, dtype="float32")
-        self.share_inputs['frequency_score'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=self.frequency_score, dtype="float32")
-        self.share_inputs['presence_score'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=self.presence_score, dtype="float32")
-        self.share_inputs['seq_lens_this_time'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32")
-        self.share_inputs['seq_lens_encoder'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32")
-        self.share_inputs['step_seq_lens_encoder'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32")
-        self.share_inputs['seq_lens_decoder'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32")
-        self.share_inputs['step_idx'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64")
-        self.share_inputs['min_length'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=self.min_length, dtype="int64")
-        self.share_inputs['max_length'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=self.max_length, dtype="int64")
-        self.share_inputs['not_need_stop'] = paddle.full(
-                            shape=[1], fill_value=False, dtype="bool")
-        self.share_inputs['stop_flags'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=True, dtype="bool")
-        self.share_inputs['stop_nums'] = paddle.full(
-                            shape=[1], fill_value=self.args.max_batch_size, dtype="int64")
-        self.share_inputs['bad_tokens'] = paddle.full(
-                            shape=[1], fill_value=-1, dtype="int64")
-        self.share_inputs['next_tokens'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=-1, dtype="int64")
-        self.share_inputs['is_block_step'] = paddle.full(
-                            shape=[self.args.max_batch_size], fill_value=False, dtype="bool")
-        self.share_inputs['encoder_block_lens'] = paddle.full(
-                            shape=[self.args.max_batch_size], fill_value=0, dtype="int32")
-        self.share_inputs['step_block_list'] = paddle.full(
-                            shape=[self.args.max_batch_size], fill_value=-1, dtype="int32")
-        self.share_inputs['step_lens'] = paddle.full(shape=[1], fill_value=0, dtype="int32")
-        self.share_inputs['recover_block_list'] = paddle.full(
-                            shape=[self.args.max_batch_size], fill_value=-1, dtype="int32")
-        self.share_inputs['recover_lens'] = paddle.full(
-                            shape=[1], fill_value=0, dtype="int32")
-        self.share_inputs['need_block_list'] = paddle.full(
-                            shape=[self.args.max_batch_size], fill_value=-1, dtype="int32")
-        self.share_inputs['need_block_len'] = paddle.full(
-                            shape=[1], fill_value=0, dtype="int32")
-        self.share_inputs['used_list_len'] = paddle.full(
-                            shape=[self.args.max_batch_size], fill_value=0, dtype="int32")
-        self.share_inputs['infer_seed'] = paddle.full(
-                            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64")
+        self.share_inputs["rope_emb"] = get_rotary_position_embedding(
+            tmp_position_ids,
+            self.args.hidden_size // self.args.num_attention_heads,
+            self.rope_theta,
+            self.rope_scaling,
+        )
+        self.share_inputs["input_ids"] = paddle.full(
+            shape=[self.args.max_batch_size, self.args.max_seq_len], fill_value=self.pad_token_id, dtype="int64"
+        )
+        self.share_inputs["top_p"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=self.top_p, dtype="float32"
+        )
+        self.share_inputs["temperature"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=self.temperature, dtype="float32"
+        )
+        self.share_inputs["eos_token_id"] = paddle.to_tensor(
+            np.zeros((self.eos_tokens_lens, 1)).reshape(-1, 1).astype("int64")
+        )
+        self.share_inputs["penalty_score"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=self.penalty_score, dtype="float32"
+        )
+        self.share_inputs["frequency_score"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=self.frequency_score, dtype="float32"
+        )
+        self.share_inputs["presence_score"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=self.presence_score, dtype="float32"
+        )
+        self.share_inputs["seq_lens_this_time"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32"
+        )
+        self.share_inputs["seq_lens_encoder"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32"
+        )
+        self.share_inputs["step_seq_lens_encoder"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32"
+        )
+        self.share_inputs["seq_lens_decoder"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32"
+        )
+        self.share_inputs["step_idx"] = paddle.full(shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64")
+        self.share_inputs["min_length"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=self.min_length, dtype="int64"
+        )
+        self.share_inputs["max_length"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=self.max_length, dtype="int64"
+        )
+        self.share_inputs["not_need_stop"] = paddle.full(shape=[1], fill_value=False, dtype="bool")
+        self.share_inputs["stop_flags"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=True, dtype="bool"
+        )
+        self.share_inputs["stop_nums"] = paddle.full(shape=[1], fill_value=self.args.max_batch_size, dtype="int64")
+        self.share_inputs["bad_tokens"] = paddle.full(shape=[1], fill_value=-1, dtype="int64")
+        self.share_inputs["next_tokens"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=-1, dtype="int64"
+        )
+        self.share_inputs["is_block_step"] = paddle.full(
+            shape=[self.args.max_batch_size], fill_value=False, dtype="bool"
+        )
+        self.share_inputs["encoder_block_lens"] = paddle.full(
+            shape=[self.args.max_batch_size], fill_value=0, dtype="int32"
+        )
+        self.share_inputs["step_block_list"] = paddle.full(
+            shape=[self.args.max_batch_size], fill_value=-1, dtype="int32"
+        )
+        self.share_inputs["step_lens"] = paddle.full(shape=[1], fill_value=0, dtype="int32")
+        self.share_inputs["recover_block_list"] = paddle.full(
+            shape=[self.args.max_batch_size], fill_value=-1, dtype="int32"
+        )
+        self.share_inputs["recover_lens"] = paddle.full(shape=[1], fill_value=0, dtype="int32")
+        self.share_inputs["need_block_list"] = paddle.full(
+            shape=[self.args.max_batch_size], fill_value=-1, dtype="int32"
+        )
+        self.share_inputs["need_block_len"] = paddle.full(shape=[1], fill_value=0, dtype="int32")
+        self.share_inputs["used_list_len"] = paddle.full(shape=[self.args.max_batch_size], fill_value=0, dtype="int32")
+        self.share_inputs["infer_seed"] = paddle.full(shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64")
         free_list = list(range(int(self.args.max_block_num * self.args.block_ratio)))
         self.free_list_len = len(free_list)
-        self.share_inputs['free_list'] = paddle.to_tensor(free_list, dtype="int32")
-        self.share_inputs['free_list_len'] = paddle.full(
-                            shape=[1], fill_value=self.free_list_len, dtype="int32")
-
-        self.share_inputs['stop_seqs_len'] = paddle.full(shape=[self.max_stop_seqs_num,],
-                                            fill_value=0,
-                                            dtype="int32")
-        self.share_inputs['stop_seqs'] = paddle.full(shape=[self.max_stop_seqs_num, self.stop_seqs_max_len],
-                                                fill_value=-1,
-                                                dtype="int64")
-
-        if self.reduce_dialogue_repetition:
-            self.share_inputs["first_token_ids"] = paddle.full(
-                shape=[self.args.max_batch_size, 1], fill_value=-1, dtype="int64")
-            self.share_inputs["ori_seq_lens_encoder"] = paddle.full(
-                shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32")
+        self.share_inputs["free_list"] = paddle.to_tensor(free_list, dtype="int32")
+        self.share_inputs["free_list_len"] = paddle.full(shape=[1], fill_value=self.free_list_len, dtype="int32")
+
+        self.share_inputs["stop_seqs_len"] = paddle.full(
+            shape=[
+                self.max_stop_seqs_num,
+            ],
+            fill_value=0,
+            dtype="int32",
+        )
+        self.share_inputs["stop_seqs"] = paddle.full(
+            shape=[self.max_stop_seqs_num, self.stop_seqs_max_len], fill_value=-1, dtype="int64"
+        )
+
+
+        self.share_inputs["first_token_ids"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=-1, dtype="int64")
+        self.share_inputs["ori_seq_lens_encoder"] = paddle.full(
+            shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32")
         # speculate decoding input
         if self.is_speculate_decoding:
             self.share_inputs["accept_tokens"] = paddle.full(
-                shape=[self.args.max_batch_size, self.speculate_config.speculate_max_draft_token_num + 1], fill_value=0, dtype="int64"
+                shape=[self.args.max_batch_size, self.speculate_config.speculate_max_draft_token_num + 1],
+                fill_value=0,
+                dtype="int64",
+            )
+            self.share_inputs["accept_num"] = paddle.full(
+                shape=[self.args.max_batch_size], fill_value=0, dtype="int32"
             )
-            self.share_inputs["accept_num"] = paddle.full(shape=[self.args.max_batch_size], fill_value=0, dtype="int32")
             self.share_inputs["draft_tokens"] = paddle.full(
-                shape=[self.args.max_batch_size, self.speculate_config.speculate_max_draft_token_num + 1], fill_value=0, dtype="int64"
+                shape=[self.args.max_batch_size, self.speculate_config.speculate_max_draft_token_num + 1],
+                fill_value=0,
+                dtype="int64",
             )
             self.share_inputs["actual_draft_token_num"] = paddle.full(
-                shape=[self.args.max_batch_size], fill_value=self.speculate_config.speculate_max_draft_token_num, dtype="int32"
+                shape=[self.args.max_batch_size],
+                fill_value=self.speculate_config.speculate_max_draft_token_num,
+                dtype="int32",
             )
 
     def dy_input_preprocess(self, tasks):
@@ -297,58 +347,63 @@ def dy_input_preprocess(self, tasks):
         """
         for i in range(len(tasks)):
             task = tasks[i]
-            idx = task['idx']
-            length = len(task['input_ids'])
-            self.share_inputs['input_ids'][idx:idx + 1, :length] = np.array(task['input_ids'])
-            if len(task['eos_token_ids']) < self.eos_tokens_lens:
-                task['eos_token_ids'].append(task['eos_token_ids'][0])
-            self.share_inputs['eos_token_id'][:] = np.array(task['eos_token_ids'], dtype="int64").reshape(-1, 1)
-            self.share_inputs['pre_ids'][idx:idx + 1] = -1
-            self.share_inputs['top_p'][idx:idx + 1] = task.get('topp', 0.7)
-            self.share_inputs['temperature'][idx:idx + 1] = task.get('temperature', 0.95)
-            self.share_inputs['penalty_score'][idx:idx + 1] = task.get('penalty_score', 1.0)
-            self.share_inputs['frequency_score'][idx:idx + 1] = task.get('frequency_score', 0.0)
-            self.share_inputs['presence_score'][idx:idx + 1] = task.get('presence_score', 0.0)
-            self.share_inputs['seq_lens_this_time'][idx:idx + 1] = length
-            self.share_inputs['step_seq_lens_encoder'][idx:idx + 1] = length
-            self.share_inputs['seq_lens_encoder'][idx:idx + 1] = length
-            self.share_inputs['seq_lens_decoder'][idx:idx + 1] = 0
-            self.share_inputs['step_idx'][idx:idx + 1] = 0
-            self.share_inputs['min_length'][idx:idx + 1] = task.get('min_dec_len', 1)
+            idx = task["idx"]
+            length = len(task["input_ids"])
+            self.share_inputs["input_ids"][idx : idx + 1, :length] = np.array(task["input_ids"])
+            if len(task["eos_token_ids"]) < self.eos_tokens_lens:
+                task["eos_token_ids"].append(task["eos_token_ids"][0])
+            self.share_inputs["eos_token_id"][:] = np.array(task["eos_token_ids"], dtype="int64").reshape(-1, 1)
+            self.share_inputs["pre_ids"][idx : idx + 1] = -1
+            self.share_inputs["top_p"][idx : idx + 1] = task.get("topp", 0.7)
+            self.share_inputs["temperature"][idx : idx + 1] = task.get("temperature", 0.95)
+            self.share_inputs["penalty_score"][idx : idx + 1] = task.get("penalty_score", 1.0)
+            self.share_inputs["frequency_score"][idx : idx + 1] = task.get("frequency_score", 0.0)
+            self.share_inputs["presence_score"][idx : idx + 1] = task.get("presence_score", 0.0)
+            self.share_inputs["seq_lens_this_time"][idx : idx + 1] = length
+            self.share_inputs["step_seq_lens_encoder"][idx : idx + 1] = length
+            self.share_inputs["seq_lens_encoder"][idx : idx + 1] = length
+            self.share_inputs["seq_lens_decoder"][idx : idx + 1] = 0
+            self.share_inputs["step_idx"][idx : idx + 1] = 0
+            self.share_inputs["min_length"][idx : idx + 1] = task.get("min_dec_len", 1)
             if "max_dec_len" in task:
-                max_dec_len = task['max_dec_len']
+                max_dec_len = task["max_dec_len"]
             elif "seq_len" in task:
-                max_dec_len = task['seq_len']
+                max_dec_len = task["seq_len"]
             else:
                 max_dec_len = self.args.max_dec_len
-            self.share_inputs['max_length'][idx:idx + 1] = max_dec_len
-            self.share_inputs['stop_flags'][idx:idx + 1] = False
+            self.share_inputs["max_length"][idx : idx + 1] = max_dec_len
+            self.share_inputs["stop_flags"][idx : idx + 1] = False
+
 
-            if self.reduce_dialogue_repetition:
-                self.share_inputs['first_token_ids'][idx:idx + 1] =  self.share_inputs['input_ids'][idx:idx + 1, :1]
-                self.share_inputs["ori_seq_lens_encoder"][idx:idx + 1] = length
+            self.share_inputs['first_token_ids'][idx:idx + 1] =  self.share_inputs['input_ids'][idx:idx + 1, :1]
+            self.share_inputs["ori_seq_lens_encoder"][idx:idx + 1] = length
 
             if "infer_seed" in task:
-                self.share_inputs['infer_seed'][idx:idx + 1] = task['infer_seed']
+                self.share_inputs["infer_seed"][idx : idx + 1] = task["infer_seed"]
 
-            encoder_block_num = len(task['block_tables'])
-            self.share_inputs['encoder_block_lens'][idx:idx + 1] = encoder_block_num
-            self.share_inputs["block_tables"][idx:idx + 1, :] = -1
-            self.share_inputs["block_tables"][idx:idx + 1, :encoder_block_num] = np.array(
-                                            task['block_tables'], dtype="int32")
+            encoder_block_num = len(task["block_tables"])
+            self.share_inputs["encoder_block_lens"][idx : idx + 1] = encoder_block_num
+            self.share_inputs["block_tables"][idx : idx + 1, :] = -1
+            self.share_inputs["block_tables"][idx : idx + 1, :encoder_block_num] = np.array(
+                task["block_tables"], dtype="int32"
+            )
 
             if "stop_seqs_len" in task:
                 stop_seqs_num = len(task["stop_seqs_len"])
                 for i in range(stop_seqs_num, self.max_stop_seqs_num):
                     task["stop_seqs_len"].append(0)
-                self.share_inputs['stop_seqs_len'][:] = np.array(
-                                                        task["stop_seqs_len"], dtype="int32")
-                self.share_inputs['stop_seqs'][:stop_seqs_num, :len(task['stop_seqs'][0])] = np.array(
-                                                        task["stop_seqs"], dtype="int64")
+                self.share_inputs["stop_seqs_len"][:] = np.array(task["stop_seqs_len"], dtype="int32")
+                self.share_inputs["stop_seqs"][:stop_seqs_num, : len(task["stop_seqs"][0])] = np.array(
+                    task["stop_seqs"], dtype="int64"
+                )
 
             if self.is_speculate_decoding:
-                self.share_inputs["draft_tokens"][idx:idx + 1] = np.zeros([self.speculate_config.speculate_max_draft_token_num + 1])
-                self.share_inputs["actual_draft_token_num"][idx:idx + 1] = np.array([self.speculate_config.speculate_max_draft_token_num])
+                self.share_inputs["draft_tokens"][idx : idx + 1] = np.zeros(
+                    [self.speculate_config.speculate_max_draft_token_num + 1]
+                )
+                self.share_inputs["actual_draft_token_num"][idx : idx + 1] = np.array(
+                    [self.speculate_config.speculate_max_draft_token_num]
+                )
 
     def step_cuda(self, seq_lens_this_time):
         """
@@ -371,9 +426,8 @@ def step_cuda(self, seq_lens_this_time):
                     self.share_inputs['need_block_len'], self.share_inputs['used_list_len'],
                     self.share_inputs['free_list'], self.share_inputs['free_list_len'],
                     self.share_inputs['input_ids'], self.share_inputs['pre_ids'],
-                    self.share_inputs['step_idx'], self.share_inputs['next_tokens'],
-                    self.args.block_size, self.args.enc_dec_block_num, self.args.first_token_id,
-                    speculate_step_token_num)
+                    self.share_inputs['step_idx'], self.share_inputs['next_tokens'], self.share_inputs['first_token_ids'],
+                    self.args.block_size, self.args.enc_dec_block_num, 0)
 
     def initialize_engine_ready_check_flag(self):
         """
@@ -385,10 +439,11 @@ def initialize_engine_ready_check_flag(self):
         """
         engine_ready_check_flag = np.zeros([1], dtype=np.int32)
         shm_engine_ready_check_flag = shared_memory.SharedMemory(
-                        name=self.config.get_unique_name("engine_ready_check_flag"))
-        engine_ready_check_flag_array = np.ndarray(engine_ready_check_flag.shape,
-                                            dtype=engine_ready_check_flag.dtype,
-                                            buffer=shm_engine_ready_check_flag.buf)
+            name=self.config.get_unique_name("engine_ready_check_flag")
+        )
+        engine_ready_check_flag_array = np.ndarray(
+            engine_ready_check_flag.shape, dtype=engine_ready_check_flag.dtype, buffer=shm_engine_ready_check_flag.buf
+        )
         return shm_engine_ready_check_flag, engine_ready_check_flag_array
 
     def initialize_engine_live_flag(self):
@@ -398,9 +453,9 @@ def initialize_engine_live_flag(self):
         Returns:
             infer_live_flag_shm: infer live flag
         """
-        infer_live_flag_shm = shared_memory.SharedMemory(create=True,
-                                            size=1,
-                                            name=self.config.get_unique_name("shm_flag_infer_{}_live".format(self.rank)))
+        infer_live_flag_shm = shared_memory.SharedMemory(
+            create=True, size=1, name=self.config.get_unique_name("shm_flag_infer_{}_live".format(self.rank))
+        )
         return infer_live_flag_shm
 
     def initialize_engine_healthy_recorded_time_flag(self):
@@ -412,10 +467,13 @@ def initialize_engine_healthy_recorded_time_flag(self):
         """
         engine_healthy_recorded_time = np.zeros([1], dtype=float)
         shm_engine_healthy_recorded_time = shared_memory.SharedMemory(
-                    name=self.config.get_unique_name("engine_healthy_recorded_time"))
-        engine_healthy_recorded_time_array = np.ndarray(engine_healthy_recorded_time.shape,
-                                            dtype=engine_healthy_recorded_time.dtype,
-                                            buffer=shm_engine_healthy_recorded_time.buf)
+            name=self.config.get_unique_name("engine_healthy_recorded_time")
+        )
+        engine_healthy_recorded_time_array = np.ndarray(
+            engine_healthy_recorded_time.shape,
+            dtype=engine_healthy_recorded_time.dtype,
+            buffer=shm_engine_healthy_recorded_time.buf,
+        )
         return shm_engine_healthy_recorded_time, engine_healthy_recorded_time_array
 
     def run(self):
@@ -424,35 +482,38 @@ def run(self):
         """
         flag_array = np.zeros([1], dtype=np.int32)
         shm_flag_broadcast = shared_memory.SharedMemory(
-                        name=self.config.get_unique_name("shm_pd_infer_flag_broadcast"))
-        flag_broadcast_array = np.ndarray(flag_array.shape,
-                                            dtype=flag_array.dtype,
-                                            buffer=shm_flag_broadcast.buf)
+            name=self.config.get_unique_name("shm_pd_infer_flag_broadcast")
+        )
+        flag_broadcast_array = np.ndarray(flag_array.shape, dtype=flag_array.dtype, buffer=shm_flag_broadcast.buf)
 
         flag_array = np.zeros([self.nranks], dtype=np.int32)
         shm_flag_ready = shared_memory.SharedMemory(name=self.config.get_unique_name("shm_flag_infer_ready"))
-        flag_ready_array = np.ndarray(flag_array.shape,
-                                        dtype=flag_array.dtype,
-                                        buffer=shm_flag_ready.buf)
+        flag_ready_array = np.ndarray(flag_array.shape, dtype=flag_array.dtype, buffer=shm_flag_ready.buf)
         flag_ready_array[self.rank] = 1
 
         flag_array = np.zeros([1], dtype=np.int32)
-        shm_flag_has_block_step = shared_memory.SharedMemory(name=self.config.get_unique_name("shm_flag_has_block_step"))
-        flag_has_block_step_array = np.ndarray(flag_array.shape,
-                                                dtype=flag_array.dtype,
-                                                buffer=shm_flag_has_block_step.buf)
+        shm_flag_has_block_step = shared_memory.SharedMemory(
+            name=self.config.get_unique_name("shm_flag_has_block_step")
+        )
+        flag_has_block_step_array = np.ndarray(  # noqa: F841
+            flag_array.shape, dtype=flag_array.dtype, buffer=shm_flag_has_block_step.buf
+        )
 
         use_custom_health_checker = self.config.use_custom_health_checker
         if use_custom_health_checker:
-            shm_engine_ready_check_flag_array, engine_ready_check_flag_array = self.initialize_engine_ready_check_flag()
+            (
+                shm_engine_ready_check_flag_array,
+                engine_ready_check_flag_array,
+            ) = self.initialize_engine_ready_check_flag()
             engine_ready_check_flag_array[0] = 1
-            shm_engine_healthy_recorded_time_array, engine_healthy_recorded_time_array = self.initialize_engine_healthy_recorded_time_flag()
+            (
+                shm_engine_healthy_recorded_time_array,
+                engine_healthy_recorded_time_array,
+            ) = self.initialize_engine_healthy_recorded_time_flag()
             engine_healthy_recorded_time_array[0] = time.time()
-            infer_live_flag_shm = self.initialize_engine_live_flag()
-        infer_seed_increment = paddle.full(shape=[self.args.max_batch_size, 1],
-                                            fill_value=4,
-                                            dtype="int64")
-        thread_executor = ThreadPoolExecutor(max_workers=1)
+            infer_live_flag_shm = self.initialize_engine_live_flag()  # noqa: F841
+        infer_seed_increment = paddle.full(shape=[self.args.max_batch_size, 1], fill_value=4, dtype="int64")
+        thread_executor = ThreadPoolExecutor(max_workers=1)  # noqa: F841
         seq_lens_this_time = None
         real_bsz = None
 
@@ -460,14 +521,17 @@ def run(self):
             if use_custom_health_checker:
                 engine_healthy_recorded_time_array[0] = time.time()
 
-            if self.rank == 0:
+            if self.rank % self.config.mp_num_per_node == 0:
                 if not self.infer_queue.empty():
-                    flag_broadcast_array[0] = 1
+                    if self.config.nnode > 1:
+                        self.infer_queue.read_finish_flag.set(1)
+                    else:
+                        flag_broadcast_array[0] = 1
 
             if self.nranks > 1:
                 paddle.distributed.barrier()
 
-            if flag_broadcast_array[0] == 1:
+            if flag_broadcast_array[0] == 1 or self.infer_queue.read_finish_flag.get() == 1:
                 logger.info(f'rank: {self.rank} start to get')
                 if seq_lens_this_time is not None:
                     self.share_inputs["seq_lens_this_time"][:real_bsz] = seq_lens_this_time
@@ -475,23 +539,20 @@ def run(self):
                 tasks, read_finish = self.infer_queue.get()
                 if read_finish:
                     flag_broadcast_array[0] = 0
+                    self.infer_queue.read_finish_flag.set(0)
 
                 req_dicts = []
                 for req_dict, bsz in tasks:
                     real_bsz = int(bsz)
                     req_dicts.extend(req_dict)
-                    logger.info(
-                        f'rank: {self.rank}, real_bsz: {real_bsz}, query_num: {len(req_dicts)}'
-                    )
+                    logger.info(f"rank: {self.rank}, real_bsz: {real_bsz}, query_num: {len(req_dicts)}")
 
                 self.dy_input_preprocess(req_dicts)
-                seq_lens_this_time = copy.deepcopy(
-                    self.share_inputs['seq_lens_this_time'][:real_bsz])
-                self.infer_engine.seq_lens_handle.share_external_data(
-                    seq_lens_this_time)
-                self.share_inputs['not_need_stop'][0] = True
+                seq_lens_this_time = copy.deepcopy(self.share_inputs["seq_lens_this_time"][:real_bsz])
+                self.infer_engine.seq_lens_handle.share_external_data(seq_lens_this_time)
+                self.share_inputs["not_need_stop"][0] = True
 
-            if not self.share_inputs['not_need_stop']:
+            if not self.share_inputs["not_need_stop"]:
                 if self.nranks > 1:
                     paddle.distributed.barrier()
 
@@ -506,8 +567,8 @@ def run(self):
                 )
 
             self.infer_engine.predictor.run()
-            self.share_inputs['infer_seed'].add_(infer_seed_increment)
-            self.share_inputs['infer_seed'][:] %= self.MAX_INFER_SEED
+            self.share_inputs["infer_seed"].add_(infer_seed_increment)
+            self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
             if self.free_list_len > 0:
                 self.step_cuda(seq_lens_this_time)
 
@@ -520,6 +581,7 @@ class InferenceEngine(object):
         model_dir (string): root directory of inference model
         mp_degree (int): model parallel size
     """
+
     def __init__(self, model_dir, share_inputs, cache_kvs, config, mp_degree=1):
         self.config = config
         self.model_dir = model_dir
@@ -542,7 +604,7 @@ def _init_predictor(self):
         """
         predictor init
         """
-        device_id = self.rank % 8
+        device_id = self.rank % self.config.mp_num_per_node
         if use_pir_api():
             self.model_file = os.path.join(self.model_dir, f"model.json")
             self.param_file = os.path.join(self.model_dir, f"model.pdiparams")
@@ -553,33 +615,13 @@ def _init_predictor(self):
 
         config.enable_use_gpu(100, device_id)
 
-        pir_flag = int(os.environ.get("FLAGS_enable_pir_api", 0))
-        if pir_flag == 1:
+        if use_pir_api():
             config.enable_new_executor()
             config.enable_new_ir()
 
-        # distributed config
-        if self.mp_degree > 1:
-            trainer_endpoints = fleet.worker_endpoints()
-            current_endpoint = trainer_endpoints[self.rank]
-            dist_config = config.dist_config()
-            dist_config.set_ranks(self.nranks, self.rank)
-            dist_config.set_endpoints(trainer_endpoints, current_endpoint)
-            dist_config.enable_dist_model(True)
-            if self.config.distributed_config_path:
-                 dist_config.set_comm_init_config(self.config.distributed_config_path)
-            else:
-                raise Exception("Please set DISTRIBUTED_CONFIG env variable.")
-                logger.warning(
-                    f"Use default distributed config, please set env DISTRIBUTED_CONFIG"
-                )
-                dist_config.set_comm_init_config(
-                    os.path.join(Dir_Path + "/config", "rank_mapping_mp{}.csv".format(self.nranks)))
-
-            config.set_dist_config(dist_config)
         self.predictor = paddle.inference.create_predictor(config)
         self.input_names = self.predictor.get_input_names()
-        self.seq_lens_handle = self.predictor.get_input_handle('seq_lens_this_time')
+        self.seq_lens_handle = self.predictor.get_input_handle("seq_lens_this_time")
 
     def share_data(self):
         """
@@ -595,17 +637,6 @@ def share_data(self):
             input_tensor = self.predictor.get_input_handle(name)
             input_tensor.share_external_data(self.share_inputs[name])
 
-    def predict(self, real_bsz):
-        """
-        predict
-        """
-        seq_lens_this_time = copy.deepcopy(
-            self.share_inputs['seq_lens_this_time'][:real_bsz])
-        self.seq_lens_handle.share_external_data(seq_lens_this_time)
-        self.share_inputs['not_need_stop'][0] = True
-        while self.share_inputs['not_need_stop']:
-            self.predictor.run()
-        self.share_inputs["seq_lens_this_time"][:real_bsz] = seq_lens_this_time
 
 
 def parse_args():
@@ -613,51 +644,18 @@ def parse_args():
     parse args from command line
     """
     parser = argparse.ArgumentParser("FastDeploy LLM Inference")
-    parser.add_argument('-m',
-                        '--model_dir',
-                        type=str,
-                        default='./output',
-                        help='model dir')
-    parser.add_argument('-mp',
-                        '--mp_degree',
-                        type=int,
-                        default=1,
-                        help='mp degree')
-    parser.add_argument('-mbs',
-                        '--max_batch_size',
-                        type=int,
-                        default=34,
-                        help='max batch size')
-    parser.add_argument('--max_block_num', type=int, default=2000)
+    parser.add_argument("-m", "--model_dir", type=str, default="./output", help="model dir")
+    parser.add_argument("-mp", "--mp_degree", type=int, default=1, help="mp degree")
+    parser.add_argument("-mbs", "--max_batch_size", type=int, default=34, help="max batch size")
+    parser.add_argument("--max_block_num", type=int, default=2000)
     parser.add_argument("--block_size", type=int, default=128)
-    parser.add_argument('--max_seq_len',
-                        type=int,
-                        default=3072,
-                        help='max_seq_len')
-    parser.add_argument('--max_dec_len',
-                        type=int,
-                        default=1024,
-                        help='max_dec_len')
-    parser.add_argument('--use_cache_kv_int8',
-                        type=int,
-                        default=0,
-                        help='use cache kv int8')
-    parser.add_argument('--dtype',
-                        type=str,
-                        default="bfloat16",
-                        help='input dtype')
-    parser.add_argument('--enc_dec_block_num',
-                        type=int,
-                        default=1,
-                        help="encoder's decoder num")
-    parser.add_argument('--block_ratio',
-                        type=float,
-                        default=0.7,
-                        help="block ratio")
-    parser.add_argument('--first_token_id',
-                        type=int,
-                        default=1,
-                        help="first token id")
+    parser.add_argument("--max_seq_len", type=int, default=3072, help="max_seq_len")
+    parser.add_argument("--max_dec_len", type=int, default=1024, help="max_dec_len")
+    parser.add_argument("--use_cache_kv_int8", type=int, default=0, help="use cache kv int8")
+    parser.add_argument("--dtype", type=str, default="bfloat16", help="input dtype")
+    parser.add_argument("--enc_dec_block_num", type=int, default=1, help="encoder's decoder num")
+    parser.add_argument("--block_ratio", type=float, default=0.7, help="block ratio")
+    parser.add_argument("--first_token_id", type=int, default=1, help="first token id")
     args = parser.parse_args()
     return args
 
diff --git a/llm/server/server/server/engine/task_queue_manager.py b/llm/server/server/server/engine/task_queue_manager.py
index 475365d47fba..a0b70c88b4a7 100644
--- a/llm/server/server/server/engine/task_queue_manager.py
+++ b/llm/server/server/server/engine/task_queue_manager.py
@@ -49,8 +49,9 @@ def __init__(self, rank=0, mp_num=8, port=56666):
         QueueManager.register('get_barrier1')
         QueueManager.register('get_barrier2')
         QueueManager.register('get_queue')
+        QueueManager.register('get_read_finish_flag')
 
-        self.client_manager = QueueManager(address=('127.0.0.1', port),
+        self.client_manager = QueueManager(address=(os.getenv("POD_0_IP","127.0.0.1"), port),
                                            authkey=b'infer_queue'
                                            )
         self.client_manager.connect()
@@ -60,6 +61,7 @@ def __init__(self, rank=0, mp_num=8, port=56666):
         self.barrier1 = self.client_manager.get_barrier1()
         self.barrier2 = self.client_manager.get_barrier2()
         self.queue = self.client_manager.get_queue()
+        self.read_finish_flag = self.client_manager.get_read_finish_flag()
         self.mp_num = mp_num
         self.rank = rank
         self.position = 1 << rank
@@ -155,7 +157,9 @@ def launch_queue_service(port, num_workers):
         QueueManager.register('get_barrier2', callable=lambda: barrier2)
         q = Queue()
         QueueManager.register("get_queue", callable=lambda: q)
-        m = QueueManager(address=('127.0.0.1', port), authkey=b'infer_queue')
+        read_finish_flag = Value("i", 0)
+        QueueManager.register("get_read_finish_flag", callable=lambda: read_finish_flag, proxytype=ValueProxy)
+        m = QueueManager(address=(os.getenv("POD_0_IP","127.0.0.1"), port), authkey=b'infer_queue')
         s = m.get_server()
         logger.info("launch queue service success")
         s.serve_forever()
diff --git a/llm/server/server/server/triton_server.py b/llm/server/server/server/triton_server.py
index 02be0b4e8aa8..d565b4ce872b 100644
--- a/llm/server/server/server/triton_server.py
+++ b/llm/server/server/server/triton_server.py
@@ -201,7 +201,9 @@ def initialize(self, args):
         self.engine.start()
         model_server_logger.info("Create engine success")
 
-        self._initialize_push_mode()
+        # Master node only
+        if self.cfg.nnode == 1 or os.getenv('POD_0_IP',"127.0.0.1") == self.cfg.host_ip:
+            self._initialize_push_mode()
         model_server_logger.info("Init triton server success")
 
 
diff --git a/llm/server/server/server/triton_server_helper.py b/llm/server/server/server/triton_server_helper.py
index b299cd4204f8..9ca3a7e4ae83 100644
--- a/llm/server/server/server/triton_server_helper.py
+++ b/llm/server/server/server/triton_server_helper.py
@@ -72,7 +72,7 @@ def check_infer_engine_process():
     return:
         status: bool, True if process is alive else False
     """
-    mp_num = int(env_config.mp_num)
+    mp_num = int(env_config.mp_num_per_node)
     for i in range(mp_num):
         try:
             infer_live_flag_shm = shared_memory.SharedMemory(name=env_config.get_unique_name("shm_flag_infer_{}_live".format(i)))
diff --git a/llm/utils/data.py b/llm/utils/data.py
index db9d417743d0..dbecb49778e6 100644
--- a/llm/utils/data.py
+++ b/llm/utils/data.py
@@ -59,11 +59,13 @@ def get_convert_example(model):
         "gpt",
         "yuan",
         "jamba",
+        "deepseek_v2",
+        "deepseek_v3",
     ]:
         return convert_example_common
     else:
         raise ValueError(
-            f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm, bloom, llama, qwen, mixtral, gemma, qwen2, qwen2_moe, yuan, jamba",
+            f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm, bloom, llama, qwen, mixtral, gemma, qwen2, qwen2_moe, yuan, jamba,deepseek_v2, deepseek_v3",
         )
 
 
diff --git a/paddlenlp/datasets/dataset.py b/paddlenlp/datasets/dataset.py
index cf810a5196fc..03e73035b5b2 100644
--- a/paddlenlp/datasets/dataset.py
+++ b/paddlenlp/datasets/dataset.py
@@ -20,6 +20,9 @@
 from collections import namedtuple
 from itertools import islice
 
+# Add this for extremely slow conection to hf sever even for local dataset.
+os.environ["HF_UPDATE_DOWNLOAD_COUNTS"] = "False"
+
 import datasets
 from multiprocess import Pool, RLock
 
@@ -117,6 +120,7 @@ def load_from_hf(path, name=None, splits=None, **kwargs):
             hf_datasets = load_hf_dataset(path, name=name, **kwargs)
         else:
             hf_datasets = load_hf_dataset(path, name=name, split=splits, **kwargs)
+
     except FileNotFoundError:
         raise FileNotFoundError("Couldn't find the dataset script for '" + path + "' on PaddleNLP or HuggingFace")
     else:
diff --git a/paddlenlp/experimental/autonlp/text_classification.py b/paddlenlp/experimental/autonlp/text_classification.py
index 5df473387918..25c2d39bea07 100644
--- a/paddlenlp/experimental/autonlp/text_classification.py
+++ b/paddlenlp/experimental/autonlp/text_classification.py
@@ -47,6 +47,7 @@
 from ...utils.log import logger
 from .auto_trainer_base import AutoTrainerBase
 from .utils import UTCLoss
+from .utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX
 
 
 class AutoTrainerForTextClassification(AutoTrainerBase):
@@ -560,16 +561,16 @@ def export(self, export_path: str, trial_id: Optional[str] = None, compress: boo
         if os.path.exists(default_export_path):
             if "utc" in model_config["model_name_or_path"]:
                 files = [
-                    "model.pdiparams",
-                    "model.pdmodel",
+                    f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}",
+                    f"model{PADDLE_INFERENCE_MODEL_SUFFIX}",
                     "tokenizer_config.json",
                     "vocab.txt",
                     "taskflow_config.json",
                 ]
             else:
                 files = [
-                    "model.pdiparams",
-                    "model.pdmodel",
+                    f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}",
+                    f"model{PADDLE_INFERENCE_MODEL_SUFFIX}",
                     "tokenizer_config.json",
                     "vocab.txt",
                     "taskflow_config.json",
@@ -735,8 +736,8 @@ def _batch_generator_func():
             executor=exe,
             batch_generator=_batch_generator_func,
             model_dir=export_path,
-            model_filename="model.pdmodel",
-            params_filename="model.pdiparams",
+            model_filename=f"model{PADDLE_INFERENCE_MODEL_SUFFIX}",
+            params_filename=f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}",
             batch_size=batch_size,
             batch_nums=batch_nums,
             scope=None,
@@ -757,8 +758,8 @@ def _batch_generator_func():
         post_training_quantization.quantize()
         post_training_quantization.save_quantized_model(
             save_model_path=compress_path,
-            model_filename="model.pdmodel",
-            params_filename="model.pdiparams",
+            model_filename=f"model{PADDLE_INFERENCE_MODEL_SUFFIX}",
+            params_filename=f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}",
         )
 
         paddle.disable_static()
diff --git a/paddlenlp/experimental/transformers/deepseek_v2/__init__.py b/paddlenlp/experimental/transformers/deepseek_v2/__init__.py
new file mode 100644
index 000000000000..c2a7f656c636
--- /dev/null
+++ b/paddlenlp/experimental/transformers/deepseek_v2/__init__.py
@@ -0,0 +1,15 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .modeling import *
diff --git a/paddlenlp/experimental/transformers/deepseek_v2/modeling.py b/paddlenlp/experimental/transformers/deepseek_v2/modeling.py
new file mode 100644
index 000000000000..037208f4a151
--- /dev/null
+++ b/paddlenlp/experimental/transformers/deepseek_v2/modeling.py
@@ -0,0 +1,1226 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import annotations
+
+from functools import partial
+from typing import Tuple
+
+import numpy as np
+import paddle
+from paddle import nn
+from paddle.distributed import fleet
+from paddle.nn.quant import weight_quantize
+
+from paddlenlp.experimental.transformers.fused_transformer_layers import (
+    FusedBlockMultiTransformer,
+    FusedBlockMultiTransformerWeightOnly,
+    FusedMultiTransformerConfig,
+    MLAConfig,
+    MoeConfig,
+    SpeculateConfig,
+)
+from paddlenlp.experimental.transformers.generation_utils import (
+    GenerationBlockInferenceModel,
+)
+from paddlenlp.experimental.transformers.utils import infererence_model_from_pretrained
+from paddlenlp.transformers import DeepseekV2Config, DeepseekV2PretrainedModel
+from paddlenlp.transformers.deepseek_v2.modeling import (
+    DeepseekV2LMHead,
+    yarn_find_correction_range,
+    yarn_get_mscale,
+    yarn_linear_ramp_mask,
+)
+from paddlenlp.transformers.model_outputs import (
+    BaseModelOutputWithPastAndCrossAttentions,
+)
+from paddlenlp.transformers.model_utils import (
+    dy2st_nocheck_guard_context,
+    register_base_model,
+)
+from paddlenlp.utils.log import logger
+
+__all__ = ["DeepseekV2ForCausalLMBlockInferenceModel"]
+
+
+class DeepseekScalingRotaryEmbedding(nn.Layer):
+    """RotaryEmbedding extended with YaRN method.
+
+    Credits to Peng et al. github.com/jquesnelle/yarn
+    """
+
+    def __init__(
+        self,
+        rotary_dim: int,
+        max_position_embeddings: int,
+        base: int,
+        scaling_factor: float,
+        *,
+        extrapolation_factor: float = 1,
+        attn_factor: float = 1,
+        beta_fast: int = 32,
+        beta_slow: int = 1,
+        mscale: float = 1,
+        mscale_all_dim: float = 0,
+    ) -> None:
+        super().__init__()
+        self._dtype = paddle.get_default_dtype()
+
+        self.rotary_dim = rotary_dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+
+        self.scaling_factor = scaling_factor
+        self.extrapolation_factor = extrapolation_factor
+        self.attn_factor = attn_factor
+        self.beta_fast = beta_fast
+        self.beta_slow = beta_slow
+        # Get n-d magnitude scaling corrected for interpolation.
+        self.mscale = float(
+            yarn_get_mscale(self.scaling_factor, float(mscale))
+            / yarn_get_mscale(self.scaling_factor, float(mscale_all_dim))
+            * attn_factor
+        )
+
+        cache = self._compute_cos_sin_cache()
+
+        self.cos_sin_cache: paddle.Tensor
+        self.register_buffer("cos_sin_cache", cache, persistable=True)
+
+    def _compute_inv_freq(self, scaling_factor: float) -> paddle.Tensor:
+        pos_freqs = self.base ** (paddle.arange(0, self.rotary_dim, 2, dtype=paddle.float32) / self.rotary_dim)
+
+        inv_freq_extrapolation = 1.0 / pos_freqs
+        inv_freq_interpolation = 1.0 / (scaling_factor * pos_freqs)
+
+        low, high = yarn_find_correction_range(
+            self.beta_fast, self.beta_slow, self.rotary_dim, self.base, self.max_position_embeddings
+        )
+        # Get n-d rotational scaling corrected for extrapolation
+        inv_freq_mask = (1 - yarn_linear_ramp_mask(low, high, self.rotary_dim // 2)) * self.extrapolation_factor
+        inv_freq = inv_freq_interpolation * (1 - inv_freq_mask) + inv_freq_extrapolation * inv_freq_mask
+        return inv_freq
+
+    def _compute_cos_sin_cache(self) -> paddle.Tensor:
+        inv_freq = self._compute_inv_freq(self.scaling_factor)
+        t = paddle.arange(self.max_position_embeddings * self.scaling_factor, dtype=paddle.float32)
+        freqs = paddle.einsum("i,j->ij", t, inv_freq)
+        cos = freqs.cos() * self.mscale
+        sin = freqs.sin() * self.mscale
+        cache = paddle.concat((cos, sin), axis=-1)
+        return cache.cast(self._dtype)
+
+    def forward(
+        self,
+        position_ids: paddle.Tensor,
+        query: paddle.Tensor,
+        key: paddle.Tensor,
+    ) -> Tuple[paddle.Tensor, paddle.Tensor]:
+        import os
+
+        from paddlenlp_ops import fused_rotary_position_encoding
+
+        # In-place operations that update the query and key tensors.
+        os.environ["stride_in_no_check_dy2st_diff"] = "1"
+        fused_rotary_position_encoding(query, key, position_ids, self.cos_sin_cache, self.rotary_dim, False)
+
+        return query, key
+
+
+class DeepseekV2RMSNorm(nn.Layer):
+    def __init__(self, config: DeepseekV2Config):
+        super().__init__()
+        self.eps = config.rms_norm_eps
+        self.weight = paddle.create_parameter(
+            shape=[config.hidden_size],
+            dtype=paddle.get_default_dtype(),
+            default_initializer=nn.initializer.Constant(1.0),
+        )
+
+    def forward(self, x):
+        return paddle.incubate.nn.functional.fused_rms_norm(x, self.weight, None, self.eps, begin_norm_axis=1)[0]
+
+
+@register_base_model
+class DeepseekV2BlockInferenceModel(DeepseekV2PretrainedModel):
+    def __init__(self, config: DeepseekV2Config, base_model_prefix: str):
+        super(DeepseekV2PretrainedModel, self).__init__(config)
+        self.base_model_prefix = base_model_prefix
+
+        self.config = config
+
+        self.max_seq_len = config.max_seq_len
+
+        self.vocab_size = config.vocab_size
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.num_attention_heads = config.num_attention_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.num_layers = config.num_hidden_layers
+        self.rms_norm_eps = config.rms_norm_eps
+        self.quant_type = config.quant_type
+        self.rope_theta = config.rope_theta
+        self.return_full_hidden_states = config.get("return_full_hidden_states", False)
+
+        self.use_weight_only = False
+        if config.quant_type == "weight_only_int8":
+            self.use_weight_only = True
+            self.quant_algo = "weight_only_int8"
+        elif config.quant_type == "weight_only_int4":
+            self.use_weight_only = True
+            self.quant_algo = "weight_only_int4"
+
+        if self.use_weight_only:
+            assert (
+                self.quant_type == "weight_only_int8" or self.quant_type == "weight_only_int4"
+            ), f"Expected quant_type equal to 'weight_only_int8' or 'weight_only_int4', but received {self.quant_type}"
+
+        self.first_k_dense_replace = config.first_k_dense_replace
+        self.n_routed_experts = config.n_routed_experts
+
+        if config.tensor_parallel_degree > config.n_routed_experts:
+            raise ValueError(
+                f"Tensor parallel size {config.tensor_parallel_degree} is greater than "
+                f"the number of experts {config.n_routed_experts}."
+            )
+
+        if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0:
+            self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding(
+                self.vocab_size,
+                self.hidden_size,
+                weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()),
+            )
+        else:
+            self.embed_tokens = nn.Embedding(
+                self.vocab_size,
+                self.hidden_size,
+            )
+
+        self.norm = DeepseekV2RMSNorm(config)
+
+        scaling_factor = config.rope_scaling.get("factor", 1)
+        original_max_position = config.rope_scaling.get("original_max_position_embeddings", 4096)
+        extra_kwargs = {
+            k: v
+            for k, v in config.rope_scaling.items()
+            if k in ("extrapolation_factor", "attn_factor", "beta_fast", "beta_slow", "mscale", "mscale_all_dim")
+        }
+        self.rotary_emb = DeepseekScalingRotaryEmbedding(
+            config.qk_rope_head_dim,
+            original_max_position,
+            config.rope_theta,
+            scaling_factor,
+            **extra_kwargs,
+        )
+
+        # get ring_id
+        ring_id = -1
+        try:
+            hcg = fleet.get_hybrid_communicate_group()
+            model_parallel_group = hcg.get_model_parallel_group()
+            ring_id = model_parallel_group.id
+        except:
+            pass
+
+        ln_scale_attrs = [
+            paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ln_scale") for idx in range(self.num_layers)
+        ]
+
+        q_a_proj_weight_attrs = None
+        q_a_layernorm_weight_attrs = None
+        q_b_proj_weight_attrs = None
+        q_proj_weight_attrs = None
+
+        if self.config.q_lora_rank is not None:
+            q_a_proj_weight_attrs = [
+                paddle.ParamAttr(
+                    name=f"fuse{self.base_model_prefix}.{idx}.q_a_proj_weight",
+                    initializer=paddle.nn.initializer.Constant(value=0),
+                )
+                for idx in range(self.num_layers)
+            ]
+            q_a_layernorm_weight_attrs = [
+                paddle.ParamAttr(
+                    name=f"fuse{self.base_model_prefix}.{idx}.q_a_layernorm_weight",
+                    initializer=paddle.nn.initializer.Constant(value=1.0),
+                )
+                for idx in range(self.num_layers)
+            ]
+            q_b_proj_weight_attrs = [
+                paddle.ParamAttr(
+                    name=f"fuse{self.base_model_prefix}.{idx}.q_b_proj_weight",
+                    initializer=paddle.nn.initializer.Constant(value=0),
+                )
+                for idx in range(self.num_layers)
+            ]
+        else:
+            q_proj_weight_attrs = [
+                paddle.ParamAttr(
+                    name=f"fuse{self.base_model_prefix}.{idx}.q_proj_weight",
+                    initializer=paddle.nn.initializer.Constant(value=0),
+                )
+                for idx in range(self.num_layers)
+            ]
+
+        kv_a_proj_with_mqa_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.kv_a_proj_with_mqa_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+        kv_a_layernorm_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.kv_a_layernorm_weight",
+                initializer=paddle.nn.initializer.Constant(value=1.0),
+            )
+            for idx in range(self.num_layers)
+        ]
+        kv_b_proj_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.kv_b_proj_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+
+        out_proj_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.out_proj_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+        ffn_ln_scale_attrs = [
+            paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ffn_ln_scale") for idx in range(self.num_layers)
+        ]
+        ffn1_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.ffn1_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+        ffn2_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.ffn2_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+        gate_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.gate_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+
+        e_score_correction_bias_attrs = None
+        if self.base_model_prefix.startswith("deepseek_v3"):
+            e_score_correction_bias_attrs = [
+                paddle.ParamAttr(
+                    name=f"fuse{self.base_model_prefix}.{idx}.e_score_correction_bias",
+                    initializer=paddle.nn.initializer.Constant(value=0),
+                )
+                if idx >= self.config.first_k_dense_replace
+                else None
+                for idx in range(self.num_layers)
+            ]
+
+        shared_expert_ffn1_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn1_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+        shared_expert_ffn2_weight_attrs = [
+            paddle.ParamAttr(
+                name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn2_weight",
+                initializer=paddle.nn.initializer.Constant(value=0),
+            )
+            for idx in range(self.num_layers)
+        ]
+
+        q_proj_weight_scale_attrs = None
+        q_a_proj_weight_scale_attrs = None
+        q_b_proj_weight_scale_attrs = None
+        kv_a_proj_with_mqa_weight_scale_attrs = None
+        kv_b_proj_weight_scale_attrs = None
+
+        out_proj_weight_scale_attrs = None
+        ffn1_weight_scale_attrs = None
+        ffn2_weight_scale_attrs = None
+        shared_expert_ffn1_weight_scale_attrs = None
+        shared_expert_ffn2_weight_scale_attrs = None
+
+        if self.use_weight_only:
+            if self.config.q_lora_rank is not None:
+                q_proj_weight_scale_attrs = [
+                    paddle.ParamAttr(
+                        name=f"fuse{self.base_model_prefix}.{idx}.q_a_proj_weight_scale",
+                    )
+                    for idx in range(self.num_layers)
+                ]
+                q_b_proj_weight_scale_attrs = [
+                    paddle.ParamAttr(
+                        name=f"fuse{self.base_model_prefix}.{idx}.q_b_proj_weight_scale",
+                    )
+                    for idx in range(self.num_layers)
+                ]
+            else:
+                q_proj_weight_scale_attrs = [
+                    paddle.ParamAttr(
+                        name=f"fuse{self.base_model_prefix}.{idx}.q_proj_weight_scale",
+                    )
+                    for idx in range(self.num_layers)
+                ]
+
+            kv_a_proj_with_mqa_weight_scale_attrs = [
+                paddle.ParamAttr(
+                    name=f"fuse{self.base_model_prefix}.{idx}.kv_a_proj_with_mqa_weight_scale",
+                )
+                for idx in range(self.num_layers)
+            ]
+            kv_b_proj_weight_scale_attrs = [
+                paddle.ParamAttr(
+                    name=f"fuse{self.base_model_prefix}.{idx}.kv_b_proj_weight_scale",
+                )
+                for idx in range(self.num_layers)
+            ]
+
+            out_proj_weight_scale_attrs = [
+                paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.out_proj_weight_scale")
+                for idx in range(self.num_layers)
+            ]
+            ffn1_weight_scale_attrs = [
+                paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ffn1_weight_scale")
+                for idx in range(self.num_layers)
+            ]
+            ffn2_weight_scale_attrs = [
+                paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ffn2_weight_scale")
+                for idx in range(self.num_layers)
+            ]
+            shared_expert_ffn1_weight_scale_attrs = [
+                paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn1_weight_scale")
+                for idx in range(self.num_layers)
+            ]
+            shared_expert_ffn2_weight_scale_attrs = [
+                paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn2_weight_scale")
+                for idx in range(self.num_layers)
+            ]
+
+        mla_config = MLAConfig(
+            q_lora_rank=self.config.q_lora_rank,
+            kv_lora_rank=self.config.kv_lora_rank,
+            qk_nope_head_dim=self.config.qk_nope_head_dim,
+            qk_rope_head_dim=self.config.qk_rope_head_dim,
+            v_head_dim=self.config.v_head_dim,
+            mscale=yarn_get_mscale(scaling_factor, float(config.rope_scaling.get("mscale_all_dim", 1.0))),
+            q_proj_weight_attrs=q_proj_weight_attrs,
+            q_proj_weight_scale_attrs=q_proj_weight_scale_attrs,
+            q_a_proj_weight_attrs=q_a_proj_weight_attrs,
+            q_a_proj_weight_scale_attrs=q_a_proj_weight_scale_attrs,
+            q_a_layernorm_weight_attrs=q_a_layernorm_weight_attrs,
+            q_b_proj_weight_attrs=q_b_proj_weight_attrs,
+            q_b_proj_weight_scale_attrs=q_b_proj_weight_scale_attrs,
+            kv_a_proj_with_mqa_weight_attrs=kv_a_proj_with_mqa_weight_attrs,
+            kv_a_proj_with_mqa_weight_scale_attrs=kv_a_proj_with_mqa_weight_scale_attrs,
+            kv_a_layernorm_weight_attrs=kv_a_layernorm_weight_attrs,
+            kv_b_proj_weight_attrs=kv_b_proj_weight_attrs,
+            kv_b_proj_weight_scale_attrs=kv_b_proj_weight_scale_attrs,
+        )
+
+        moe_config = MoeConfig(
+            num_experts=self.n_routed_experts,
+            top_k=self.config.num_experts_per_tok,
+            topk_group=self.config.topk_group,
+            norm_topk_prob=self.config.norm_topk_prob,
+            routed_scaling_factor=self.config.routed_scaling_factor,
+            num_expert_group=self.config.n_group,
+            topk_method=self.config.topk_method,
+            moe_intermediate_size=self.config.moe_intermediate_size,
+            first_k_dense_replace=self.first_k_dense_replace,
+            shared_expert_with_gate=False,
+            shared_expert_intermediate_size=self.config.moe_intermediate_size * self.config.n_shared_experts,
+            shared_expert_ffn1_weight_attrs=shared_expert_ffn1_weight_attrs,
+            shared_expert_ffn1_weight_scale_attrs=shared_expert_ffn1_weight_scale_attrs,
+            shared_expert_ffn2_weight_attrs=shared_expert_ffn2_weight_attrs,
+            shared_expert_ffn2_weight_scale_attrs=shared_expert_ffn2_weight_scale_attrs,
+        )
+
+        speculate_config = SpeculateConfig(
+            speculate_method=config.get("speculate_method", None),
+            speculate_max_draft_token_num=config.get("speculate_max_draft_token_num", 5),
+            return_full_hidden_states=config.get("return_full_hidden_states", False),
+        )
+
+        transformer_config = FusedMultiTransformerConfig(
+            embed_dim=self.hidden_size,
+            num_heads=self.num_attention_heads,
+            kv_num_heads=self.num_key_value_heads,
+            intermediate_size=self.intermediate_size,
+            quant_type=self.quant_type,
+            activation="swiglu",
+            num_layers=config.num_hidden_layers,
+            nranks=config.tensor_parallel_degree,
+            ring_id=ring_id,
+            ln_scale_attrs=ln_scale_attrs,
+            linear_weight_attrs=out_proj_weight_attrs,
+            linear_weight_scale_attrs=out_proj_weight_scale_attrs,
+            ffn_ln_scale_attrs=ffn_ln_scale_attrs,
+            gate_weight_attrs=gate_weight_attrs,
+            ffn1_weight_attrs=ffn1_weight_attrs,
+            ffn1_weight_scale_attrs=ffn1_weight_scale_attrs,
+            ffn2_weight_attrs=ffn2_weight_attrs,
+            ffn2_weight_scale_attrs=ffn2_weight_scale_attrs,
+            e_score_correction_bias_attrs=e_score_correction_bias_attrs,
+            epsilon=self.rms_norm_eps,
+            rope_theta=self.rope_theta,
+            rotary_emb=self.rotary_emb,
+            norm_type="rmsnorm",
+            rank_id=config.tensor_parallel_rank,
+            moe_config=moe_config,
+            mla_config=mla_config,
+            append_attn=config.append_attn,
+            speculate_config=speculate_config,
+        )
+
+        self.set_transformer_block(transformer_config)
+
+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    @paddle.no_grad()
+    def set_state_dict(self, state_dict):
+        self.transformer_block.init_weight()
+
+        dtype = paddle.get_default_dtype()
+        embed_tokens_weight = paddle.to_tensor(state_dict[f"{self.base_model_prefix}.embed_tokens.weight"]).cast(
+            self.embed_tokens.weight.dtype
+        )
+        norm_weight = paddle.to_tensor(state_dict[f"{self.base_model_prefix}.norm.weight"]).cast(
+            self.norm.weight.dtype
+        )
+        self.embed_tokens.weight.set_value(embed_tokens_weight)
+        self.norm.weight.set_value(norm_weight)
+
+        if self.use_weight_only:
+            logger.info("weight only is enabled")
+        for idx in range(self.num_layers):
+            logger.info(f"set state for layer {idx}")
+
+            ln_scale = paddle.to_tensor(
+                state_dict[f"{self.base_model_prefix}.layers.{idx}.input_layernorm.weight"]
+            ).cast(self.transformer_block.ln_scales[idx].dtype)
+            self.transformer_block.ln_scales[idx].set_value(ln_scale)
+
+            if self.config.q_lora_rank is not None:
+                q_a_proj_weight = paddle.to_tensor(
+                    state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_a_proj.weight"]
+                ).cast(dtype)
+                q_a_layernorm_weight = paddle.to_tensor(
+                    state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_a_layernorm.weight"]
+                ).cast(self.transformer_block.q_a_layernorm_weights[idx].dtype)
+                q_b_proj_weight = paddle.to_tensor(
+                    state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_b_proj.weight"]
+                ).cast(dtype)
+
+                if self.use_weight_only:
+                    q_a_proj_quanted_weight, q_a_proj_weight_scale = weight_quantize(
+                        q_a_proj_weight, algo=self.quant_algo
+                    )
+                    self.transformer_block.q_a_proj_weights[idx].set_value(q_a_proj_quanted_weight)
+                    self.transformer_block.q_a_proj_weights_scale[idx].set_value(q_a_proj_weight_scale)
+
+                    q_b_proj_quanted_weight, q_b_proj_weight_scale = weight_quantize(
+                        q_b_proj_weight, algo=self.quant_algo
+                    )
+                    self.transformer_block.q_b_proj_weights[idx].set_value(q_b_proj_quanted_weight)
+                    self.transformer_block.q_a_layernorm_weights[idx].set_value(q_a_layernorm_weight)
+                    self.transformer_block.q_b_proj_weights_scale[idx].set_value(q_b_proj_weight_scale)
+                else:
+                    self.transformer_block.q_a_proj_weights[idx].set_value(q_a_proj_weight)
+                    self.transformer_block.q_a_layernorm_weights[idx].set_value(q_a_layernorm_weight)
+                    self.transformer_block.q_b_proj_weights[idx].set_value(q_b_proj_weight)
+            else:
+                q_proj_weight = paddle.to_tensor(
+                    state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_proj.weight"]
+                ).cast(dtype)
+
+                if self.use_weight_only:
+                    q_proj_quanted_weight, q_proj_weight_scale = weight_quantize(q_proj_weight, algo=self.quant_algo)
+                    self.transformer_block.q_proj_weights[idx].set_value(q_proj_quanted_weight)
+                    self.transformer_block.q_proj_weights_scale[idx].set_value(q_proj_weight_scale)
+                else:
+                    self.transformer_block.q_proj_weights[idx].set_value(q_proj_weight)
+
+            kv_a_proj_with_mqa_weight = paddle.to_tensor(
+                state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.kv_a_proj_with_mqa.weight"]
+            ).cast(dtype)
+            kv_a_layernorm_weight = paddle.to_tensor(
+                state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.kv_a_layernorm.weight"]
+            ).cast(self.transformer_block.kv_a_layernorm_weights[idx].dtype)
+            kv_b_proj_weight = paddle.to_tensor(
+                state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.kv_b_proj.weight"]
+            ).cast(dtype)
+
+            if self.use_weight_only:
+                kv_a_proj_with_mqa_quanted_weight, kv_a_proj_with_mqa_weight_scale = weight_quantize(
+                    kv_a_proj_with_mqa_weight, algo=self.quant_algo
+                )
+                self.transformer_block.kv_a_proj_with_mqa_weights[idx].set_value(kv_a_proj_with_mqa_quanted_weight)
+                self.transformer_block.kv_a_proj_with_mqa_weights_scale[idx].set_value(kv_a_proj_with_mqa_weight_scale)
+
+                kv_b_proj_quanted_weight, kv_b_proj_weight_scale = weight_quantize(
+                    kv_b_proj_weight, algo=self.quant_algo
+                )
+                self.transformer_block.kv_b_proj_weights[idx].set_value(kv_b_proj_quanted_weight)
+                self.transformer_block.kv_a_layernorm_weights[idx].set_value(kv_a_layernorm_weight)
+                self.transformer_block.kv_b_proj_weights_scale[idx].set_value(kv_b_proj_weight_scale)
+            else:
+                self.transformer_block.kv_a_proj_with_mqa_weights[idx].set_value(kv_a_proj_with_mqa_weight)
+                self.transformer_block.kv_a_layernorm_weights[idx].set_value(kv_a_layernorm_weight)
+                self.transformer_block.kv_b_proj_weights[idx].set_value(kv_b_proj_weight)
+
+            linear_weight = paddle.to_tensor(
+                state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.o_proj.weight"]
+            ).cast(dtype)
+
+            if self.use_weight_only:
+                linear_quanted_weight, linear_weight_scale = weight_quantize(linear_weight, algo=self.quant_algo)
+                self.transformer_block.linear_weights[idx].set_value(linear_quanted_weight)
+                self.transformer_block.linear_weights_scale[idx].set_value(linear_weight_scale)
+            else:
+                self.transformer_block.linear_weights[idx].set_value(linear_weight)
+
+            ffn_ln_scale = paddle.to_tensor(
+                state_dict[f"{self.base_model_prefix}.layers.{idx}.post_attention_layernorm.weight"],
+            ).cast(
+                self.transformer_block.ffn_ln_scales[idx].dtype,
+            )
+            self.transformer_block.ffn_ln_scales[idx].set_value(ffn_ln_scale)
+            if idx < self.first_k_dense_replace:
+                concated_ffn1_weight = np.concatenate(
+                    [
+                        state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.gate_proj.weight"],
+                        state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.up_proj.weight"],
+                    ],
+                    axis=-1,
+                )
+                ffn1_weight_tensor = paddle.to_tensor(concated_ffn1_weight).cast(paddle.get_default_dtype())
+
+                if self.use_weight_only:
+                    ffn1_quanted_weight_tensor, ffn1_weight_scale_tensor = weight_quantize(
+                        ffn1_weight_tensor, algo=self.quant_algo
+                    )
+                    self.transformer_block.ffn1_weights[idx].set_value(ffn1_quanted_weight_tensor)
+                    self.transformer_block.ffn1_weights_scale[idx].set_value(ffn1_weight_scale_tensor)
+                else:
+                    self.transformer_block.ffn1_weights[idx].set_value(ffn1_weight_tensor)
+
+                ffn2_weight_tensor = paddle.to_tensor(
+                    state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.down_proj.weight"]
+                ).cast(paddle.get_default_dtype())
+                if self.use_weight_only:
+                    ffn2_quanted_weight_tensor, ffn2_weight_scale_tensor = weight_quantize(
+                        ffn2_weight_tensor, algo=self.quant_algo
+                    )
+                    self.transformer_block.ffn2_weights[idx].set_value(ffn2_quanted_weight_tensor)
+                    self.transformer_block.ffn2_weights_scale[idx].set_value(ffn2_weight_scale_tensor)
+                else:
+                    self.transformer_block.ffn2_weights[idx].set_value(ffn2_weight_tensor)
+            else:
+                ffn1_weights = []
+                ffn2_weights = []
+                ffn1_scales = []
+                ffn2_scales = []
+
+                for expert_idx in range(self.n_routed_experts):
+                    concated_gate_up_weight = np.concatenate(
+                        [
+                            state_dict[
+                                f"{self.base_model_prefix}.layers.{idx}.mlp.experts.{expert_idx}.gate_proj.weight"
+                            ],
+                            state_dict[
+                                f"{self.base_model_prefix}.layers.{idx}.mlp.experts.{expert_idx}.up_proj.weight"
+                            ],
+                        ],
+                        axis=-1,
+                    )
+                    ffn1_weight = paddle.to_tensor(concated_gate_up_weight).cast(dtype)
+                    ffn2_weight = paddle.to_tensor(
+                        state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.experts.{expert_idx}.down_proj.weight"]
+                    ).cast(dtype)
+
+                    if self.use_weight_only:
+                        ffn1_quanted_weight, ffn1_weight_scale = weight_quantize(ffn1_weight, algo=self.quant_algo)
+                        ffn2_quanted_weight, ffn2_weight_scale = weight_quantize(ffn2_weight, algo=self.quant_algo)
+                        ffn1_weights.append(ffn1_quanted_weight.reshape([self.transformer_block.config.embed_dim, -1]))
+                        ffn2_weights.append(ffn2_quanted_weight.reshape([-1, self.transformer_block.config.embed_dim]))
+                        ffn1_scales.append(ffn1_weight_scale)
+                        ffn2_scales.append(ffn2_weight_scale)
+                    else:
+                        ffn1_weights.append(ffn1_weight)
+                        ffn2_weights.append(ffn2_weight)
+
+                fused_moe_ffn1_weight = paddle.to_tensor(ffn1_weights)
+                fused_moe_ffn2_weight = paddle.to_tensor(ffn2_weights)
+                fused_moe_ffn1_weight_scale = paddle.to_tensor(ffn1_scales)
+                fused_moe_ffn2_weight_scale = paddle.to_tensor(ffn2_scales)
+                gate_weight = paddle.to_tensor(
+                    state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.gate.weight"]
+                ).cast("float32")
+
+                if self.base_model_prefix.startswith("deepseek_v3"):
+                    e_score_correction_bias = paddle.to_tensor(
+                        state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.gate.e_score_correction_bias"]
+                    ).cast("float32")
+                    self.transformer_block.e_score_correction_biases[idx].set_value(e_score_correction_bias)
+
+                self.transformer_block.ffn1_weights[idx].set_value(fused_moe_ffn1_weight)
+                self.transformer_block.ffn2_weights[idx].set_value(fused_moe_ffn2_weight)
+                self.transformer_block.gate_weights[idx].set_value(gate_weight)
+
+                if self.use_weight_only:
+                    self.transformer_block.ffn1_weights_scale[idx].set_value(fused_moe_ffn1_weight_scale)
+                    self.transformer_block.ffn2_weights_scale[idx].set_value(fused_moe_ffn2_weight_scale)
+
+                concated_gate_up_weight = np.concatenate(
+                    [
+                        state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.shared_experts.gate_proj.weight"],
+                        state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.shared_experts.up_proj.weight"],
+                    ],
+                    axis=-1,
+                )
+                shared_expert_ffn1_weight = paddle.to_tensor(concated_gate_up_weight).cast(dtype)
+                shared_expert_ffn2_weight = paddle.to_tensor(
+                    state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.shared_experts.down_proj.weight"]
+                ).cast(dtype)
+
+                if self.use_weight_only:
+                    shared_expert_ffn1_quanted_weight, shared_expert_ffn1_weight_scale = weight_quantize(
+                        shared_expert_ffn1_weight, algo=self.quant_algo
+                    )
+                    self.transformer_block.shared_expert_ffn1_weights[idx].set_value(shared_expert_ffn1_quanted_weight)
+                    self.transformer_block.shared_expert_ffn1_weights_scale[idx].set_value(
+                        shared_expert_ffn1_weight_scale
+                    )
+
+                    shared_expert_ffn2_quanted_weight, shared_expert_ffn2_weight_scale = weight_quantize(
+                        shared_expert_ffn2_weight, algo=self.quant_algo
+                    )
+                    self.transformer_block.shared_expert_ffn2_weights[idx].set_value(shared_expert_ffn2_quanted_weight)
+                    self.transformer_block.shared_expert_ffn2_weights_scale[idx].set_value(
+                        shared_expert_ffn2_weight_scale
+                    )
+                else:
+                    self.transformer_block.shared_expert_ffn1_weights[idx].set_value(shared_expert_ffn1_weight)
+                    self.transformer_block.shared_expert_ffn2_weights[idx].set_value(shared_expert_ffn2_weight)
+
+    def set_transformer_block(self, transformer_config):
+        if self.use_weight_only:
+            self.transformer_block = FusedBlockMultiTransformerWeightOnly(transformer_config)
+        else:
+            self.transformer_block = FusedBlockMultiTransformer(transformer_config)
+
+    def remove_padding(self, input_ids, seq_lens_this_time, draft_tokens=None, seq_lens_encoder=None):
+        cum_offsets_now = paddle.cumsum(self.max_seq_len - seq_lens_this_time)
+        token_num = paddle.sum(seq_lens_this_time)
+        from paddlenlp_ops import get_padding_offset_v2
+
+        ids_remove_padding, cum_offsets, padding_offset, cu_seqlens_q, cu_seqlens_k = get_padding_offset_v2(
+            input_ids, cum_offsets_now, token_num, seq_lens_this_time, draft_tokens, seq_lens_encoder
+        )
+        return ids_remove_padding, padding_offset, cum_offsets, cu_seqlens_q, cu_seqlens_k
+
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        inputs_embeds=None,
+        caches=None,
+        pre_caches=None,
+        **kwargs,
+    ):
+
+        seq_lens_this_time = kwargs.get("seq_lens_this_time", None)
+        draft_tokens = kwargs.get("draft_tokens", None)
+        seq_lens_encoder = kwargs.get("seq_lens_encoder", None)
+
+        ids_remove_padding, padding_offset, cum_offsets, cu_seqlens_q, cu_seqlens_k = self.remove_padding(
+            input_ids, seq_lens_this_time, draft_tokens, seq_lens_encoder
+        )
+
+        kwargs["cu_seqlens_q"] = cu_seqlens_q
+        kwargs["cu_seqlens_k"] = cu_seqlens_k
+        kwargs["padding_offsets"] = padding_offset
+        kwargs["max_input_length"] = self.max_seq_len
+
+        inputs_embeds = self.embed_tokens(ids_remove_padding)
+
+        with dy2st_nocheck_guard_context():
+            hidden_states, _ = self.transformer_block(
+                input_ids=input_ids,
+                src=inputs_embeds,
+                cum_offsets=cum_offsets,
+                attn_mask=attention_mask,
+                caches=caches,
+                pre_caches=pre_caches,
+                rotary_embs=None,
+                **kwargs,
+            )
+        hidden_states = self.norm(hidden_states)
+
+        return BaseModelOutputWithPastAndCrossAttentions(
+            last_hidden_state=hidden_states,
+            past_key_values=None,
+            hidden_states=None,
+            attentions=None,
+            cum_offsets=cum_offsets,
+        )
+
+
+@register_base_model
+class MTPDeepseekV2BlockInferenceModel(DeepseekV2BlockInferenceModel):
+    def __init__(self, config: DeepseekV2Config, base_model_prefix: str):
+        super().__init__(config, base_model_prefix)
+        from paddle.distributed.fleet.layers.mpu.mp_layers import ColumnParallelLinear
+
+        self.enorm = DeepseekV2RMSNorm(config)
+        self.hnorm = DeepseekV2RMSNorm(config)
+        self.norm = DeepseekV2RMSNorm(config)
+
+        if config.tensor_parallel_degree > 1:
+            self.eh_proj = ColumnParallelLinear(
+                self.hidden_size * 2, self.hidden_size, has_bias=True, gather_output=True, fuse_matmul_bias=True
+            )
+        else:
+            self.eh_proj = nn.Linear(self.hidden_size * 2, self.hidden_size, bias_attr=True)
+
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        inputs_embeds=None,
+        caches=None,
+        pre_caches=None,
+        output_attentions=False,
+        output_hidden_states=None,
+        return_dict=False,
+        **kwargs,
+    ):
+        seq_lens_this_time = kwargs.get("seq_lens_this_time", None)
+        rope_emb = kwargs.get("rope_emb", None)
+        draft_tokens = kwargs.get("draft_tokens", None)
+        seq_lens_encoder = kwargs.get("seq_lens_encoder", None)
+        pre_hidden_states = kwargs.get("pre_hidden_states", None)
+        ids_remove_padding, padding_offset, cum_offsets, cu_seqlens_q, cu_seqlens_k = self.remove_padding(
+            input_ids, seq_lens_this_time, draft_tokens, seq_lens_encoder
+        )
+
+        kwargs["cu_seqlens_q"] = cu_seqlens_q
+        kwargs["cu_seqlens_k"] = cu_seqlens_k
+        kwargs["padding_offsets"] = padding_offset
+        kwargs["max_input_length"] = self.max_seq_len
+
+        inputs_embeds = self.embed_tokens(ids_remove_padding)
+        inputs_embeds = paddle.concat([self.enorm(inputs_embeds), self.hnorm(pre_hidden_states)], axis=-1)
+        inputs_embeds = self.eh_proj(inputs_embeds)
+
+        with dy2st_nocheck_guard_context():
+            hidden_states, _ = self.transformer_block(
+                input_ids=input_ids,
+                src=inputs_embeds,
+                cum_offsets=cum_offsets,
+                attn_mask=attention_mask,
+                caches=caches,
+                pre_caches=pre_caches,
+                rotary_embs=rope_emb,
+                post_rebuild_padding=True,
+                **kwargs,
+            )
+        hidden_states = self.norm(hidden_states)
+
+        return BaseModelOutputWithPastAndCrossAttentions(
+            last_hidden_state=hidden_states,
+            past_key_values=None,
+            hidden_states=None,
+            attentions=None,
+        )
+
+
+class DeepseekV2ForCausalLMBlockInferenceModel(GenerationBlockInferenceModel, DeepseekV2PretrainedModel):
+    """
+    Dynamic Batching for DeepseekV2 Model with pretraining tasks on top.
+    """
+
+    _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
+
+    def __init__(self, config: DeepseekV2Config, base_model_prefix: str = "deepseek_v2"):
+        super().__init__(config)
+        self.base_model_prefix = base_model_prefix
+
+        self.max_candidate_len = config.get("speculate_max_candidate_len", 5)
+        self.verify_window = config.get("speculate_verify_window", 2)
+        self.max_seq_len = config.max_seq_len
+        self.return_full_hidden_states = config.get("return_full_hidden_states", False)
+
+        self.deepseek_v2 = DeepseekV2BlockInferenceModel(config, base_model_prefix)
+        if config.tie_word_embeddings:
+            self.lm_head = DeepseekV2LMHead(
+                config, embedding_weights=self.deepseek_v2.embed_tokens.weight, transpose_y=True
+            )
+            self.tie_weights()
+        else:
+            self.lm_head = DeepseekV2LMHead(config)
+
+    @classmethod
+    def _get_tensor_parallel_mappings(cls, config: DeepseekV2Config, is_split=True):
+
+        logger.info("DeepseekV2 inference model _get_tensor_parallel_mappings")
+
+        from paddlenlp.transformers.conversion_utils import split_or_merge_func
+
+        fn = split_or_merge_func(
+            is_split=is_split,
+            tensor_parallel_degree=config.tensor_parallel_degree,
+            tensor_parallel_rank=config.tensor_parallel_rank,
+            num_attention_heads=config.num_attention_heads,
+        )
+
+        def get_tensor_parallel_split_mappings(num_layers):
+            final_actions = {}
+
+            base_actions = {
+                "lm_head.weight": partial(fn, is_column=True),
+                "eh_proj.weight": partial(fn, is_column=True),
+                # Row Linear
+                "embed_tokens.weight": partial(fn, is_column=False),
+                "layers.0.self_attn.o_proj.weight": partial(fn, is_column=False),
+            }
+
+            # Column Linear
+            base_actions["layers.0.self_attn.q_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.self_attn.q_b_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.self_attn.kv_b_proj.weight"] = partial(fn, is_column=True)
+
+            base_actions["layers.0.mlp.gate_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.mlp.up_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.mlp.down_proj.weight"] = partial(fn, is_column=False)
+
+            for expert_idx in range(config.n_routed_experts):
+                base_actions[f"layers.0.mlp.experts.{expert_idx}.up_proj.weight"] = partial(fn, is_column=True)
+                base_actions[f"layers.0.mlp.experts.{expert_idx}.gate_proj.weight"] = partial(fn, is_column=True)
+                base_actions[f"layers.0.mlp.experts.{expert_idx}.down_proj.weight"] = partial(fn, is_column=False)
+            base_actions["layers.0.mlp.shared_experts.up_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.mlp.shared_experts.gate_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.mlp.shared_experts.down_proj.weight"] = partial(fn, is_column=False)
+
+            # MTP parts
+            base_actions["layers.61.embed_tokens.weight"] = partial(fn, is_column=False)
+            base_actions["layers.61.eh_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.61.shared_head.head.weight"] = partial(fn, is_column=True)
+
+            for key, action in base_actions.items():
+                if "layers.0." in key:
+                    for i in range(num_layers):
+                        final_actions[key.replace("layers.0.", f"layers.{i}.")] = action
+                final_actions[key] = action
+
+            return final_actions
+
+        mappings = get_tensor_parallel_split_mappings(config.num_hidden_layers)
+
+        return mappings
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
+        return infererence_model_from_pretrained(cls, pretrained_model_name_or_path, args, kwargs)
+
+    @classmethod
+    def get_cache_kvs_shape(
+        cls, config: DeepseekV2Config, max_batch_size: int = None, max_length: int = None
+    ) -> list[list[int]]:
+        """get cache_kvs tensor for DeepseekV2 model
+
+        Args:
+            max_batch_size (int): the max batch size
+            max_length (int | None, optional): the max_length of cache_kvs. Defaults to None.
+
+        Returns:
+            list[paddle.Tensor]: the list tensor shape for cache
+        """
+        max_block_per_seq = (config.max_seq_len + config.block_size - 1) // config.block_size
+        if max_batch_size == -1:
+            max_block_nums = None
+        else:
+            max_block_nums = max_batch_size * max_block_per_seq
+
+        cache_kvs = []
+        for _ in range(config.num_hidden_layers):
+            cache_k_shape = [
+                max_block_nums,
+                config.num_key_value_heads // max(config.tensor_parallel_degree, 1),
+                config.block_size,
+                config.qk_nope_head_dim + config.qk_rope_head_dim,
+            ]
+            cache_v_shape = [
+                max_block_nums,
+                config.num_key_value_heads // max(config.tensor_parallel_degree, 1),
+                config.block_size,
+                config.v_head_dim,
+            ]
+            cache_kvs.append(cache_k_shape)
+            cache_kvs.append(cache_v_shape)
+        return cache_kvs
+
+    def prepare_inputs_for_generation(self, **kwargs):
+        # only last token for inputs_ids if cache is defined in kwargs
+        input_ids = kwargs["input_ids"]
+        src_mask = kwargs.get("src_mask", None)
+        block_tables = kwargs.get("block_tables", None)
+
+        pre_caches = kwargs.get("pre_caches", None)
+        caches = kwargs.get("caches", None)
+
+        seq_lens_this_time = kwargs["seq_lens_this_time"]
+        seq_lens_encoder = kwargs["seq_lens_encoder"]
+        seq_lens_decoder = kwargs["seq_lens_decoder"]
+        k_quant_scales = kwargs.get("k_quant_scales", None)
+        v_quant_scales = kwargs.get("v_quant_scales", None)
+        k_dequant_scales = kwargs.get("k_dequant_scales", None)
+        v_dequant_scales = kwargs.get("v_dequant_scales", None)
+
+        # speculative decoding related parameters
+        draft_tokens = kwargs.get("draft_tokens", None)
+        output_padding_offset = kwargs.get("output_padding_offset", None)
+
+        model_inputs = {
+            "input_ids": input_ids,
+            "src_mask": src_mask,
+            "rope_emb": None,
+            "pre_caches": pre_caches,
+            "caches": caches,
+            "seq_lens_this_time": seq_lens_this_time,
+            "seq_lens_encoder": seq_lens_encoder,
+            "seq_lens_decoder": seq_lens_decoder,
+            "block_tables": block_tables,
+            "k_quant_scales": k_quant_scales,
+            "v_quant_scales": v_quant_scales,
+            "k_dequant_scales": k_dequant_scales,
+            "v_dequant_scales": v_dequant_scales,
+            "draft_tokens": draft_tokens,
+            "output_padding_offset": output_padding_offset,
+        }
+        return model_inputs
+
+    def forward(
+        self,
+        input_ids,
+        src_mask=None,
+        pre_caches=None,
+        caches=None,
+        seq_lens_this_time=None,
+        seq_lens_encoder=None,
+        seq_lens_decoder=None,
+        rope_emb=None,
+        block_tables=None,
+        k_quant_scales=None,
+        v_quant_scales=None,
+        k_dequant_scales=None,
+        v_dequant_scales=None,
+        draft_tokens=None,
+        output_padding_offset=None,
+    ):
+        outputs = self.deepseek_v2(
+            input_ids,
+            src_mask=src_mask,
+            caches=caches,
+            rope_emb=None,
+            block_tables=block_tables,
+            pre_caches=pre_caches,
+            seq_lens_this_time=seq_lens_this_time,
+            seq_lens_encoder=seq_lens_encoder,
+            seq_lens_decoder=seq_lens_decoder,
+            k_quant_scales=k_quant_scales,
+            v_quant_scales=v_quant_scales,
+            k_dequant_scales=k_dequant_scales,
+            v_dequant_scales=v_dequant_scales,
+            draft_tokens=draft_tokens,
+            output_padding_offset=output_padding_offset,
+        )
+        if self.return_full_hidden_states:
+            from paddlenlp_ops import rebuild_padding_v2
+
+            full_hidden_states = outputs[0]
+            cum_offsets = outputs[1]
+            hidden_states = rebuild_padding_v2(
+                full_hidden_states,
+                cum_offsets,
+                seq_lens_decoder,
+                seq_lens_encoder,
+                output_padding_offset,
+                self.max_seq_len,
+            )
+        else:
+            hidden_states = outputs[0]
+        logits = self.lm_head(
+            hidden_states,
+            tensor_parallel_output=False,
+        )
+        if self.return_full_hidden_states:
+            return logits, full_hidden_states
+        else:
+            return logits
+
+        return logits
+
+    @paddle.no_grad()
+    def set_state_dict(self, state_dict):
+        if "lm_head.weight" in state_dict:
+            self.lm_head.weight.set_value(
+                paddle.to_tensor(state_dict["lm_head.weight"]).cast(self.lm_head.weight.dtype)
+            )
+        self.deepseek_v2.set_state_dict({k: state_dict[k] for k in state_dict.keys()})
+
+
+class MTPDeepseekV2ForCausalLMBlockInferenceModel(DeepseekV2ForCausalLMBlockInferenceModel):
+    def __init__(self, config, base_model_prefix):
+        super(DeepseekV2ForCausalLMBlockInferenceModel, self).__init__(config, base_model_prefix="deepseek_v3_mtp")
+        self.max_candidate_len = config.get("speculate_max_candidate_len", 5)
+        self.verify_window = config.get("speculate_verify_window", 2)
+        self.max_seq_len = config.max_seq_len
+
+        self.mtp = MTPDeepseekV2BlockInferenceModel(config, base_model_prefix="deepseek_v3_mtp")
+        self.tensor_parallel_rank = config.tensor_parallel_rank
+        if config.tie_word_embeddings:
+            self.lm_head = DeepseekV2LMHead(config, embedding_weights=self.llama.embed_tokens.weight, transpose_y=True)
+            self.tie_weights()
+        else:
+            self.lm_head = DeepseekV2LMHead(config)
+
+    def prepare_inputs_for_generation(self, **kwargs):
+        # only last token for inputs_ids if cache is defined in kwargs
+        input_ids = kwargs["input_ids"]
+        src_mask = kwargs.get("src_mask", None)
+        block_tables = kwargs.get("block_tables", None)
+
+        pre_caches = kwargs.get("pre_caches", None)
+        caches = kwargs.get("caches", None)
+
+        seq_lens_this_time = kwargs["seq_lens_this_time"]
+        seq_lens_encoder = kwargs["seq_lens_encoder"]
+        seq_lens_decoder = kwargs["seq_lens_decoder"]
+        k_quant_scales = kwargs.get("k_quant_scales", None)
+        v_quant_scales = kwargs.get("v_quant_scales", None)
+        k_dequant_scales = kwargs.get("k_dequant_scales", None)
+        v_dequant_scales = kwargs.get("v_dequant_scales", None)
+
+        # speculative decoding related parameters
+        draft_tokens = kwargs.get("draft_tokens", None)
+        output_padding_offset = kwargs.get("output_padding_offset", None)
+        hidden_states = kwargs.get("hidden_states", None)
+
+        model_inputs = {
+            "input_ids": input_ids,
+            "src_mask": src_mask,
+            "rope_emb": None,
+            "pre_caches": pre_caches,
+            "caches": caches,
+            "seq_lens_this_time": seq_lens_this_time,
+            "seq_lens_encoder": seq_lens_encoder,
+            "seq_lens_decoder": seq_lens_decoder,
+            "block_tables": block_tables,
+            "k_quant_scales": k_quant_scales,
+            "v_quant_scales": v_quant_scales,
+            "k_dequant_scales": k_dequant_scales,
+            "v_dequant_scales": v_dequant_scales,
+            "draft_tokens": draft_tokens,
+            "output_padding_offset": output_padding_offset,
+            "pre_hidden_states": hidden_states,
+        }
+        return model_inputs
+
+    @paddle.no_grad()
+    def set_state_dict(self, state_dict):
+        if "lm_head.weight" in state_dict:
+            self.lm_head.weight.set_value(
+                paddle.to_tensor(state_dict["lm_head.weight"]).cast(self.lm_head.weight.dtype)
+            )
+
+        self.mtp.enorm.weight.set_value(
+            paddle.to_tensor(state_dict["deepseek_v3_mtp.enorm.weight"]).cast(self.lm_head.weight.dtype)
+        )
+        self.mtp.hnorm.weight.set_value(
+            paddle.to_tensor(state_dict["deepseek_v3_mtp.hnorm.weight"]).cast(self.lm_head.weight.dtype)
+        )
+        self.mtp.norm.weight.set_value(
+            paddle.to_tensor(state_dict["deepseek_v3_mtp.norm.weight"]).cast(self.lm_head.weight.dtype)
+        )
+        self.mtp.eh_proj.weight.set_value(
+            paddle.to_tensor(state_dict["deepseek_v3_mtp.eh_proj.weight"]).cast(self.lm_head.weight.dtype)
+        )
+
+        self.mtp.set_state_dict({k: state_dict[k] for k in state_dict.keys()})
+
+    def forward(
+        self,
+        input_ids,
+        src_mask=None,
+        pre_caches=None,
+        caches=None,
+        seq_lens_this_time=None,
+        seq_lens_encoder=None,
+        seq_lens_decoder=None,
+        rope_emb=None,
+        block_tables=None,
+        k_quant_scales=None,
+        v_quant_scales=None,
+        k_dequant_scales=None,
+        v_dequant_scales=None,
+        draft_tokens=None,
+        output_padding_offset=None,
+        pre_hidden_states=None,
+    ):
+        outputs = self.mtp(
+            input_ids,
+            src_mask=src_mask,
+            caches=caches,
+            rope_emb=rope_emb,
+            block_tables=block_tables,
+            pre_caches=pre_caches,
+            seq_lens_this_time=seq_lens_this_time,
+            seq_lens_encoder=seq_lens_encoder,
+            seq_lens_decoder=seq_lens_decoder,
+            k_quant_scales=k_quant_scales,
+            v_quant_scales=v_quant_scales,
+            k_dequant_scales=k_dequant_scales,
+            v_dequant_scales=v_dequant_scales,
+            draft_tokens=draft_tokens,
+            output_padding_offset=output_padding_offset,
+            pre_hidden_states=pre_hidden_states,
+        )
+
+        hidden_states = outputs[0]
+
+        logits = self.lm_head(
+            hidden_states,
+            tensor_parallel_output=False,
+        )
+
+        return logits, hidden_states
diff --git a/paddlenlp/experimental/transformers/deepseek_v3/__init__.py b/paddlenlp/experimental/transformers/deepseek_v3/__init__.py
new file mode 100644
index 000000000000..c2a7f656c636
--- /dev/null
+++ b/paddlenlp/experimental/transformers/deepseek_v3/__init__.py
@@ -0,0 +1,15 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .modeling import *
diff --git a/paddlenlp/experimental/transformers/deepseek_v3/modeling.py b/paddlenlp/experimental/transformers/deepseek_v3/modeling.py
new file mode 100644
index 000000000000..5a63a7a548ff
--- /dev/null
+++ b/paddlenlp/experimental/transformers/deepseek_v3/modeling.py
@@ -0,0 +1,32 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import annotations
+
+from paddlenlp.experimental.transformers.deepseek_v2.modeling import (
+    DeepseekV2ForCausalLMBlockInferenceModel,
+    MTPDeepseekV2ForCausalLMBlockInferenceModel,
+)
+from paddlenlp.transformers import DeepseekV3Config
+
+__all__ = ["DeepseekV3ForCausalLMBlockInferenceModel"]
+
+
+class DeepseekV3ForCausalLMBlockInferenceModel(DeepseekV2ForCausalLMBlockInferenceModel):
+    def __init__(self, config: DeepseekV3Config, base_model_prefix: str = "deepseek_v3"):
+        super().__init__(config, base_model_prefix)
+
+
+class MTPDeepseekV3ForCausalLMBlockInferenceModel(MTPDeepseekV2ForCausalLMBlockInferenceModel):
+    def __init__(self, config: DeepseekV3Config, base_model_prefix: str = "deepseek_v3_mtp"):
+        super().__init__(config, base_model_prefix)
diff --git a/paddlenlp/experimental/transformers/fused_transformer_layers.py b/paddlenlp/experimental/transformers/fused_transformer_layers.py
index 887ab006311a..a9de2fea3469 100644
--- a/paddlenlp/experimental/transformers/fused_transformer_layers.py
+++ b/paddlenlp/experimental/transformers/fused_transformer_layers.py
@@ -76,6 +76,7 @@ def use_cutlass_fp8_gemm():
 
 __all__ = [
     "MoeConfig",
+    "MLAConfig",
     "FusedMultiTransformerConfig",
     "FusedMultiTransformerBase",
     "FusedMultiTransformerPostLayernorm",
@@ -107,8 +108,16 @@ def _set_var_distributed(var):
 class MoeConfig:
     num_experts: int = 0
     top_k: int = 0
+    topk_method: Optional[str] = None
+    num_expert_group: int = 1
+    topk_group: Optional[int] = None
     norm_topk_prob: bool = True
     moe_every2: bool = False
+    first_k_dense_replace: int = 0
+    moe_intermediate_size: int = 0
+    routed_scaling_factor: float = 1.0
+
+    shared_expert_with_gate: bool = True
 
     shared_expert_intermediate_size: int = 0
     shared_expert_ffn1_weight_attrs: Optional[List[paddle.ParamAttr]] = None
@@ -121,7 +130,11 @@ def has_moe(self) -> bool:
         return self.num_experts > 1
 
     def use_moe(self, i: int) -> bool:
-        return self.has_moe() and (self.moe_every2 is False or (self.moe_every2 and i % 2 == 1))
+        return (
+            self.has_moe()
+            and (self.moe_every2 is False or (self.moe_every2 and i % 2 == 1))
+            and i >= self.first_k_dense_replace
+        )
 
     def has_shared_expert(self) -> bool:
         return self.has_moe() and self.shared_expert_intermediate_size > 0
@@ -144,18 +157,51 @@ class SpeculateConfig:
     return_full_hidden_states: bool = False
 
 
+@dataclass
+class MLAConfig:
+    q_lora_rank: int = None
+    kv_lora_rank: int = None
+    qk_nope_head_dim: int = None
+    qk_rope_head_dim: int = None
+    v_head_dim: int = None
+
+    mscale: float = 1.0
+
+    q_proj_weight_attrs: Optional[List[paddle.ParamAttr]] = None
+    q_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None
+
+    q_a_proj_weight_attrs: Optional[List[paddle.ParamAttr]] = None
+    q_a_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None
+    q_a_layernorm_weight_attrs: Optional[List[paddle.ParamAttr]] = None
+    q_b_proj_weight_attrs: Optional[List[paddle.ParamAttr]] = None
+    q_b_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None
+    kv_a_proj_with_mqa_weight_attrs: List[paddle.ParamAttr] = None
+    kv_a_proj_with_mqa_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None
+    kv_a_layernorm_weight_attrs: List[paddle.ParamAttr] = None
+    kv_b_proj_weight_attrs: List[paddle.ParamAttr] = None
+    kv_b_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None
+
+    def use_mla(self) -> bool:
+        return self.kv_lora_rank is not None
+
+    @property
+    def qk_head_dim(self) -> int:
+        return self.qk_nope_head_dim + self.qk_rope_head_dim
+
+
 class FusedMultiTransformerConfig:
     def __init__(
         self,
         embed_dim,
         num_heads,
-        dim_feedforward,
+        intermediate_size,
         quant_type="",
         dropout_rate=0.0,
         activation="gelu",
         norm_type="layernorm",
         use_neox_rotary_style=False,
         rope_theta=10000.0,
+        rotary_emb=None,
         normalize_before=True,
         ln_scale_attrs=None,
         ln_bias_attrs=None,
@@ -181,6 +227,7 @@ def __init__(
         ffn2_weight_attrs=None,
         ffn2_weight_scale_attrs=None,
         ffn2_bias_attrs=None,
+        e_score_correction_bias_attrs=None,
         qkv_out_scale_attrs=None,
         linear_out_scale_attrs=None,
         ffn1_out_scale_attrs=None,
@@ -209,6 +256,7 @@ def __init__(
         moe_config=MoeConfig(),
         avx_config=AvxConfig(),
         speculate_config=SpeculateConfig(),
+        mla_config=MLAConfig(),
     ):
         self.embed_dim = embed_dim
         self.num_heads = num_heads
@@ -216,11 +264,14 @@ def __init__(
             self.kv_num_heads = kv_num_heads
         else:
             self.kv_num_heads = num_heads
-        self.dim_feedforward = dim_feedforward
+        self.intermediate_size = intermediate_size
         self.dropout_rate = dropout_rate
         self.activation = activation
         self.norm_type = norm_type
         self.rope_theta = rope_theta
+
+        self.rotary_emb = rotary_emb
+
         self.use_neox_rotary_style = use_neox_rotary_style
         self.normalize_before = normalize_before
         self.ln_scale_attrs = ln_scale_attrs
@@ -251,6 +302,8 @@ def __init__(
         self.ffn2_weight_scale_attrs = ffn2_weight_scale_attrs
         self.ffn2_bias_attrs = ffn2_bias_attrs
 
+        self.e_score_correction_bias_attrs = e_score_correction_bias_attrs
+
         self.qkv_out_scale_attrs = qkv_out_scale_attrs
         self.linear_out_scale_attrs = linear_out_scale_attrs
         self.ffn1_out_scale_attrs = ffn1_out_scale_attrs
@@ -287,6 +340,7 @@ def __init__(
         self.moe_config = moe_config
         self.avx_config = avx_config
         self.speculate_config = speculate_config
+        self.mla_config = mla_config
 
 
 class FusedMultiTransformerBase(Layer):
@@ -299,8 +353,8 @@ def __init__(self, config: FusedMultiTransformerConfig):
             config.embed_dim
         )
         assert config.num_heads > 0, "Expected nhead to be greater than 0, " "but received {}".format(config.num_heads)
-        assert config.dim_feedforward > 0, "Expected dim_feedforward to be greater than 0, but received {}".format(
-            config.dim_feedforward
+        assert config.intermediate_size > 0, "Expected intermediate_size to be greater than 0, but received {}".format(
+            config.intermediate_size
         )
 
         # self.normalize_before = normalize_before
@@ -333,27 +387,38 @@ def __init__(self, config: FusedMultiTransformerConfig):
         self.activation = config.activation
 
         self.embed_dim = config.embed_dim
-        self.head_dim = config.embed_dim // config.num_heads
-        assert self.head_dim * config.num_heads == config.embed_dim, "embed_dim must be divisible by num_heads"
+        if config.mla_config.use_mla():
+            self.head_dim = config.mla_config.v_head_dim
+        else:
+            self.head_dim = config.embed_dim // config.num_heads
+            assert self.head_dim * config.num_heads == config.embed_dim, "embed_dim must be divisible by num_heads"
 
         # tensor model parallel
         if config.nranks > 1:
             assert config.ring_id != -1
         assert config.num_heads % config.nranks == 0
-        assert config.dim_feedforward % config.nranks == 0
+        assert config.intermediate_size % config.nranks == 0
         assert config.moe_config.shared_expert_intermediate_size % config.nranks == 0
+        assert config.moe_config.moe_intermediate_size % config.nranks == 0
         self.num_heads = config.num_heads // config.nranks
         self.kv_num_heads = config.kv_num_heads // config.nranks
-        dim_feedforward = config.dim_feedforward // config.nranks
-        self.dim_feedforward = dim_feedforward
-        shared_expert_intermediate_size = config.moe_config.shared_expert_intermediate_size // config.nranks
-        self.config.moe_config.shared_expert_intermediate_size = shared_expert_intermediate_size
+        self.intermediate_size = config.intermediate_size // config.nranks
+        self.config.moe_config.shared_expert_intermediate_size //= config.nranks
+        self.config.moe_config.moe_intermediate_size //= config.nranks
 
         self.num_layers = config.num_layers
         assert self.num_layers > 0
-        if isinstance(config.qkv_weight_attrs, (list, tuple)):
+        if config.qkv_weight_attrs is not None and isinstance(config.qkv_weight_attrs, (list, tuple)):
             assert self.num_layers == len(config.qkv_weight_attrs)
 
+        if self.config.mla_config.use_mla():
+            mscale = self.config.mla_config.mscale
+            self.softmax_scale = float(self.config.mla_config.qk_head_dim**-0.5) * mscale * mscale
+        else:
+            self.softmax_scale = float(self.head_dim**-0.5)
+
+        self.position_ids: list[int] = []
+
         self.weight_dtype = self._dtype
         self.create_params_type = self.get_weight_create_dype()
 
@@ -363,10 +428,12 @@ def __init__(self, config: FusedMultiTransformerConfig):
         self.ffn_ln_scales, self.ffn_ln_biases = [], []
         self.ffn1_biases = []
         self.ffn2_biases = []
-        if self.config.moe_config.has_shared_expert():
-            self.shared_expert_gate_weights = []
-            self.shared_expert_ffn1_weights = []
-            self.shared_expert_ffn2_weights = []
+        self.e_score_correction_biases = []
+
+        self.shared_expert_gate_weights = []
+        self.shared_expert_ffn1_weights = []
+        self.shared_expert_ffn2_weights = []
+
         self.cache_k_scales, self.cache_v_scales = [], []
         self.cache_k_out_scales, self.cache_v_out_scales = [], []
 
@@ -383,11 +450,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
             ffn_ln_bias_attr = self.get_attr(config.ffn_ln_bias_attrs, i)
             ffn1_bias_attr = self.get_attr(config.ffn1_bias_attrs, i)
             ffn2_bias_attr = self.get_attr(config.ffn2_bias_attrs, i)
-
-            if self.config.moe_config.use_shared_expert(i):
-                shared_expert_gate_weight_attr = self.get_attr(config.moe_config.shared_expert_gate_weight_attrs, i)
-                shared_expert_ffn1_weight_attr = self.get_attr(config.moe_config.shared_expert_ffn1_weight_attrs, i)
-                shared_expert_ffn2_weight_attr = self.get_attr(config.moe_config.shared_expert_ffn2_weight_attrs, i)
+            e_score_correction_bias_attr = self.get_attr(config.e_score_correction_bias_attrs, i)
 
             cache_k_scale_attr = self.get_attr(config.cache_k_scale_attrs, i)
             cache_v_scale_attr = self.get_attr(config.cache_v_scale_attrs, i)
@@ -448,21 +511,34 @@ def __init__(self, config: FusedMultiTransformerConfig):
             if ffn1_bias_attr:
                 if self.config.moe_config.use_moe(i):
                     ffn1_bias = self.create_parameter(
-                        shape=[self.config.moe_config.num_experts, self.dim_feedforward * 2]
+                        shape=[self.config.moe_config.num_experts, self.intermediate_size * 2]
                         if self.activation.endswith("glu")
-                        else [self.config.moe_config.num_experts, self.dim_feedforward],
+                        else [self.config.moe_config.num_experts, self.intermediate_size],
                         attr=ffn1_bias_attr,
                         dtype=self._dtype,
                         is_bias=True,
                     )
                 else:
                     ffn1_bias = self.create_parameter(
-                        shape=[dim_feedforward * 2] if self.activation.endswith("glu") else [dim_feedforward],
+                        shape=[self.intermediate_size * 2]
+                        if self.activation.endswith("glu")
+                        else [self.intermediate_size],
                         attr=ffn1_bias_attr,
                         dtype=self._dtype,
                         is_bias=True,
                     )
 
+            e_score_correction_bias = None
+            if e_score_correction_bias_attr:
+                if self.config.moe_config.use_moe(i):
+                    if self.config.moe_config.topk_method == "noaux_tc":
+                        e_score_correction_bias = self.create_parameter(
+                            shape=[self.config.moe_config.num_experts],
+                            attr=e_score_correction_bias_attr,
+                            dtype="float32",
+                            is_bias=True,
+                        )
+
             ffn2_bias = None
             if ffn2_bias_attr:
                 if self.config.moe_config.use_moe(i):
@@ -480,26 +556,9 @@ def __init__(self, config: FusedMultiTransformerConfig):
                         is_bias=True,
                     )
 
-            if self.config.moe_config.use_shared_expert(i):
-                shared_expert_ffn1_weight = self.create_parameter(
-                    shape=self.shared_expert_ffn1_weight_shape,
-                    attr=shared_expert_ffn1_weight_attr,
-                    dtype=self.create_params_type,
-                )
-                shared_expert_ffn2_weight = self.create_parameter(
-                    shape=self.shared_expert_ffn2_weight_shape,
-                    attr=shared_expert_ffn2_weight_attr,
-                    dtype=self.create_params_type,
-                )
-                shared_expert_gate_weight = self.create_parameter(
-                    shape=self.shared_expert_gate_weight_shape,
-                    attr=shared_expert_gate_weight_attr,
-                    dtype=self._helper.get_default_dtype(),
-                )
-
             cache_scale_dtype = "float32"
             if self.config.append_attn:
-                cache_scale_dtype = paddle.get_default_dtype()
+                cache_scale_dtype = self._dtype
 
             cache_k_scale = None
             if cache_k_scale_attr:
@@ -542,9 +601,6 @@ def __init__(self, config: FusedMultiTransformerConfig):
                 # column parallel
                 _set_var_distributed(qkv_bias)
                 _set_var_distributed(ffn1_bias)
-                if self.config.moe_config.use_shared_expert(i):
-                    _set_var_distributed(shared_expert_ffn1_weight)
-                    _set_var_distributed(shared_expert_ffn2_weight)
 
             self.ln_scales.append(ln_scale)
             self.ln_biases.append(ln_bias)
@@ -555,11 +611,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
             self.ffn_ln_biases.append(ffn_ln_bias)
             self.ffn1_biases.append(ffn1_bias)
             self.ffn2_biases.append(ffn2_bias)
-
-            if self.config.moe_config.use_shared_expert(i):
-                self.shared_expert_ffn1_weights.append(shared_expert_ffn1_weight)
-                self.shared_expert_ffn2_weights.append(shared_expert_ffn2_weight)
-                self.shared_expert_gate_weights.append(shared_expert_gate_weight)
+            self.e_score_correction_biases.append(e_score_correction_bias)
 
             self.cache_k_scales.append(cache_k_scale)
             self.cache_v_scales.append(cache_v_scale)
@@ -575,11 +627,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
             self._add_parameter(ffn_ln_bias)
             self._add_parameter(ffn1_bias)
             self._add_parameter(ffn2_bias)
-
-            if self.config.moe_config.use_shared_expert(i):
-                self._add_parameter(shared_expert_ffn1_weight)
-                self._add_parameter(shared_expert_ffn2_weight)
-                self._add_parameter(shared_expert_gate_weight)
+            self._add_parameter(e_score_correction_bias)
 
             self._add_parameter(cache_k_scale)
             self._add_parameter(cache_v_scale)
@@ -588,10 +636,6 @@ def __init__(self, config: FusedMultiTransformerConfig):
 
         self.dropout_rate = config.dropout_rate
 
-        from paddle.incubate.nn.functional import fused_linear
-
-        self.linear = fused_linear
-
     def init_weight(self):
         self.qkv_weights = []
         self.linear_weights = []
@@ -599,19 +643,86 @@ def init_weight(self):
         self.ffn1_weights = []
         self.ffn2_weights = []
 
+        self.q_proj_weights = []
+        self.q_a_proj_weights = []
+        self.q_a_layernorm_weights = []
+        self.q_b_proj_weights = []
+        self.kv_a_proj_with_mqa_weights = []
+        self.kv_a_layernorm_weights = []
+        self.kv_b_proj_weights = []
+
         for i in range(self.num_layers):
-            qkv_weight_attr = self.get_attr(self.config.qkv_weight_attrs, i)
             linear_weight_attr = self.get_attr(self.config.linear_weight_attrs, i)
             gate_weight_attr = self.get_attr(self.config.gate_weight_attrs, i)
             ffn1_weight_attr = self.get_attr(self.config.ffn1_weight_attrs, i)
             ffn2_weight_attr = self.get_attr(self.config.ffn2_weight_attrs, i)
 
-            qkv_weight = self.create_parameter(
-                shape=self.qkv_weight_shape,
-                attr=qkv_weight_attr,
-                dtype=self.create_params_type,
-                is_bias=False,
-            )
+            qkv_weight = None
+            if self.config.mla_config.use_mla():
+                if self.config.mla_config.q_lora_rank is None:
+                    q_proj_weight_attr = self.get_attr(self.config.mla_config.q_proj_weight_attrs, i)
+                    q_proj_weight = self.create_parameter(
+                        shape=self.q_proj_weight_shape,
+                        attr=q_proj_weight_attr,
+                        dtype=self.create_params_type,
+                        is_bias=False,
+                    )
+                else:
+                    q_a_proj_weight_attr = self.get_attr(self.config.mla_config.q_a_proj_weight_attrs, i)
+                    q_a_layernorm_weight_attr = self.get_attr(self.config.mla_config.q_a_layernorm_weight_attrs, i)
+                    q_b_proj_weight_attr = self.get_attr(self.config.mla_config.q_b_proj_weight_attrs, i)
+                    q_a_proj_weight = self.create_parameter(
+                        shape=self.q_a_proj_weight_shape,
+                        attr=q_a_proj_weight_attr,
+                        dtype=self.create_params_type,
+                        is_bias=False,
+                    )
+                    q_a_layernorm_weight = self.create_parameter(
+                        shape=[self.config.mla_config.q_lora_rank],
+                        attr=q_a_layernorm_weight_attr,
+                        dtype=self._norm_weight_dtype,
+                        is_bias=False,
+                    )
+                    q_b_proj_weight = self.create_parameter(
+                        shape=self.q_b_proj_weight_shape,
+                        attr=q_b_proj_weight_attr,
+                        dtype=self.create_params_type,
+                        is_bias=False,
+                    )
+
+                kv_a_proj_with_mqa_weight_attr = self.get_attr(
+                    self.config.mla_config.kv_a_proj_with_mqa_weight_attrs, i
+                )
+                kv_a_layernorm_weight_attr = self.get_attr(self.config.mla_config.kv_a_layernorm_weight_attrs, i)
+                kv_b_proj_weight_attr = self.get_attr(self.config.mla_config.kv_b_proj_weight_attrs, i)
+
+                kv_a_proj_with_mqa_weight = self.create_parameter(
+                    shape=self.kv_a_proj_with_mqa_weight_shape,
+                    attr=kv_a_proj_with_mqa_weight_attr,
+                    dtype=self.create_params_type,
+                    is_bias=False,
+                )
+                kv_a_layernorm_weight = self.create_parameter(
+                    shape=[self.config.mla_config.kv_lora_rank],
+                    attr=kv_a_layernorm_weight_attr,
+                    dtype=self._norm_weight_dtype,
+                    is_bias=False,
+                )
+                kv_b_proj_weight = self.create_parameter(
+                    shape=self.kv_b_proj_weight_shape,
+                    attr=kv_b_proj_weight_attr,
+                    dtype=self.create_params_type,
+                    is_bias=False,
+                )
+            else:
+                qkv_weight_attr = self.get_attr(self.config.qkv_weight_attrs, i)
+                qkv_weight = self.create_parameter(
+                    shape=self.qkv_weight_shape,
+                    attr=qkv_weight_attr,
+                    dtype=self.create_params_type,
+                    is_bias=False,
+                )
+
             linear_weight = self.create_parameter(
                 shape=self.linear_weight_shape,
                 attr=linear_weight_attr,
@@ -620,7 +731,6 @@ def init_weight(self):
             )
 
             gate_weight = None
-
             if self.config.moe_config.use_moe(i):
                 gate_weight = self.create_parameter(
                     shape=[self.config.embed_dim, self.config.moe_config.num_experts],
@@ -637,6 +747,12 @@ def init_weight(self):
                     dtype=self.create_params_type,
                     is_bias=False,
                 )
+                ffn2_weight = self.create_parameter(
+                    shape=self.moe_ffn2_weight_shape,
+                    attr=ffn2_weight_attr,
+                    dtype=self.create_params_type,
+                    is_bias=False,
+                )
             else:
                 ffn1_weight = self.create_parameter(
                     shape=self.ffn1_weight_shape,
@@ -644,20 +760,44 @@ def init_weight(self):
                     dtype=self.create_params_type,
                     is_bias=False,
                 )
-            if self.config.moe_config.use_moe(i):
                 ffn2_weight = self.create_parameter(
-                    shape=self.moe_ffn2_weight_shape,
+                    shape=self.ffn2_weight_shape,
                     attr=ffn2_weight_attr,
                     dtype=self.create_params_type,
                     is_bias=False,
                 )
-            else:
-                ffn2_weight = self.create_parameter(
-                    shape=self.ffn2_weight_shape,
-                    attr=ffn2_weight_attr,
+
+            shared_expert_ffn1_weight = None
+            shared_expert_ffn2_weight = None
+            shared_expert_gate_weight = None
+            if self.config.moe_config.use_shared_expert(i):
+                if self.config.moe_config.shared_expert_with_gate:
+                    shared_expert_gate_weight_attr = self.get_attr(
+                        self.config.moe_config.shared_expert_gate_weight_attrs, i
+                    )
+                shared_expert_ffn1_weight_attr = self.get_attr(
+                    self.config.moe_config.shared_expert_ffn1_weight_attrs, i
+                )
+                shared_expert_ffn2_weight_attr = self.get_attr(
+                    self.config.moe_config.shared_expert_ffn2_weight_attrs, i
+                )
+
+                shared_expert_ffn1_weight = self.create_parameter(
+                    shape=self.shared_expert_ffn1_weight_shape,
+                    attr=shared_expert_ffn1_weight_attr,
                     dtype=self.create_params_type,
-                    is_bias=False,
                 )
+                shared_expert_ffn2_weight = self.create_parameter(
+                    shape=self.shared_expert_ffn2_weight_shape,
+                    attr=shared_expert_ffn2_weight_attr,
+                    dtype=self.create_params_type,
+                )
+                if self.config.moe_config.shared_expert_with_gate:
+                    shared_expert_gate_weight = self.create_parameter(
+                        shape=self.shared_expert_gate_weight_shape,
+                        attr=shared_expert_gate_weight_attr,
+                        dtype=self._helper.get_default_dtype(),
+                    )
 
             # tensor model parallel
             if self.config.nranks > 1:
@@ -668,16 +808,54 @@ def init_weight(self):
                 _set_var_distributed(linear_weight)
                 _set_var_distributed(ffn2_weight)
 
-            self.qkv_weights.append(qkv_weight)
+                if self.config.moe_config.use_shared_expert(i):
+                    _set_var_distributed(shared_expert_ffn1_weight)
+                    _set_var_distributed(shared_expert_ffn2_weight)
+
+            if self.config.mla_config.use_mla():
+                if self.config.mla_config.q_lora_rank is None:
+                    self.q_proj_weights.append(q_proj_weight)
+                else:
+                    self.q_a_proj_weights.append(q_a_proj_weight)
+                    self.q_a_layernorm_weights.append(q_a_layernorm_weight)
+                    self.q_b_proj_weights.append(q_b_proj_weight)
+                self.kv_a_proj_with_mqa_weights.append(kv_a_proj_with_mqa_weight)
+                self.kv_a_layernorm_weights.append(kv_a_layernorm_weight)
+                self.kv_b_proj_weights.append(kv_b_proj_weight)
+            else:
+                self.qkv_weights.append(qkv_weight)
+
             self.linear_weights.append(linear_weight)
 
-            if gate_weight is not None:
-                self.gate_weights.append(gate_weight)
+            self.gate_weights.append(gate_weight)
             self.ffn1_weights.append(ffn1_weight)
             self.ffn2_weights.append(ffn2_weight)
 
-            self._add_parameter(qkv_weight)
+            self.shared_expert_ffn1_weights.append(shared_expert_ffn1_weight)
+            self.shared_expert_ffn2_weights.append(shared_expert_ffn2_weight)
+            self.shared_expert_gate_weights.append(shared_expert_gate_weight)
+
+            if self.config.mla_config.use_mla():
+                if self.config.mla_config.q_lora_rank is None:
+                    self._add_parameter(q_proj_weight)
+                else:
+                    self._add_parameter(q_a_proj_weight)
+                    self._add_parameter(q_a_layernorm_weight)
+                    self._add_parameter(q_b_proj_weight)
+                self._add_parameter(kv_a_proj_with_mqa_weight)
+                self._add_parameter(kv_a_layernorm_weight)
+                self._add_parameter(kv_b_proj_weight)
+            else:
+                self._add_parameter(qkv_weight)
+
+            if self.config.moe_config.use_shared_expert(i):
+                self._add_parameter(shared_expert_ffn1_weight)
+                self._add_parameter(shared_expert_ffn2_weight)
+                if self.config.moe_config.shared_expert_with_gate:
+                    self._add_parameter(shared_expert_gate_weight)
+
             self._add_parameter(linear_weight)
+
             if gate_weight is not None:
                 self._add_parameter(gate_weight)
             self._add_parameter(ffn1_weight)
@@ -698,27 +876,55 @@ def _add_parameter(self, param):
         self._parameters[param.name] = param
 
     def init_weight_shape(self, config):
-        self.qkv_weight_shape = (
-            [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim]
-            if config.trans_qkvw
-            else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim]
-        )
+
+        if self.config.mla_config.use_mla():
+            if self.config.mla_config.q_lora_rank is None:
+                self.q_proj_weight_shape = [
+                    self.config.embed_dim,
+                    self.num_heads * (self.config.mla_config.qk_head_dim),
+                ]
+            else:
+                self.q_a_proj_weight_shape = [self.config.embed_dim, self.config.mla_config.q_lora_rank]
+                self.q_b_proj_weight_shape = [
+                    self.config.mla_config.q_lora_rank,
+                    self.num_heads * (self.config.mla_config.qk_head_dim),
+                ]
+
+            self.kv_a_proj_with_mqa_weight_shape = [
+                self.config.embed_dim,
+                self.config.mla_config.kv_lora_rank + self.config.mla_config.qk_rope_head_dim,
+            ]
+            self.kv_b_proj_weight_shape = [
+                self.config.mla_config.kv_lora_rank,
+                self.num_heads * (self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim),
+            ]
+        else:
+            self.qkv_weight_shape = (
+                [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim]
+                if config.trans_qkvw
+                else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim]
+            )
+
         self.linear_weight_shape = [self.num_heads * self.head_dim, self.embed_dim]
 
         self.ffn1_weight_shape = (
-            [self.embed_dim, self.dim_feedforward * 2]
+            [self.embed_dim, self.intermediate_size * 2]
             if self.activation.endswith("glu")
-            else [self.embed_dim, self.dim_feedforward]
+            else [self.embed_dim, self.intermediate_size]
         )
-        self.ffn2_weight_shape = [self.dim_feedforward, self.embed_dim]
+        self.ffn2_weight_shape = [self.intermediate_size, self.embed_dim]
 
-        if self.config.moe_config.has_moe() is True:
+        if self.config.moe_config.has_moe():
             self.moe_ffn1_weight_shape = (
-                [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward * 2]
+                [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size * 2]
                 if self.activation.endswith("glu")
-                else [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward]
+                else [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size]
             )
-            self.moe_ffn2_weight_shape = [self.config.moe_config.num_experts, self.dim_feedforward, self.embed_dim]
+            self.moe_ffn2_weight_shape = [
+                self.config.moe_config.num_experts,
+                self.config.moe_config.moe_intermediate_size,
+                self.embed_dim,
+            ]
 
         if self.config.moe_config.has_shared_expert():
             self.shared_expert_ffn1_weight_shape = [
@@ -729,10 +935,11 @@ def init_weight_shape(self, config):
                 self.config.moe_config.shared_expert_intermediate_size,
                 self.embed_dim,
             ]
-            self.shared_expert_gate_weight_shape = [
-                self.embed_dim,
-                1,
-            ]
+            if self.config.moe_config.shared_expert_with_gate:
+                self.shared_expert_gate_weight_shape = [
+                    self.embed_dim,
+                    1,
+                ]
 
     def skip_quant(self, layer_name, layer_idx):
         return False
@@ -749,14 +956,65 @@ def compute_layernorm_before_qkv(self, src, i):
         return ln_out
 
     def compute_qkv_linear(self, ln_out, i):
-        if paddle.version.cuda() == "False" or float(paddle.version.cuda()) < 11.6:
+        if self.config.mla_config.use_mla():
+            if self.config.mla_config.q_lora_rank is not None:
+                query = paddle.matmul(ln_out, self.q_a_proj_weights[i])
+                query = self.norm_func(
+                    x=query,
+                    norm_weight=self.q_a_layernorm_weights[i],
+                    norm_bias=None,
+                    epsilon=self._epsilon,
+                    begin_norm_axis=1,
+                )[0]
+                query = paddle.matmul(query, self.q_b_proj_weights[i])
+            else:
+                query = paddle.matmul(ln_out, self.q_proj_weights[i])
+
+            query = query.reshape([-1, self.num_heads, self.config.mla_config.qk_head_dim])
+            query_nope, query_pe = query.split(
+                [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.qk_rope_head_dim], axis=-1
+            )
+
+            compressed_kv = paddle.matmul(ln_out, self.kv_a_proj_with_mqa_weights[i])
+            compressed_kv, key_pe = compressed_kv.split(
+                [self.config.mla_config.kv_lora_rank, self.config.mla_config.qk_rope_head_dim], axis=-1
+            )
+            key_pe = key_pe.reshape([-1, 1, self.config.mla_config.qk_rope_head_dim])
+            compressed_kv = self.norm_func(
+                x=compressed_kv,
+                norm_weight=self.kv_a_layernorm_weights[i],
+                norm_bias=None,
+                epsilon=self._epsilon,
+                begin_norm_axis=1,
+            )[0]
+            key_value = paddle.matmul(compressed_kv, self.kv_b_proj_weights[i])
+            key_value = key_value.reshape(
+                [-1, self.num_heads, self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim]
+            )
+            key_nope, value = key_value.split(
+                [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.v_head_dim], axis=-1
+            )
+            query_pe, key_pe = self.config.rotary_emb(self.position_ids, query_pe, key_pe)
+
+            query[..., self.config.mla_config.qk_nope_head_dim :] = query_pe
+            key = paddle.empty_like(query)
+            key[..., : self.config.mla_config.qk_nope_head_dim] = key_nope
+            key[..., self.config.mla_config.qk_nope_head_dim :] = key_pe
+
+            qkv_out = paddle.concat(
+                [
+                    query.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]),
+                    key.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]),
+                    value.reshape([-1, self.num_heads * self.config.mla_config.v_head_dim]),
+                ],
+                axis=-1,
+            )
+        else:
             qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True)
             if self.qkv_biases[i] is not None:
                 qkv_out = paddle.add(qkv_out, self.qkv_biases[i])
-            return qkv_out
-        else:
-            # This method requires CUDA version >= 11.6.
-            return self.linear(ln_out, self.qkv_weights[i], self.qkv_biases[i], transpose_weight=True)
+
+        return qkv_out
 
     def compute_qkv(self, src, residual_input, i):
         ln_out = self.compute_layernorm_before_qkv(src, i)
@@ -820,7 +1078,7 @@ def compute_fmha(
             seq_lens,
             seq_lens + pre_caches_length,
             mask=attn_mask,
-            scale=float(self.head_dim**-0.5),
+            scale=self.softmax_scale,
         )
 
         return transpose_remove_padding(qktv_out, seq_lens, padding_offset)
@@ -893,19 +1151,113 @@ def compute_ffn_layernorm(self, out_linear_out, residual_input, i):
         return tmp_out, residual_input
 
     def compute_fused_moe(self, tmp_out, i):
-        fused_moe_out = fused_moe(
-            tmp_out,
-            self.gate_weights[i],
-            self.ffn1_weights[i],
-            self.ffn2_weights[i],
-            self.ffn1_biases[i],
-            None,
-            self.ffn2_biases[i],
-            None,
-            "None",
-            self.config.moe_config.top_k,
-            self.config.moe_config.norm_topk_prob,
-        )
+        e_score_correction_bias = self.e_score_correction_biases[i]
+
+        def get_moe_scores(
+            gating_output: paddle.Tensor,
+            config: MoeConfig,
+        ) -> tuple[paddle.Tensor, paddle.Tensor]:
+
+            num_token = gating_output.shape[0]
+            num_expert_group = config.num_expert_group
+            topk_group = config.topk_group
+
+            # Compute softmax or sigmoid scores based on the topk_method
+            if config.topk_method == "greedy":
+                scores = paddle.nn.functional.softmax(gating_output, axis=-1)
+                return scores, scores
+            elif config.topk_method == "group_limited_greedy":
+                scores = paddle.nn.functional.softmax(gating_output, axis=-1)
+                scores_no_bias = scores
+                group_scores = scores.reshape([num_token, num_expert_group, -1]).max(axis=-1)  # [n, num_expert_group]
+            elif config.topk_method == "noaux_tc":
+                if e_score_correction_bias is None:
+                    raise ValueError("e_score_correction_bias must be provided for 'noaux_tc' method.")
+                scores = paddle.nn.functional.sigmoid(gating_output)
+                # 原始 scores
+                scores_no_bias = scores
+                scores = scores + e_score_correction_bias.unsqueeze(0)
+                group_scores = (
+                    scores.reshape([num_token, num_expert_group, -1]).topk(2, axis=-1)[0].sum(axis=-1)
+                )  # [n, num_expert_group]
+            else:
+                raise ValueError(
+                    f"Unsupported topk_method: {config.topk_method}. Please choose 'group_limited_greedy' or 'noaux_tc'."
+                )
+
+            # Identify top-k groups
+            group_idx = paddle.topk(group_scores, k=topk_group, axis=-1, sorted=False)[1]  # [n, topk_group]
+
+            group_mask = paddle.zeros_like(group_scores, dtype="int64")  # [n, num_expert_group]
+            group_mask = paddle.put_along_axis(group_mask, group_idx, 1, axis=1)
+
+            # Apply group mask to the scores
+            score_mask = (
+                group_mask.unsqueeze(-1)
+                .expand([num_token, num_expert_group, scores.shape[-1] // num_expert_group])
+                .reshape([num_token, -1])
+                .astype("float32")
+            )  # [n, e]
+
+            # Scale the scores with the mask and scaling factor
+            scores = scores * score_mask
+
+            # renormalize 和 refactor 在后面做
+            return scores, scores_no_bias
+
+        if self.config.moe_config.topk_method is not None:
+            from paddle.incubate.nn.functional import moe_dispatch, moe_ffn, moe_reduce
+
+            gate_out = paddle.matmul(tmp_out.cast("float32"), self.gate_weights[i])
+            # 应用各种策略后重塑的 scores
+            scores, scores_no_bias = get_moe_scores(gate_out, self.config.moe_config)
+            # topk 在 moe_dispatch 中
+            (
+                permute_input,
+                token_nums_per_expert,
+                permute_indices_per_token,
+                top_k_weights,
+                top_k_indices,
+            ) = moe_dispatch(tmp_out, scores, self.config.moe_config.top_k, False, topk_only_mode=True)
+
+            ffn_out = moe_ffn(
+                permute_input,
+                token_nums_per_expert,
+                self.ffn1_weights[i],
+                self.ffn2_weights[i],
+                self.ffn1_biases[i],
+                self.ffn1_weights_scale[i] if hasattr(self, "ffn1_weights_scale") else None,
+                self.ffn2_weights_scale[i] if hasattr(self, "ffn2_weights_scale") else None,
+                self.quant_type if hasattr(self, "quant_type") else "None",
+            )
+
+            if e_score_correction_bias is not None:
+                top_k_weights = scores_no_bias.take_along_axis(top_k_indices, axis=1)
+
+            # reduce 中会做 topk 个 weight 的 norm 和 routed_scaling_factor
+            fused_moe_out = moe_reduce(
+                ffn_out,
+                top_k_weights,
+                permute_indices_per_token,
+                top_k_indices,
+                self.ffn2_biases[i],
+                norm_topk_prob=self.config.moe_config.norm_topk_prob,
+                routed_scaling_factor=self.config.moe_config.routed_scaling_factor,
+            )
+        else:
+            fused_moe_out = fused_moe(
+                tmp_out,
+                self.gate_weights[i],
+                self.ffn1_weights[i],
+                self.ffn2_weights[i],
+                self.ffn1_biases[i],
+                self.ffn1_weights_scale[i] if hasattr(self, "ffn1_weights_scale") else None,
+                self.ffn2_biases[i],
+                self.ffn2_weights_scale[i] if hasattr(self, "ffn2_weights_scale") else None,
+                self.quant_type if hasattr(self, "quant_type") else "None",
+                self.config.moe_config.top_k,
+                self.config.moe_config.norm_topk_prob,
+            )
         return fused_moe_out
 
     def compute_activation(self, ffn1_out, i):
@@ -918,7 +1270,6 @@ def compute_ffn2(self, ffn1_out, i):
         return paddle.matmul(ffn1_out, self.ffn2_weights[i])
 
     def compute_bias_residual_layernorm(self, ffn2_out, residual_input, i, num_layers):
-
         if i != num_layers - 1:
             norm_out = self.norm_func(
                 ffn2_out,
@@ -946,13 +1297,24 @@ def compute_shared_expert(self, tmp_out, i):
         ffn1_out = paddle.matmul(tmp_out, self.shared_expert_ffn1_weights[i])
         ffn1_out = fused_bias_act(ffn1_out, None, act_method=self.activation)
         ffn2_out = paddle.matmul(ffn1_out, self.shared_expert_ffn2_weights[i])
-        gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i])
-        gate_out = paddle.nn.functional.sigmoid(gate_out)
-        shared_expert_output = gate_out * ffn2_out
-        return shared_expert_output
+        if self.config.moe_config.shared_expert_with_gate:
+            gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i])
+            gate_out = paddle.nn.functional.sigmoid(gate_out)
+            return gate_out * ffn2_out
+        return ffn2_out
 
     def pre_process(self, **kwargs):
-        pass
+        if self.config.mla_config.use_mla():
+            seq_lens_encoder = kwargs.get("seq_lens_encoder", None)
+            seq_lens_decoder = kwargs.get("seq_lens_decoder", None)
+            seq_lens_this_time = kwargs.get("seq_lens_this_time", None)
+            position_ids_shape = paddle.sum(seq_lens_this_time)
+            self.position_ids = paddle.zeros(shape=position_ids_shape, dtype=seq_lens_encoder.dtype)
+
+            from paddlenlp_ops import get_position_ids
+
+            # In-place operations that compute the position_ids.
+            get_position_ids(seq_lens_encoder, seq_lens_decoder, seq_lens_this_time, self.position_ids)
 
     def post_process(self, **kwargs):
         time_step = kwargs.get("time_step", None)
@@ -1023,9 +1385,9 @@ def forward(
         kwargs["cum_offsets"] = cum_offsets
 
         if caches is not None:
-            assert len(caches) == len(self.qkv_weights) or len(caches) == 2 * len(self.qkv_weights)
+            assert len(caches) == len(self.linear_weights) or len(caches) == 2 * len(self.linear_weights)
 
-        assert self.num_layers == len(self.qkv_weights)
+        assert self.num_layers == len(self.linear_weights)
 
         max_enc_len_this_time, max_dec_len_this_time = self.compute_max_len(
             kwargs.get("seq_lens_encoder", None), kwargs.get("seq_lens_decoder", None), cum_offsets
@@ -1177,13 +1539,17 @@ def __init__(self, config: FusedMultiTransformerConfig):
         self.ffn1_weights_scale = []
         self.ffn2_weights_scale = []
 
-        if self.config.moe_config.has_shared_expert():
-            self.shared_expert_ffn1_weights_scale = []
-            self.shared_expert_ffn2_weights_scale = []
+        self.q_proj_weights_scale = []
+        self.q_a_proj_weights_scale = []
+        self.q_b_proj_weights_scale = []
+        self.kv_a_proj_with_mqa_weights_scale = []
+        self.kv_b_proj_weights_scale = []
+
+        self.shared_expert_ffn1_weights_scale = []
+        self.shared_expert_ffn2_weights_scale = []
 
         for i in range(self.num_layers):
 
-            qkv_weight_scale_attr = self.get_attr(config.qkv_weight_scale_attrs, i)
             linear_weight_scale_attr = self.get_attr(config.linear_weight_scale_attrs, i)
             ffn1_weight_scale_attr = self.get_attr(config.ffn1_weight_scale_attrs, i)
             ffn2_weight_scale_attr = self.get_attr(config.ffn2_weight_scale_attrs, i)
@@ -1196,12 +1562,59 @@ def __init__(self, config: FusedMultiTransformerConfig):
                     config.moe_config.shared_expert_ffn2_weight_scale_attrs, i
                 )
 
-            qkv_weight_scale = self.create_parameter(
-                shape=[(self.num_heads + 2 * self.kv_num_heads) * self.head_dim],
-                attr=qkv_weight_scale_attr,
-                dtype=self.weight_scale_dtype,
-                is_bias=False,
-            )
+            qkv_weight_scale = None
+            if self.config.mla_config.use_mla():
+                if self.config.mla_config.q_lora_rank is None:
+                    q_proj_weight_scale_attr = self.get_attr(self.config.mla_config.q_proj_weight_scale_attrs, i)
+                    q_proj_weight_scale = self.create_parameter(
+                        shape=[self.num_heads * (self.config.mla_config.qk_head_dim)],
+                        attr=q_proj_weight_scale_attr,
+                        dtype=self.weight_scale_dtype,
+                        is_bias=False,
+                    )
+                else:
+                    q_a_proj_weight_scale_attr = self.get_attr(self.config.mla_config.q_a_proj_weight_scale_attrs, i)
+                    q_b_proj_weight_scale_attr = self.get_attr(self.config.mla_config.q_b_proj_weight_scale_attrs, i)
+                    q_a_proj_weight_scale = self.create_parameter(
+                        shape=[self.config.mla_config.q_lora_rank],
+                        attr=q_a_proj_weight_scale_attr,
+                        dtype=self.weight_scale_dtype,
+                        is_bias=False,
+                    )
+                    q_b_proj_weight_scale = self.create_parameter(
+                        shape=[self.num_heads * (self.config.mla_config.qk_head_dim)],
+                        attr=q_b_proj_weight_scale_attr,
+                        dtype=self.weight_scale_dtype,
+                        is_bias=False,
+                    )
+
+                kv_a_proj_with_mqa_weight_scale_attr = self.get_attr(
+                    self.config.mla_config.kv_a_proj_with_mqa_weight_scale_attrs, i
+                )
+                kv_b_proj_weight_scale_attr = self.get_attr(self.config.mla_config.kv_b_proj_weight_scale_attrs, i)
+
+                kv_a_proj_with_mqa_weight_scale = self.create_parameter(
+                    shape=[self.config.mla_config.kv_lora_rank + self.config.mla_config.qk_rope_head_dim],
+                    attr=kv_a_proj_with_mqa_weight_scale_attr,
+                    dtype=self.weight_scale_dtype,
+                    is_bias=False,
+                )
+                kv_b_proj_weight_scale = self.create_parameter(
+                    shape=[
+                        self.num_heads * (self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim)
+                    ],
+                    attr=kv_b_proj_weight_scale_attr,
+                    dtype=self.weight_scale_dtype,
+                    is_bias=False,
+                )
+            else:
+                qkv_weight_scale_attr = self.get_attr(config.qkv_weight_scale_attrs, i)
+                qkv_weight_scale = self.create_parameter(
+                    shape=[(self.num_heads + 2 * self.kv_num_heads) * self.head_dim],
+                    attr=qkv_weight_scale_attr,
+                    dtype=self.weight_scale_dtype,
+                    is_bias=False,
+                )
 
             linear_weight_scale = self.create_parameter(
                 shape=[self.embed_dim],
@@ -1212,16 +1625,18 @@ def __init__(self, config: FusedMultiTransformerConfig):
 
             if self.config.moe_config.use_moe(i):
                 ffn1_weight_scale = self.create_parameter(
-                    shape=[self.config.moe_config.num_experts, self.dim_feedforward * 2]
+                    shape=[self.config.moe_config.num_experts, self.config.moe_config.moe_intermediate_size * 2]
                     if config.activation.endswith("glu")
-                    else [self.config.moe_config.num_experts, self.dim_feedforward],
+                    else [self.config.moe_config.num_experts, self.config.moe_config.moe_intermediate_size],
                     attr=ffn1_weight_scale_attr,
                     dtype=self.weight_scale_dtype,
                     is_bias=False,
                 )
             else:
                 ffn1_weight_scale = self.create_parameter(
-                    shape=[self.dim_feedforward * 2] if config.activation.endswith("glu") else [self.dim_feedforward],
+                    shape=[self.intermediate_size * 2]
+                    if config.activation.endswith("glu")
+                    else [self.intermediate_size],
                     attr=ffn1_weight_scale_attr,
                     dtype=self.weight_scale_dtype,
                     is_bias=False,
@@ -1242,6 +1657,8 @@ def __init__(self, config: FusedMultiTransformerConfig):
                     is_bias=False,
                 )
 
+            shared_expert_ffn1_weight_scale = None
+            shared_expert_ffn2_weight_scale = None
             if self.config.moe_config.use_shared_expert(i):
                 shared_expert_ffn1_weight_scale = self.create_parameter(
                     shape=[self.config.moe_config.shared_expert_intermediate_size * 2],
@@ -1256,16 +1673,35 @@ def __init__(self, config: FusedMultiTransformerConfig):
                     is_bias=False,
                 )
 
-            self.qkv_weights_scale.append(qkv_weight_scale)
+            if self.config.mla_config.use_mla():
+                if self.config.mla_config.q_lora_rank is None:
+                    self.q_proj_weights_scale.append(q_proj_weight_scale)
+                else:
+                    self.q_a_proj_weights_scale.append(q_a_proj_weight_scale)
+                    self.q_b_proj_weights_scale.append(q_b_proj_weight_scale)
+                self.kv_a_proj_with_mqa_weights_scale.append(kv_a_proj_with_mqa_weight_scale)
+                self.kv_b_proj_weights_scale.append(kv_b_proj_weight_scale)
+            else:
+                self.qkv_weights_scale.append(qkv_weight_scale)
+
             self.linear_weights_scale.append(linear_weight_scale)
             self.ffn1_weights_scale.append(ffn1_weight_scale)
             self.ffn2_weights_scale.append(ffn2_weight_scale)
 
-            if self.config.moe_config.use_shared_expert(i):
-                self.shared_expert_ffn1_weights_scale.append(shared_expert_ffn1_weight_scale)
-                self.shared_expert_ffn2_weights_scale.append(shared_expert_ffn2_weight_scale)
+            self.shared_expert_ffn1_weights_scale.append(shared_expert_ffn1_weight_scale)
+            self.shared_expert_ffn2_weights_scale.append(shared_expert_ffn2_weight_scale)
+
+            if self.config.mla_config.use_mla():
+                if self.config.mla_config.q_lora_rank is None:
+                    self._add_parameter(q_proj_weight_scale)
+                else:
+                    self._add_parameter(q_a_proj_weight_scale)
+                    self._add_parameter(q_b_proj_weight_scale)
+                self._add_parameter(kv_a_proj_with_mqa_weight_scale)
+                self._add_parameter(kv_b_proj_weight_scale)
+            else:
+                self._add_parameter(qkv_weight_scale)
 
-            self._add_parameter(qkv_weight_scale)
             self._add_parameter(linear_weight_scale)
             self._add_parameter(ffn1_weight_scale)
             self._add_parameter(ffn2_weight_scale)
@@ -1280,27 +1716,68 @@ def get_weight_create_dype(self):
     def init_weight_shape(self, config):
         super().init_weight_shape(config)
 
+        if self.config.mla_config.use_mla():
+            if self.config.mla_config.q_lora_rank is None:
+                self.q_proj_weight_shape = [
+                    self.num_heads * (self.config.mla_config.qk_head_dim),
+                    self.config.embed_dim,
+                ]
+            else:
+                self.q_a_proj_weight_shape = [self.config.mla_config.q_lora_rank, self.config.embed_dim]
+                self.q_b_proj_weight_shape = [
+                    self.num_heads * (self.config.mla_config.qk_head_dim),
+                    self.config.mla_config.q_lora_rank,
+                ]
+
+            self.kv_a_proj_with_mqa_weight_shape = [
+                self.config.mla_config.kv_lora_rank + self.config.mla_config.qk_rope_head_dim,
+                self.config.embed_dim,
+            ]
+            self.kv_b_proj_weight_shape = [
+                self.num_heads * (self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim),
+                self.config.mla_config.kv_lora_rank,
+            ]
+        else:
+            self.qkv_weight_shape = (
+                [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim]
+                if config.trans_qkvw
+                else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim]
+            )
+
         self.linear_weight_shape = [self.embed_dim, self.num_heads * self.head_dim]
         self.ffn1_weight_shape = (
-            [self.dim_feedforward * 2, self.embed_dim]
+            [self.intermediate_size * 2, self.embed_dim]
             if self.activation.endswith("glu")
-            else [self.dim_feedforward, self.embed_dim]
+            else [self.intermediate_size, self.embed_dim]
         )
-        self.ffn2_weight_shape = [self.embed_dim, self.dim_feedforward]
+        self.ffn2_weight_shape = [self.embed_dim, self.intermediate_size]
 
         if config.quant_type == "weight_only_int4":
-            self.qkv_weight_shape[0] //= 2
+            if self.config.mla_config.use_mla():
+                if self.config.mla_config.q_lora_rank is None:
+                    self.q_proj_weight_shape[0] //= 2
+                else:
+                    self.q_a_proj_weight_shape[0] //= 2
+                    self.q_b_proj_weight_shape[0] //= 2
+                self.kv_a_proj_with_mqa_weight_shape[0] //= 2
+                self.kv_b_proj_weight_shape[0] //= 2
+            else:
+                self.qkv_weight_shape[0] //= 2
             self.linear_weight_shape[0] //= 2
             self.ffn1_weight_shape[0] //= 2
             self.ffn2_weight_shape[0] //= 2
 
-        if self.config.moe_config.has_moe() is True:
+        if self.config.moe_config.has_moe():
             self.moe_ffn1_weight_shape = (
-                [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward * 2]
+                [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size * 2]
                 if self.activation.endswith("glu")
-                else [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward]
+                else [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size]
             )
-            self.moe_ffn2_weight_shape = [self.config.moe_config.num_experts, self.dim_feedforward, self.embed_dim]
+            self.moe_ffn2_weight_shape = [
+                self.config.moe_config.num_experts,
+                self.config.moe_config.moe_intermediate_size,
+                self.embed_dim,
+            ]
 
             if config.quant_type == "weight_only_int4":
                 if config.moe_config.has_shared_expert():
@@ -1319,22 +1796,105 @@ def init_weight_shape(self, config):
                 self.embed_dim,
                 self.config.moe_config.shared_expert_intermediate_size,
             ]
-            self.shared_expert_gate_weight_shape = [
-                self.embed_dim,
-                1,
-            ]
+            if self.config.moe_config.shared_expert_with_gate:
+                self.shared_expert_gate_weight_shape = [
+                    self.embed_dim,
+                    1,
+                ]
             if config.quant_type == "weight_only_int4":
                 self.shared_expert_ffn1_weight_shape[0] //= 2
                 self.shared_expert_ffn2_weight_shape[0] //= 2
 
     def compute_qkv_linear(self, ln_out, i):
-        return weight_only_linear(
-            ln_out,
-            weight=self.qkv_weights[i],
-            bias=self.qkv_biases[i],
-            weight_scale=self.qkv_weights_scale[i],
-            weight_dtype=self.weight_dtype,
-        )
+        if self.config.mla_config.use_mla():
+            if self.config.mla_config.q_lora_rank is not None:
+                query = weight_only_linear(
+                    ln_out,
+                    weight=self.q_a_proj_weights[i],
+                    weight_scale=self.q_a_proj_weights_scale[i],
+                    weight_dtype=self.weight_dtype,
+                )
+                query = self.norm_func(
+                    x=query,
+                    norm_weight=self.q_a_layernorm_weights[i],
+                    norm_bias=None,
+                    epsilon=self._epsilon,
+                    begin_norm_axis=1,
+                )[0]
+                query = weight_only_linear(
+                    query,
+                    weight=self.q_b_proj_weights[i],
+                    weight_scale=self.q_b_proj_weights_scale[i],
+                    weight_dtype=self.weight_dtype,
+                )
+            else:
+                query = weight_only_linear(
+                    ln_out,
+                    weight=self.q_proj_weights[i],
+                    weight_scale=self.q_proj_weights_scale[i],
+                    weight_dtype=self.weight_dtype,
+                )
+
+            query = query.reshape([-1, self.num_heads, self.config.mla_config.qk_head_dim])
+            query_nope, query_pe = query.split(
+                [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.qk_rope_head_dim], axis=-1
+            )
+
+            compressed_kv = weight_only_linear(
+                ln_out,
+                weight=self.kv_a_proj_with_mqa_weights[i],
+                weight_scale=self.kv_a_proj_with_mqa_weights_scale[i],
+                weight_dtype=self.weight_dtype,
+            )
+            compressed_kv, key_pe = compressed_kv.split(
+                [self.config.mla_config.kv_lora_rank, self.config.mla_config.qk_rope_head_dim], axis=-1
+            )
+            key_pe = key_pe.reshape([-1, 1, self.config.mla_config.qk_rope_head_dim])
+            compressed_kv = self.norm_func(
+                x=compressed_kv,
+                norm_weight=self.kv_a_layernorm_weights[i],
+                norm_bias=None,
+                epsilon=self._epsilon,
+                begin_norm_axis=1,
+            )[0]
+            key_value = weight_only_linear(
+                compressed_kv,
+                weight=self.kv_b_proj_weights[i],
+                weight_scale=self.kv_b_proj_weights_scale[i],
+                weight_dtype=self.weight_dtype,
+            )
+            key_value = key_value.reshape(
+                [-1, self.num_heads, self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim]
+            )
+            key_nope, value = key_value.split(
+                [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.v_head_dim], axis=-1
+            )
+
+            query_pe, key_pe = self.config.rotary_emb(self.position_ids, query_pe, key_pe)
+
+            query[..., self.config.mla_config.qk_nope_head_dim :] = query_pe
+            key = paddle.empty_like(query)
+            key[..., : self.config.mla_config.qk_nope_head_dim] = key_nope
+            key[..., self.config.mla_config.qk_nope_head_dim :] = key_pe
+
+            qkv_out = paddle.concat(
+                [
+                    query.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]),
+                    key.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]),
+                    value.reshape([-1, self.num_heads * self.config.mla_config.v_head_dim]),
+                ],
+                axis=-1,
+            )
+        else:
+            qkv_out = weight_only_linear(
+                ln_out,
+                weight=self.qkv_weights[i],
+                bias=self.qkv_biases[i],
+                weight_scale=self.qkv_weights_scale[i],
+                weight_dtype=self.weight_dtype,
+            )
+
+        return qkv_out
 
     def compute_out_linear(self, fmha_out, i):
         return weight_only_linear(
@@ -1344,22 +1904,6 @@ def compute_out_linear(self, fmha_out, i):
             weight_dtype=self.weight_dtype,
         )
 
-    def compute_fused_moe(self, tmp_out, i):
-        fused_moe_out = fused_moe(
-            tmp_out,
-            self.gate_weights[i],
-            self.ffn1_weights[i],
-            self.ffn2_weights[i],
-            self.ffn1_biases[i],
-            self.ffn1_weights_scale[i],
-            self.ffn2_biases[i],
-            self.ffn2_weights_scale[i],
-            self.quant_type,
-            self.config.moe_config.top_k,
-            self.config.moe_config.norm_topk_prob,
-        )
-        return fused_moe_out
-
     def compute_ffn1(self, tmp_out, i):
         return weight_only_linear(
             tmp_out,
@@ -1383,21 +1927,18 @@ def compute_shared_expert(self, tmp_out, i):
             weight_scale=self.shared_expert_ffn1_weights_scale[i],
             weight_dtype=self.weight_dtype,
         )
-
         ffn1_out = fused_bias_act(ffn1_out, None, act_method=self.activation)
-
         ffn2_out = weight_only_linear(
             ffn1_out,
             weight=self.shared_expert_ffn2_weights[i],
             weight_scale=self.shared_expert_ffn2_weights_scale[i],
             weight_dtype=self.weight_dtype,
         )
-
-        gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i])
-        gate_out = paddle.nn.functional.sigmoid(gate_out)
-
-        shared_expert_output = gate_out * ffn2_out
-        return shared_expert_output
+        if self.config.moe_config.shared_expert_with_gate:
+            gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i])
+            gate_out = paddle.nn.functional.sigmoid(gate_out)
+            return gate_out * ffn2_out
+        return ffn2_out
 
 
 class FusedMultiTransformerWeightOnlyPostLayernorm(
@@ -1415,8 +1956,8 @@ def __init__(self, config: FusedMultiTransformerConfig):
             config.embed_dim
         )
         assert config.num_heads > 0, "Expected nhead to be greater than 0, " "but received {}".format(config.num_heads)
-        assert config.dim_feedforward > 0, "Expected dim_feedforward to be greater than 0, but received {}".format(
-            config.dim_feedforward
+        assert config.intermediate_size > 0, "Expected intermediate_size to be greater than 0, but received {}".format(
+            config.intermediate_size
         )
         self._dtype = "float32"
         self._epsilon = config.epsilon
@@ -1432,9 +1973,9 @@ def __init__(self, config: FusedMultiTransformerConfig):
         assert self.head_dim * config.num_heads == config.embed_dim, "embed_dim must be divisible by num_heads"
 
         assert config.num_heads % config.nranks == 0
-        assert config.dim_feedforward % config.nranks == 0
+        assert config.intermediate_size % config.nranks == 0
 
-        dim_feedforward = config.dim_feedforward
+        intermediate_size = config.intermediate_size
         self.num_heads = config.num_heads
         self.cache_dtype = self.config.avx_config.cache_dtype
         self.kv_num_heads = config.kv_num_heads
@@ -1446,7 +1987,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
         self.weight_dtype = self._dtype
         self.create_params_type = self._dtype
         self.activation = config.activation
-        self.intermediate_size = dim_feedforward
+        self.intermediate_size = intermediate_size
         self.max_positions = self.config.avx_config.max_position_embeddings
         self.max_pos_embed = self.config.avx_config.max_position_embeddings
         self.hiddensize = self.num_heads * self.head_dim
@@ -1548,7 +2089,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
             gate_bias = None
             if gate_bias_attr:
                 gate_bias = self.create_parameter(
-                    shape=[config.dim_feedforward],
+                    shape=[config.intermediate_size],
                     attr=gate_bias_attr,
                     dtype=self._dtype,
                     is_bias=True,
@@ -1562,7 +2103,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
             up_bias = None
             if up_bias_attr:
                 up_bias = self.create_parameter(
-                    shape=[config.dim_feedforward],
+                    shape=[config.intermediate_size],
                     attr=up_bias_attr,
                     dtype=self._dtype,
                     is_bias=True,
@@ -1721,7 +2262,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
                 default_initializer=paddle.nn.initializer.Constant(-1),
             )
             ffn1_out_scale = self.create_parameter(
-                shape=[self.dim_feedforward * 2] if self.activation.endswith("glu") else [self.dim_feedforward],
+                shape=[self.intermediate_size * 2] if self.activation.endswith("glu") else [self.intermediate_size],
                 attr=ffn1_out_scale_attr,
                 dtype="float32",
                 is_bias=False,
@@ -1750,13 +2291,13 @@ def __init__(self, config: FusedMultiTransformerConfig):
             ffn2_shift = None
             if ffn2_shift_attr:
                 ffn2_shift = self.create_parameter(
-                    shape=[self.dim_feedforward], attr=ffn2_shift_attr, dtype=self._dtype, is_bias=False
+                    shape=[self.intermediate_size], attr=ffn2_shift_attr, dtype=self._dtype, is_bias=False
                 )
 
             ffn2_smooth = None
             if ffn2_smooth_attr:
                 ffn2_smooth = self.create_parameter(
-                    shape=[self.dim_feedforward], attr=ffn2_smooth_attr, dtype=self._dtype, is_bias=False
+                    shape=[self.intermediate_size], attr=ffn2_smooth_attr, dtype=self._dtype, is_bias=False
                 )
 
             self.qkv_out_scales.append(qkv_out_scale)
@@ -1859,8 +2400,7 @@ def init_weight(self):
             self.qkv_weights.append(qkv_weight)
             self.linear_weights.append(linear_weight)
 
-            if gate_weight is not None:
-                self.gate_weights.append(gate_weight)
+            self.gate_weights.append(gate_weight)
             self.ffn1_weights.append(ffn1_weight)
             self.ffn2_weights.append(ffn2_weight)
 
@@ -1896,11 +2436,11 @@ def init_weight_shape(self, config):
         if not paddle.is_compiled_with_rocm():
             self.linear_weight_shape = [self.embed_dim, self.num_heads * self.head_dim]
             self.ffn1_weight_shape = (
-                [self.dim_feedforward * 2, self.embed_dim]
+                [self.intermediate_size * 2, self.embed_dim]
                 if self.activation.endswith("glu")
-                else [self.dim_feedforward, self.embed_dim]
+                else [self.intermediate_size, self.embed_dim]
             )
-            self.ffn2_weight_shape = [self.embed_dim, self.dim_feedforward]
+            self.ffn2_weight_shape = [self.embed_dim, self.intermediate_size]
 
     def compute_layernorm_before_qkv(self, src, i):
         if i == 0:
@@ -1921,11 +2461,14 @@ def compute_layernorm_before_qkv(self, src, i):
         return ln_out
 
     def compute_qkv_linear(self, ln_out, i):
-        if paddle.is_compiled_with_rocm():
-            qkv_out = paddle.matmul(ln_out, self.qkv_weights[i])
+        if self.config.mla_config.use_mla():
+            raise NotImplementedError("Not support MLA yet.")
         else:
-            qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True)
-        return qkv_out
+            if paddle.is_compiled_with_rocm():
+                qkv_out = paddle.matmul(ln_out, self.qkv_weights[i])
+            else:
+                qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True)
+            return qkv_out
 
     def compute_fmha(
         self,
@@ -1982,7 +2525,7 @@ def compute_fmha(
             seq_lens,
             seq_lens + pre_caches_length,
             mask=attn_mask,
-            scale=float(self.head_dim**-0.5),
+            scale=self.softmax_scale,
         )
 
         fmha_out = transpose_remove_padding(qktv_out, seq_lens, padding_offset)
@@ -2190,8 +2733,9 @@ def compute_attn(
                 "none",  # cache_quant_type
                 self.use_neox_rotary_style,
                 kwargs.get("max_input_length", -1),
-                0.0,
-                0.0,
+                self.softmax_scale,  # softmax_scale
+                0.0,  # quant_max_bound
+                0.0,  # quant_min_bound
                 0.0,  # out_linear_in_scale
                 self.config.speculate_config.speculate_max_draft_token_num,
                 True,  # causal
@@ -2280,6 +2824,9 @@ def compute_attn(
                     rope_theta=self.config.rope_theta,
                 )[0]
 
+        if self.config.mla_config.use_mla():
+            fmha_out = fmha_out.reshape([-1, self.num_heads * self.config.mla_config.v_head_dim])
+
         out_linear_out = self.compute_out_linear(fmha_out, i)
 
         return out_linear_out
@@ -2387,6 +2934,7 @@ def compute_attn(
                 cache_quant_type_str,
                 self.use_neox_rotary_style,
                 kwargs.get("max_input_length", -1),
+                self.softmax_scale,
                 self.quant_max_bound,
                 self.quant_min_bound,
                 self.act_scales["out_linear_in_scale"][i],
@@ -2473,7 +3021,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
             ffn1_0_bias = None
             if ffn1_0_bias_attr:
                 ffn1_0_bias = self.create_parameter(
-                    shape=[self.dim_feedforward],
+                    shape=[self.intermediate_size],
                     attr=ffn1_0_bias_attr,
                     dtype=self._dtype,
                     is_bias=True,
@@ -2482,7 +3030,7 @@ def __init__(self, config: FusedMultiTransformerConfig):
             ffn1_1_bias = None
             if ffn1_1_bias_attr:
                 ffn1_1_bias = self.create_parameter(
-                    shape=[self.dim_feedforward],
+                    shape=[self.intermediate_size],
                     attr=ffn1_1_bias_attr,
                     dtype=self._dtype,
                     is_bias=True,
@@ -2580,9 +3128,9 @@ def init_weight_shape(self, config):
             else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim]
         )
         self.linear_weight_shape = [self.num_heads * self.head_dim, self.embed_dim]
-        self.ffn1_0_weight_shape = [self.dim_feedforward, self.embed_dim]
-        self.ffn1_1_weight_shape = [self.dim_feedforward, self.embed_dim]
-        self.ffn2_weight_shape = [self.embed_dim, self.dim_feedforward]
+        self.ffn1_0_weight_shape = [self.intermediate_size, self.embed_dim]
+        self.ffn1_1_weight_shape = [self.intermediate_size, self.embed_dim]
+        self.ffn2_weight_shape = [self.embed_dim, self.intermediate_size]
 
     def get_weight_create_dype(self, layer_name=None, layer_idx=None):
         """
@@ -2615,25 +3163,25 @@ def compute_layernorm_before_qkv(self, src, i):
         return ln_out
 
     def compute_qkv_linear(self, ln_out, i):
-        """
-        For fake parameter
-        """
-        if paddle.is_compiled_with_rocm() or float(paddle.version.cuda()) < 11.6:
-            qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True)
-            if self.qkv_biases[i] is not None:
-                qkv_out = paddle.add(qkv_out, self.qkv_biases[i])
-            return qkv_out
+        if self.config.mla_config.use_mla():
+            raise NotImplementedError("Not support MLA yet.")
         else:
-            qkv_out = fp8_gemm_fused(
-                ln_out,
-                self.qkv_weights[i],
-                transpose_x=False,
-                transpose_y=True,
-                bias=self.qkv_biases[i],
-                scale=self.weight_scales["qkv_weight_scale"][i] / (self.act_scales["qkv_in_scale"][i] * 448 * 448),
-                output_dtype=self._dtype,
-                act="identity",
-            )
+            if paddle.is_compiled_with_rocm() or float(paddle.version.cuda()) < 11.6:
+                qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True)
+                if self.qkv_biases[i] is not None:
+                    qkv_out = paddle.add(qkv_out, self.qkv_biases[i])
+                return qkv_out
+            else:
+                qkv_out = fp8_gemm_fused(
+                    ln_out,
+                    self.qkv_weights[i],
+                    transpose_x=False,
+                    transpose_y=True,
+                    bias=self.qkv_biases[i],
+                    scale=self.weight_scales["qkv_weight_scale"][i] / (self.act_scales["qkv_in_scale"][i] * 448 * 448),
+                    output_dtype=self._dtype,
+                    act="identity",
+                )
 
             return qkv_out
 
@@ -2746,6 +3294,7 @@ def compute_attn(
                 cache_quant_type_str,
                 self.use_neox_rotary_style,
                 kwargs.get("max_input_length", -1),
+                self.softmax_scale,
                 self.quant_max_bound,
                 self.quant_min_bound,
                 self.act_scales["out_linear_in_scale"][i],
diff --git a/paddlenlp/experimental/transformers/generation_utils.py b/paddlenlp/experimental/transformers/generation_utils.py
index f00123258f99..e2a75b911d4f 100644
--- a/paddlenlp/experimental/transformers/generation_utils.py
+++ b/paddlenlp/experimental/transformers/generation_utils.py
@@ -110,7 +110,7 @@ def to_static(self, output_path: str, config: dict):
             input_spec[16] = paddle.static.InputSpec(shape=[None, 2, 1], dtype="int64", name="tgt_pos")  # tgt_pos
         elif self.config["model_type"] and "gpt" in self.config.model_type:
             input_spec[2] = paddle.static.InputSpec(shape=[None], dtype="int64", name="position_ids")  # position_ids
-        model = paddle.jit.to_static(self.generate, input_spec=input_spec)
+        model = paddle.jit.to_static(self.generate, input_spec=input_spec, full_graph=True)
         paddle.jit.save(
             model, output_path, skip_prune_program=True
         )  # Note(Zhengzekang): If we prune program it may cause some inference error.
@@ -539,7 +539,7 @@ def to_static(self, output_path: str, config: dict):
             ]
             input_spec.extend(speculate_spec)
 
-        model = paddle.jit.to_static(self.generate, input_spec=input_spec)
+        model = paddle.jit.to_static(self.generate, input_spec=input_spec, full_graph=True)
         paddle.jit.save(
             model, output_path, skip_prune_program=True
         )  # Note(Zhengzekang): If we prune program it may cause some inference error.
@@ -735,23 +735,24 @@ def _post_process_(
             if self.config.tensor_parallel_degree > 1:
                 paddle.distributed.broadcast(next_tokens, 0)
 
-            from paddlenlp_ops import update_inputs_v2
+            with paddle.base.framework._stride_in_no_check_dy2st_diff():
+                from paddlenlp_ops import update_inputs_v2
 
-            update_inputs_v2(
-                model_kwargs["stop_flags"],
-                model_kwargs["step_idx"],
-                model_kwargs["not_need_stop"],
-                model_kwargs["seq_lens_this_time"],
-                model_kwargs["seq_lens_encoder"],
-                model_kwargs["seq_lens_decoder"],
-                model_kwargs["max_dec_len"],
-                model_kwargs["input_ids"],
-                model_kwargs["stop_nums"],
-                next_tokens,
-                model_kwargs["is_block_step"],
-                eos_token_id,
-                model_kwargs["next_tokens"],
-            )
+                update_inputs_v2(
+                    model_kwargs["stop_flags"],
+                    model_kwargs["step_idx"],
+                    model_kwargs["not_need_stop"],
+                    model_kwargs["seq_lens_this_time"],
+                    model_kwargs["seq_lens_encoder"],
+                    model_kwargs["seq_lens_decoder"],
+                    model_kwargs["max_dec_len"],
+                    model_kwargs["input_ids"],
+                    model_kwargs["stop_nums"],
+                    next_tokens,
+                    model_kwargs["is_block_step"],
+                    eos_token_id,
+                    model_kwargs["next_tokens"],
+                )
 
             from paddlenlp_ops import save_output
 
@@ -830,28 +831,26 @@ def _post_process_(
             probs = F.softmax(logits)
 
             from paddlenlp_ops import (
+                speculate_clear_accept_nums,
                 speculate_save_output,
                 speculate_set_value_by_flags_and_idx,
-                speculate_verify_and_update,
+                speculate_update,
+                speculate_verify,
                 top_p_candidates,
             )
 
             verify_scores, verify_tokens, actual_candidate_len = top_p_candidates(
                 probs, top_p, model_kwargs["output_padding_offset"], self.max_candidate_len, self.max_seq_len
-            )  # [token_num, max_candidate_len]
+            )
 
-            # Speculate Verify And Update
-            speculate_verify_and_update(
+            speculate_verify(
                 model_kwargs["accept_tokens"],
                 model_kwargs["accept_num"],
                 model_kwargs["step_idx"],
+                model_kwargs["stop_flags"],
                 model_kwargs["seq_lens_encoder"],
                 model_kwargs["seq_lens_decoder"],
-                model_kwargs["stop_flags"],
-                model_kwargs["not_need_stop"],
-                model_kwargs[
-                    "draft_tokens"
-                ],  # Both input and output, need to write the last 1 token accepted to position 0.
+                model_kwargs["draft_tokens"],  # 既是输入又是输出，需要把接收的最后1个token写入到第0个位置
                 model_kwargs["seq_lens_this_time"],
                 verify_tokens,
                 verify_scores,
@@ -867,6 +866,25 @@ def _post_process_(
                 True,  # enable_topp
             )
 
+            if self.config.tensor_parallel_degree > 1:
+                paddle.distributed.broadcast(model_kwargs["accept_tokens"], 0)
+                paddle.distributed.broadcast(model_kwargs["accept_num"], 0)
+                paddle.distributed.broadcast(model_kwargs["step_idx"], 0)
+                paddle.distributed.broadcast(model_kwargs["stop_flags"], 0)
+
+            speculate_update(
+                model_kwargs["seq_lens_encoder"],
+                model_kwargs["seq_lens_decoder"],
+                model_kwargs["not_need_stop"],
+                model_kwargs["draft_tokens"],
+                model_kwargs["actual_draft_token_num"],
+                model_kwargs["accept_tokens"],
+                model_kwargs["accept_num"],
+                model_kwargs["stop_flags"],
+                model_kwargs["seq_lens_this_time"],
+                model_kwargs["is_block_step"],
+            )
+
             speculate_save_output(
                 model_kwargs["accept_tokens"],
                 model_kwargs["accept_num"],
@@ -875,7 +893,7 @@ def _post_process_(
             )
 
             # If seq_lens_decoder is 0 (means stop), accept_num should be set to 0
-            model_kwargs["accept_num"][model_kwargs["seq_lens_decoder"] == 0] = 0
+            speculate_clear_accept_nums(model_kwargs["accept_num"], model_kwargs["seq_lens_decoder"])
 
             # Update pre_ids through accept tokens
             speculate_set_value_by_flags_and_idx(
@@ -1016,7 +1034,7 @@ def to_static(self, output_path: str, config: dict):
             config.get("logits_processors", None),
             None,
         ]
-        model = paddle.jit.to_static(self.generate, input_spec=input_spec)
+        model = paddle.jit.to_static(self.generate, input_spec=input_spec, full_graph=True)
         paddle.jit.save(
             model, output_path, skip_prune_program=True
         )  # Note(Zhengzekang): If we prune program it may cause some inference error.
diff --git a/paddlenlp/experimental/transformers/llama/modeling.py b/paddlenlp/experimental/transformers/llama/modeling.py
index ffa05a241d3a..24b68e63f3f4 100644
--- a/paddlenlp/experimental/transformers/llama/modeling.py
+++ b/paddlenlp/experimental/transformers/llama/modeling.py
@@ -188,7 +188,7 @@ def __init__(self, config: LlamaConfig):
             embed_dim=self.hidden_size,
             num_heads=self.num_attention_heads,
             kv_num_heads=self.num_layers,
-            dim_feedforward=self.intermediate_size,
+            intermediate_size=self.intermediate_size,
             activation="silu",
             num_layers=self.num_layers,
             ln_scale_attrs=ln_scale_attrs,
@@ -616,7 +616,7 @@ def __init__(self, config: LlamaConfig):
             embed_dim=self.hidden_size,
             num_heads=self.num_attention_heads,
             kv_num_heads=self.num_key_value_heads,
-            dim_feedforward=self.intermediate_size,
+            intermediate_size=self.intermediate_size,
             quant_type=self.quant_type,
             activation="swiglu",
             num_layers=config.num_hidden_layers,
@@ -1431,6 +1431,7 @@ def forward(
         kwargs["cu_seqlens_k"] = cu_seqlens_k
         kwargs["padding_offsets"] = padding_offset
         kwargs["max_input_length"] = self.max_seq_len
+        kwargs["block_size"] = self.block_size
 
         inputs_embeds = self.embed_tokens(ids_remove_padding)
 
diff --git a/paddlenlp/experimental/transformers/mixtral/modeling.py b/paddlenlp/experimental/transformers/mixtral/modeling.py
index 27e638d9d9f1..73479fc243d1 100644
--- a/paddlenlp/experimental/transformers/mixtral/modeling.py
+++ b/paddlenlp/experimental/transformers/mixtral/modeling.py
@@ -289,13 +289,14 @@ def __init__(self, config: MixtralConfig):
             top_k=self.moe_topk,
             norm_topk_prob=True,
             moe_every2=self.moe_every2,
+            moe_intermediate_size=self.intermediate_size,
         )
 
         transformer_config = FusedMultiTransformerConfig(
             embed_dim=self.hidden_size,
             num_heads=self.num_attention_heads,
             kv_num_heads=self.num_key_value_heads,
-            dim_feedforward=self.intermediate_size,
+            intermediate_size=self.intermediate_size,
             quant_type=self.quant_type,
             activation="swiglu",
             num_layers=config.num_hidden_layers,
@@ -643,9 +644,9 @@ def set_state_dict(self, state_dict):
                     )
                     ffn1_quanted_weight_list.append(
                         ffn1_quanted_weight_list_i.reshape(
-                            [self.transformer_block.embed_dim, self.transformer_block.dim_feedforward * 2]
+                            [self.transformer_block.embed_dim, self.transformer_block.intermediate_size * 2]
                             if self.quant_type == "weight_only_int8"
-                            else [self.transformer_block.embed_dim, self.transformer_block.dim_feedforward]
+                            else [self.transformer_block.embed_dim, self.transformer_block.intermediate_size]
                         )
                     )
                     ffn1_quanted_weight_scale.append(ffn1_quanted_weight_scale_i)
@@ -682,9 +683,9 @@ def set_state_dict(self, state_dict):
                     )
                     ffn2_quanted_weight_list.append(
                         ffn2_quanted_weight_list_i.reshape(
-                            [self.transformer_block.dim_feedforward, self.transformer_block.embed_dim]
+                            [self.transformer_block.intermediate_size, self.transformer_block.embed_dim]
                             if self.quant_type == "weight_only_int8"
-                            else [self.transformer_block.dim_feedforward, self.transformer_block.embed_dim // 2]
+                            else [self.transformer_block.intermediate_size, self.transformer_block.embed_dim // 2]
                         )
                     )
                     ffn2_quanted_weight_scale.append(ffn2_quanted_weight_scale_i)
diff --git a/paddlenlp/experimental/transformers/proposers.py b/paddlenlp/experimental/transformers/proposers.py
index def236da54ff..6196d621ee9f 100644
--- a/paddlenlp/experimental/transformers/proposers.py
+++ b/paddlenlp/experimental/transformers/proposers.py
@@ -108,6 +108,7 @@ def run(self, model_inputs: dict[str, paddle.Tensor], **kargs):
             seq_lens_this_time,
             seq_lens_encoder,
             seq_lens_decoder,
+            model_inputs["max_length"].cpu(),
             kargs["real_batch_size"],
             self.max_ngram_size,
             self.max_draft_token_num,
@@ -144,6 +145,7 @@ def __init__(self, args, **kwargs):
         assert self.draft_type in (
             "draft_model",
             "eagle",
+            "mtp",
         ), f"draft_type support [draft_model, eagle], but get {self.draft_type}"
 
         self.max_draft_tokens = self.args.speculate_max_draft_token_num
@@ -170,7 +172,6 @@ def init_predictor(self):
         tensor_parallel_rank, tensor_parallel_degree = llm_utils.init_dist_env()
 
         self.config = AutoConfig.from_pretrained(self.args.draft_model_name_or_path)
-        self.config["decode_strategy"] = "draft_model_sample"
         self.model = AutoInferenceModelForCausalLM.from_pretrained(
             self.args.model_name_or_path,
             config=self.config,
@@ -179,7 +180,7 @@ def init_predictor(self):
             dtype=self.args.dtype,
             tensor_parallel_degree=tensor_parallel_degree,
             tensor_parallel_rank=tensor_parallel_rank,
-            is_eagle=True if self.draft_type == "eagle" else False,
+            spec_model_type=self.draft_type,
         )
 
         # prepare model_inputs
@@ -282,7 +283,7 @@ def run_preprocess(self, share_inputs):
             share_inputs["stop_flags"],
             share_inputs["draft_tokens"],
             self.max_draft_tokens,
-            self.draft_type,
+            self.draft_type in ["eagle", "mtp"],
         )
 
     def run_infer(self, share_inputs, **kwargs):
@@ -371,7 +372,6 @@ def run_infer(self, share_inputs, **kwargs):
             while self.model_inputs["not_need_stop"] and self.model_inputs["substep"] < self.max_draft_tokens:
                 self.last_seq_lens_this_time[:] = self.model_inputs["seq_lens_this_time"][:]
                 output_hidden_states = self.model.generate(**self.model_inputs)
-
                 self.model_inputs["substep"] += 1
                 if self.model_inputs["not_need_stop"] and self.model_inputs["substep"] < self.actual_draft_token_num:
                     self.model_inputs["hidden_states"] = eagle_get_self_hidden_states(
diff --git a/paddlenlp/experimental/transformers/qwen2/modeling.py b/paddlenlp/experimental/transformers/qwen2/modeling.py
index 6098079d9084..7624dfe983e3 100644
--- a/paddlenlp/experimental/transformers/qwen2/modeling.py
+++ b/paddlenlp/experimental/transformers/qwen2/modeling.py
@@ -87,8 +87,10 @@ def forward(self, x):
 
 @register_base_model
 class Qwen2InferenceModel(Qwen2PretrainedModel):
-    def __init__(self, config: Qwen2Config):
+    def __init__(self, config: Qwen2Config, base_model_prefix: str):
         super(Qwen2PretrainedModel, self).__init__(config)
+        self.base_model_prefix = base_model_prefix
+
         self.vocab_size = config.vocab_size
         self.hidden_size = config.hidden_size
         self.num_attention_heads = config.num_attention_heads
@@ -306,7 +308,7 @@ def __init__(self, config: Qwen2Config):
             embed_dim=self.hidden_size,
             num_heads=self.num_attention_heads,
             kv_num_heads=self.num_key_value_heads,
-            dim_feedforward=self.intermediate_size,
+            intermediate_size=self.intermediate_size,
             quant_type=self.quant_type,
             activation="swiglu",
             num_layers=config.num_hidden_layers,
@@ -771,7 +773,9 @@ def set_state_dict(self, state_dict):
                     ffn1_weight.cast(self.transformer_block.ffn1_weights[idx].dtype)
                 )
 
-            ffn2_weight = paddle.to_tensor(state_dict[f"{model_prefix}.mlp.down_proj.weight"])
+            ffn2_weight = paddle.to_tensor(state_dict[f"{model_prefix}.mlp.down_proj.weight"]).cast(
+                paddle.get_default_dtype()
+            )
             if self.use_weight_only:
                 ffn2_quanted_weight, ffn2_weight_scale = weight_quantize(ffn2_weight, algo=self.quant_algo)
                 self.transformer_block.ffn2_weights[idx].set_value(ffn2_quanted_weight)
@@ -1051,9 +1055,11 @@ def forward(
 
 
 class Qwen2ForCausalLMInferenceModel(GenerationInferenceModel, Qwen2PretrainedModel):
-    def __init__(self, config: Qwen2Config, **kwargs):
-        super(Qwen2ForCausalLMInferenceModel, self).__init__(config)
-        self.qwen2 = Qwen2InferenceModel(config)
+    def __init__(self, config: Qwen2Config, base_model_prefix: str = "qwen2"):
+        super().__init__(config)
+        self.base_model_prefix = base_model_prefix
+
+        self.qwen2 = Qwen2InferenceModel(config, base_model_prefix)
         if config.tie_word_embeddings:
             self.lm_head = Qwen2LMHead(config, embedding_weights=self.qwen2.embed_tokens.weight, transpose_y=True)
             self.tie_weights()
@@ -1214,9 +1220,9 @@ def set_state_dict(self, state_dict):
 
 @register_base_model
 class Qwen2BlockInferenceModel(Qwen2InferenceModel):
-    def __init__(self, config: Qwen2Config):
+    def __init__(self, config: Qwen2Config, base_model_prefix: str):
         self.append_attn = config.append_attn
-        super().__init__(config)
+        super().__init__(config, base_model_prefix)
         self.max_seq_len = config.max_seq_len
         self.block_size = config.block_size
 
@@ -1309,13 +1315,15 @@ class Qwen2ForCausalLMBlockInferenceModel(GenerationBlockInferenceModel, Qwen2Pr
 
     _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
 
-    def __init__(self, config):
+    def __init__(self, config: Qwen2Config, base_model_prefix: str = "qwen2"):
         super().__init__(config)
+        self.base_model_prefix = base_model_prefix
+
         self.max_candidate_len = config.get("speculate_max_candidate_len", 5)
         self.verify_window = config.get("speculate_verify_window", 2)
         self.max_seq_len = config.max_seq_len
 
-        self.qwen2 = Qwen2BlockInferenceModel(config)
+        self.qwen2 = Qwen2BlockInferenceModel(config, base_model_prefix)
         if config.tie_word_embeddings:
             self.lm_head = Qwen2LMHead(config, embedding_weights=self.qwen2.embed_tokens.weight, transpose_y=True)
             self.tie_weights()
@@ -1347,6 +1355,31 @@ def get_tensor_parallel_split_mappings(num_layers):
                 "layers.0.mlp.down_proj.weight": partial(fn, is_column=False),
             }
 
+            if "a8w8" in config.quant_type:
+                if config.quantization_config.shift_smooth_all_linears:
+                    base_actions["layers.0.self_attn.o_proj.shift_bias"] = partial(fn, is_column=True)
+                    base_actions["layers.0.self_attn.o_proj.smooth_weight"] = partial(fn, is_column=True)
+                    base_actions["layers.0.mlp.down_proj.shift_bias"] = partial(fn, is_column=True)
+                    base_actions["layers.0.mlp.down_proj.smooth_weight"] = partial(fn, is_column=True)
+
+                if config.quantization_config.shift:
+                    if config.fuse_attention_qkv:
+                        base_actions["layers.0.self_attn.qkv_proj.bias"] = partial(fn, is_column=True)
+                    else:
+                        base_actions["layers.0.self_attn.q_proj.bias"] = partial(fn, is_column=True)
+                        # if we have enough num_key_value_heads to split, then split it.
+                        if config.num_key_value_heads % config.tensor_parallel_degree == 0:
+                            base_actions["layers.0.self_attn.k_proj.bias"] = partial(fn, is_column=True)
+                            base_actions["layers.0.self_attn.v_proj.bias"] = partial(fn, is_column=True)
+
+                    if config.fuse_attention_ffn:
+                        base_actions["layers.0.mlp.gate_up_fused_proj.bias"] = partial(
+                            fn, is_column=True, is_naive_2fuse=True
+                        )
+                    else:
+                        base_actions["layers.0.mlp.gate_proj.bias"] = partial(fn, is_column=True)
+                        base_actions["layers.0.mlp.up_proj.bias"] = partial(fn, is_column=True)
+
             # Column Linear
             if config.fuse_attention_qkv:
                 base_actions["layers.0.self_attn.qkv_proj.weight"] = partial(fn, is_column=True)
@@ -1520,6 +1553,18 @@ class Qwen2VLForConditionalGenerationBlockInferenceModel(Qwen2ForCausalLMBlockIn
     """
 
     # NOTE: (changwenbin) This function corresponds to QWen2-VL's second part, only used for QWen2-VL.
-    def __init__(self, config):
+    def __init__(self, config: Qwen2Config):
+        super().__init__(config)
+        self.qwen2.base_model_prefix = "model"
+
+
+class Qwen2_5_VLForConditionalGenerationBlockInferenceModel(Qwen2ForCausalLMBlockInferenceModel):
+    """
+    NOTE: (changwenbin) This class inherits from Qwen2ForCausalLMBlockInferenceModel.
+    Used only for QWen2-5-VL's second part.
+    """
+
+    # NOTE: (changwenbin) This function corresponds to QWen2-5-VL's second part, only used for QWen2-5-VL.
+    def __init__(self, config: Qwen2Config):
         super().__init__(config)
         self.qwen2.base_model_prefix = "model"
diff --git a/paddlenlp/experimental/transformers/qwen2_moe/modeling.py b/paddlenlp/experimental/transformers/qwen2_moe/modeling.py
index 1aa0969b4a11..9b1600fafd58 100644
--- a/paddlenlp/experimental/transformers/qwen2_moe/modeling.py
+++ b/paddlenlp/experimental/transformers/qwen2_moe/modeling.py
@@ -78,7 +78,6 @@ def __init__(self, config: Qwen2MoeConfig):
         self.num_key_value_heads = config.num_key_value_heads
         self.num_layers = config.num_hidden_layers
         self.rms_norm_eps = config.rms_norm_eps
-        self.max_position_embeddings = config.max_position_embeddings
         self.quant_type = config.quant_type
         self.rope_theta = config.rope_theta
 
@@ -217,7 +216,7 @@ def __init__(self, config: Qwen2MoeConfig):
             num_experts=self.num_experts,
             top_k=self.moe_topk,
             norm_topk_prob=self.norm_topk_prob,
-            moe_every2=False,
+            moe_intermediate_size=self.moe_intermediate_size,
             shared_expert_intermediate_size=self.shared_expert_intermediate_size,
             shared_expert_ffn1_weight_attrs=shared_expert_ffn1_weight_attrs,
             shared_expert_ffn1_weight_scale_attrs=shared_expert_ffn1_weight_scale_attrs,
@@ -230,7 +229,7 @@ def __init__(self, config: Qwen2MoeConfig):
             embed_dim=self.hidden_size,
             num_heads=self.num_attention_heads,
             kv_num_heads=self.num_key_value_heads,
-            dim_feedforward=self.moe_intermediate_size,
+            intermediate_size=self.moe_intermediate_size,
             quant_type=self.quant_type,
             activation="swiglu",
             num_layers=config.num_hidden_layers,
diff --git a/paddlenlp/generation/utils.py b/paddlenlp/generation/utils.py
index 151b2648e987..7d6ee91d259f 100644
--- a/paddlenlp/generation/utils.py
+++ b/paddlenlp/generation/utils.py
@@ -1437,16 +1437,8 @@ def _post_process_(
             outputs, input_ids, cur_len_gpu, origin_len_gpu, scores, unfinished_flag, model_kwargs, pad_token_id
         )
 
-        if hasattr(paddle.framework, "_no_check_dy2st_diff"):
-            # TODO(daisiming): _no_check_dy2st_diff is used to turn off the checking of behavior
-            # inconsistency between dynamic graph and static graph. _no_check_dy2st_diff should be
-            # removed after static graphs support inplace and stride.
-            with paddle.framework._no_check_dy2st_diff():
-                paddle.increment(cur_len)
-                paddle.increment(cur_len_gpu)
-        else:
-            paddle.increment(cur_len)
-            paddle.increment(cur_len_gpu)
+        cur_len += 1
+        cur_len_gpu += 1
 
         attn_mask = model_kwargs["attention_mask"]
         # make the shape of attention_mask = (-1, -1, -1, -1) in dy2static.
@@ -1454,38 +1446,19 @@ def _post_process_(
         model_kwargs["cache"] = outputs[1] if isinstance(outputs, tuple) else None
         max_new_tokens = paddle.full([1], max_new_tokens + cur_len - 1, dtype="int64")
 
-        if hasattr(paddle.framework, "_no_check_dy2st_diff"):
-            # TODO(daisiming): _no_check_dy2st_diff is used to turn off the checking of behavior
-            # inconsistency between dynamic graph and static graph. _no_check_dy2st_diff should be
-            # removed after static graphs support inplace and stride.
-            with paddle.framework._no_check_dy2st_diff():
-                while cur_len < max_new_tokens and paddle.any(unfinished_flag):
-                    input_ids, scores, unfinished_flag, model_kwargs = _post_process_(
-                        _forward_(**model_kwargs),
-                        input_ids,
-                        cur_len_gpu,
-                        origin_len_gpu,
-                        scores,
-                        unfinished_flag,
-                        model_kwargs,
-                        pad_token_id,
-                    )
-                    paddle.increment(cur_len)
-                    paddle.increment(cur_len_gpu)
-        else:
-            while cur_len < max_new_tokens and paddle.any(unfinished_flag):
-                input_ids, scores, unfinished_flag, model_kwargs = _post_process_(
-                    _forward_(**model_kwargs),
-                    input_ids,
-                    cur_len_gpu,
-                    origin_len_gpu,
-                    scores,
-                    unfinished_flag,
-                    model_kwargs,
-                    pad_token_id,
-                )
-                paddle.increment(cur_len)
-                paddle.increment(cur_len_gpu)
+        while cur_len < max_new_tokens and paddle.any(unfinished_flag):
+            input_ids, scores, unfinished_flag, model_kwargs = _post_process_(
+                _forward_(**model_kwargs),
+                input_ids,
+                cur_len_gpu,
+                origin_len_gpu,
+                scores,
+                unfinished_flag,
+                model_kwargs,
+                pad_token_id,
+            )
+            cur_len += 1
+            cur_len_gpu += 1
 
         return input_ids[:, origin_len:], scores
 
diff --git a/paddlenlp/mergekit/merge_config.py b/paddlenlp/mergekit/merge_config.py
index 484b4898f2dc..e440cec580f8 100644
--- a/paddlenlp/mergekit/merge_config.py
+++ b/paddlenlp/mergekit/merge_config.py
@@ -17,10 +17,7 @@
 from dataclasses import asdict, dataclass, field
 from typing import List, Optional
 
-import paddle
-
 from paddlenlp.utils.env import MERGE_CONFIG_NAME
-from paddlenlp.utils.log import logger
 
 
 @dataclass
@@ -30,7 +27,6 @@ class MergeConfig:
     """
 
     # Common parameters
-    device: str = field(default="cpu", metadata={"help": "Device to use for the merge.ex cpu、 gpu、low_gpu_mem"})
     tensor_type: str = field(
         default="np", metadata={"help": "Tensor type to use for the merge. Choose np(CPU Only) or pd (CPU/GPU)"}
     )
@@ -39,6 +35,8 @@ class MergeConfig:
     merge_method: str = field(default="linear", metadata={"help": "The merge strategy."})
     merge_type: str = field(default="linear", metadata={"help": "The type of merge process."})
     sparsify_type: str = field(default=None, metadata={"help": "The type of sparsify process."})
+    split_pieces: int = field(default=8, metadata={"help": "Split large tensor to multi-piece"})
+    max_tensor_mem: float = field(default=0.5, metadata={"help": "Split tensor if exceed setting max_tensor_mem."})
 
     # Model parameters
     model_path_list: Optional[List[str]] = field(default=None, metadata={"help": "Merge model name or path list"})
@@ -46,7 +44,11 @@ class MergeConfig:
         default=None, metadata={"help": "Merge model name or path string.(split by ',')"}
     )
     base_model_path: str = field(default=None, metadata={"help": "Base model name or path."})
-    output_path: str = field(default=None, metadata={"help": "Base model name or path."})
+    output_path: str = field(default=None, metadata={"help": "Output model name or path."})
+    lora_model_path: str = field(default=None, metadata={"help": "LoRA model name or path."})
+    copy_file_list: Optional[List[str]] = field(
+        default=None, metadata={"help": "Copy file list from base model path or first model path."}
+    )
     # merge parameters
     weight_list: Optional[List[float]] = field(
         default=None, metadata={"help": "Relative (or absolute if normalize=False) weighting of a given tensor"}
@@ -75,32 +77,43 @@ def config_check(self):
             os.makedirs(self.output_path, exist_ok=True)
         if self.tensor_type not in ["np", "pd"]:
             raise ValueError(f"Unsupported tensor type: {self.tensor_type}. Support 'np' and 'pd' only.")
-        if self.device == "gpu" and self.tensor_type == "np":
-            logger.warning("np only support cpu device, but got gpu. Setting `device` to `cpu`.")
-            self.device = "cpu"
-
-        elif self.merge_method not in ["linear", "ties", "slerp", "della_linear", "della", "dare_linear", "dare_ties"]:
-            raise ValueError(
-                f"Unsupported merge strategy: {self.merge_method}. Please choose one from ['linear', 'slerp']."
-            )
-        if self.model_path_str is not None:
-            self.model_path_list = self.model_path_str.split(",")
-        if self.model_path_list is not None:
-            if not isinstance(self.model_path_list, list) or len(self.model_path_list) < 2:
-                raise ValueError(f"Please specify the model_path_list at least two. But got {self.model_path_list}")
-            if self.weight_list is None:
-                self.weight_list = [1.0] * len(self.model_path_list)
-                self.normalize = True
-            if len(self.model_path_list) != len(self.weight_list):
-                raise ValueError("The length of model_path_list and weight_list must be the same.")
-        if self.reserve_p < 0 or self.reserve_p > 1:
-            raise ValueError("reserve_p must be between 0 and 1.")
-        if "della" in self.merge_method or self.sparsify_type == "magprune":
-            if self.reserve_p <= self.epsilon / 2 or self.reserve_p >= (1 - self.epsilon):
+        if self.lora_model_path is not None:
+            if self.base_model_path is None:
+                raise ValueError("Please specify the base_model_path when using LoRA merge.")
+            self.tensor_type = "pd"
+
+        if self.lora_model_path is None:
+            if self.merge_method not in [
+                "linear",
+                "ties",
+                "slerp",
+                "della_linear",
+                "della",
+                "dare_linear",
+                "dare_ties",
+            ]:
                 raise ValueError(
-                    f"Error: reserve_p +- epsilon/2 must be in the range (0, 1). reserve_p + epsilon/2 = {self.reserve_p + self.epsilon / 2 }, reserve_p - epsilon/2 = {self.reserve_p - self.epsilon / 2 }"
+                    f"Unsupported merge strategy: {self.merge_method}. Please choose one from ['linear', 'slerp', 'ties', 'della_linear', 'della', ']."
                 )
-        paddle.set_device(self.device)
+            if self.model_path_str is not None:
+                self.model_path_list = self.model_path_str.split(",")
+            if self.model_path_list is not None:
+                if not isinstance(self.model_path_list, list) or len(self.model_path_list) < 2:
+                    raise ValueError(
+                        f"Please specify the model_path_list at least two. But got {self.model_path_list}"
+                    )
+                if self.weight_list is None:
+                    self.weight_list = [1.0] * len(self.model_path_list)
+                    self.normalize = True
+                if len(self.model_path_list) != len(self.weight_list):
+                    raise ValueError("The length of model_path_list and weight_list must be the same.")
+            if self.reserve_p < 0 or self.reserve_p > 1:
+                raise ValueError("reserve_p must be between 0 and 1.")
+            if "della" in self.merge_method or self.sparsify_type == "magprune":
+                if self.reserve_p <= self.epsilon / 2 or self.reserve_p >= (1 - self.epsilon):
+                    raise ValueError(
+                        f"Error: reserve_p +- epsilon/2 must be in the range (0, 1). reserve_p + epsilon/2 = {self.reserve_p + self.epsilon / 2 }, reserve_p - epsilon/2 = {self.reserve_p - self.epsilon / 2 }"
+                    )
 
     @property
     def __dict__(self):
diff --git a/paddlenlp/mergekit/merge_method.py b/paddlenlp/mergekit/merge_method.py
index bd92f6da4d50..737312a75be5 100644
--- a/paddlenlp/mergekit/merge_method.py
+++ b/paddlenlp/mergekit/merge_method.py
@@ -48,11 +48,10 @@ def linear(self, tensor_list):
             tensor_output = sum(weight * tensor for weight, tensor in zip(weight_list, tensor_list))
             return tensor_output
         elif self.merge_config.tensor_type == "pd":
-            stacked_tensors = paddle.stack(tensor_list, axis=0)
-            weights = paddle.to_tensor(weight_list, dtype=stacked_tensors.dtype)
-            weights = weights.reshape([-1] + [1] * (len(stacked_tensors.shape) - 1))
-            weighted_sum = paddle.sum(stacked_tensors * weights, axis=0)
-            return weighted_sum
+            tensor_output = paddle.zeros_like(tensor_list[0])
+            for i, tensor in enumerate(tensor_list):
+                tensor_output += tensor * weight_list[i]
+            return tensor_output
         else:
             raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}")
 
@@ -155,28 +154,34 @@ def ties(self, tensor_list):
 
         elif self.merge_config.tensor_type == "pd":
             mask_dtype = tensor_list[0].dtype
-            weight_list = self.merge_config.weight_list
-            stacked_tensors = paddle.stack(tensor_list, axis=0)
-            weights = paddle.to_tensor(weight_list, dtype=stacked_tensors.dtype)
-            weights = weights.reshape([-1] + [1] * (len(stacked_tensors.shape) - 1))
-            weighted_tensors = stacked_tensors * weights
+
             # Elect majority sign
-            if self.merge_config.ties_elect_type == "sum":
-                majority_sign = (paddle.sum(weighted_tensors, axis=0) >= 0).astype(mask_dtype) * 2 - 1
-            elif self.merge_config.ties_elect_type == "count":
-                stacked_signs = paddle.sign(stacked_tensors).astype(mask_dtype)
-                majority_sign = (paddle.sum(stacked_signs, axis=0) >= 0).astype(mask_dtype) * 2 - 1
-            else:
-                raise NotImplementedError(f"ties_elect_type: {self.merge_config.ties_elect_type} is unknown.")
+            majority_sign = paddle.zeros_like(tensor_list[0])
+            for i, tensor in enumerate(tensor_list):
+                if self.merge_config.ties_elect_type == "sum":
+                    majority_sign += tensor * self.merge_config.weight_list[i]
+                elif self.merge_config.ties_elect_type == "count":
+                    majority_sign += tensor.sign()
+                else:
+                    raise NotImplementedError(f"ties_elect_type: {self.merge_config.ties_elect_type} is unknown.")
+            majority_sign = (majority_sign >= 0).astype(mask_dtype) * 2 - 1
 
             # Merge
-            stacked_masks = (paddle.sign(weighted_tensors) == majority_sign).astype(mask_dtype)
-            masked_tensors = stacked_masks * weighted_tensors
-            merge_tensor = paddle.sum(masked_tensors, axis=0)
+            merge_tensor = paddle.zeros_like(tensor_list[0])
+            if self.merge_config.normalize:
+                divisor = paddle.zeros_like(tensor_list[0])
+            for i, tensor in enumerate(tensor_list):
+                if self.merge_config.normalize:
+                    mask = (tensor.sign() == majority_sign).astype(mask_dtype) * self.merge_config.weight_list[i]
+                    divisor += mask
+                    merge_tensor += mask * tensor
+                else:
+                    merge_tensor += (
+                        (tensor.sign() == majority_sign).astype(mask_dtype) * tensor * self.merge_config.weight_list[i]
+                    )
+
             # Normalize
             if self.merge_config.normalize:
-                weight_masks = stacked_masks * weights
-                divisor = paddle.sum(weight_masks, axis=0)
                 divisor = paddle.where(paddle.abs(divisor) < 1e-8, paddle.ones_like(divisor), divisor)
                 merge_tensor /= divisor
 
diff --git a/paddlenlp/mergekit/merge_model.py b/paddlenlp/mergekit/merge_model.py
index 937041149f32..03684a51cf89 100644
--- a/paddlenlp/mergekit/merge_model.py
+++ b/paddlenlp/mergekit/merge_model.py
@@ -11,9 +11,10 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import gc
 import json
+import math
 import os
+import shutil
 from multiprocessing import Process
 
 import numpy as np
@@ -22,6 +23,8 @@
 from safetensors import safe_open
 from safetensors.numpy import save_file
 
+from paddlenlp.peft import LoRAConfig
+from paddlenlp.utils import device_guard
 from paddlenlp.utils.env import (
     LORA_WEIGHTS_NAME,
     PADDLE_MASTER_WEIGHTS_NAME,
@@ -29,14 +32,14 @@
     SAFE_MASTER_WEIGHTS_INDEX_NAME,
     SAFE_MASTER_WEIGHTS_NAME,
     SAFE_PEFT_WEIGHTS_INDEX_NAME,
-    SAFE_PEFT_WEIGHTS_NAME,
     SAFE_WEIGHTS_INDEX_NAME,
     SAFE_WEIGHTS_NAME,
 )
+from paddlenlp.utils.log import logger
 from paddlenlp.utils.safetensors import fast_safe_open
 
 from .merge_method import MergeMethod
-from .merge_utils import divide_positions
+from .merge_utils import divide_lora_key_list, divide_positions
 from .sparsify_method import SparsifyMethod
 
 SPARSIFY_MERGE_MAPPING = {
@@ -57,6 +60,10 @@ def __init__(self, merge_config):
         self.is_peft = False
 
     def reset_merge_model(self, merge_config=None, merge_param_dict=None):
+        self.is_cpu = "cpu" in paddle.device.get_device()
+        if not self.is_cpu:
+            if dist.get_world_size() > 1 and not paddle.distributed.is_initialized():
+                dist.init_parallel_env()
         if merge_config is not None:
             self.merge_config = merge_config
         elif merge_param_dict is not None:
@@ -76,207 +83,269 @@ def reset_merge_model(self, merge_config=None, merge_param_dict=None):
         self.merge_method = MergeMethod(merge_config, sparsify_method)
 
     def merge_model(self):
+        if self.merge_config.lora_model_path is not None:
+            self.merge_lora_model()
+        else:
+            if self.merge_config.tensor_type == "np" and not self.is_cpu:
+                # Avoid memory allocated on GPU
+                with device_guard():
+                    self.mergekit()
+            else:
+                self.mergekit()
+        self.copy_file()
+
+    def copy_file(self):
+        if self.merge_config.copy_file_list is not None:
+            if self.merge_config.base_model_path is not None:
+                src_path = self.merge_config.base_model_path
+            else:
+                src_path = self.merge_config.model_path_list[0]
+            for file in self.merge_config.copy_file_list:
+                src_file = os.path.join(src_path, file)
+                dst_file = os.path.join(self.merge_config.output_path, file)
+                if os.path.isfile(src_file):
+                    shutil.copy2(src_file, dst_file)
+                else:
+                    logger.warning(f"Copy failed: {file} not found in {src_path}")
+
+    def mergekit(self):
+        # Check model file type
         file_type_list = []
         for model_path in self.merge_config.model_path_list:
             file_type_list.append(self.check_model_path(model_path))
         if self.merge_config.base_model_path is not None:
             file_type_list.append(self.check_model_path(self.merge_config.base_model_path))
-        if not all(file_type[1] is True for file_type in file_type_list) and not all(
-            file_type[1] is False for file_type in file_type_list
-        ):
-            raise ValueError("Please ensure that all models should be same type.")
-        if all(
-            file_type[0] == "safetensors" or file_type[0] == "safetensors_without_index"
-            for file_type in file_type_list
-        ):
+
+        # Merge model (distinguish between safetensors and pdparams)
+        if all(file_type == "safetensors" or file_type == "safetensors_without_index" for file_type in file_type_list):
             self.merge_safetensor_model(file_type_list)
         else:
             self.merge_mix_model(file_type_list)
 
     def merge_mix_model(self, file_type_list):
+        # Load model state dict
         state_dict_list = []
         for i, model_path in enumerate(self.merge_config.model_path_list):
-            state_dict_list.append(self.get_model_state_dict(model_path, file_type_list[i][0]))
+            state_dict_list.append(self.get_model_state_dict(model_path, file_type_list[i]))
         if self.merge_config.base_model_path is not None:
-            state_dict_list.append(self.get_model_state_dict(self.merge_config.base_model_path, file_type_list[-1][0]))
+            state_dict_list.append(self.get_model_state_dict(self.merge_config.base_model_path, file_type_list[-1]))
+
         if not all(state_dict_list[0].keys() == state_dict.keys() for state_dict in state_dict_list):
             raise ValueError("State dict keys mismatch. Please make sure you load the correct weight file")
-        if self.merge_config.base_model_path is not None:
-            base_state_dict = state_dict_list.pop()
-            base_file_type = file_type_list.pop()
+
+        # Merge state dict
         merge_state_dict = {}
-        total_size = 0
-        weight_map = {}
-        for key in state_dict_list[0].keys():
-            is_bf16 = False
-            tensor_list = []
-            for state_dict, file_type in zip(state_dict_list, file_type_list):
-                if file_type[0] == "pdparams":
-                    if str(state_dict[key].dtype) == "paddle.bfloat16":
-                        is_bf16 = True
-                        state_dict[key] = state_dict[key].astype("float32").numpy()
+        index = {"metadata": {"total_size": 0}, "weight_map": {}}
+
+        key_list = list(state_dict_list[file_type_list.index("pdparams")].keys())
+        model_num = len(state_dict_list)
+        rank = dist.get_rank()
+        positions = divide_positions(len(key_list), dist.get_world_size())
+        local_keys = key_list[positions[rank] : positions[rank + 1]]
+        for ii in range(len(positions) - 1):
+            shard_file = f"{self.merge_config.merge_prefix}-{ii+1:05d}-of-{dist.get_world_size():05d}.safetensors"
+            for key in key_list[positions[ii] : positions[ii + 1]]:
+                index["weight_map"][key] = shard_file
+                index["metadata"]["total_size"] += int(
+                    np.prod(state_dict_list[0][key].shape) * self.numpy_dtype_map[str(state_dict_list[0][key].dtype)]
+                )
+        for key in local_keys:
+            # Tensor preprocess
+            is_bf16 = str(state_dict_list[0][key].dtype) == "uint16"
+            tensor_list = [state_dict_list[i].pop(key) for i in range(model_num)]
+            tensor_mem = int(np.prod(tensor_list[0].shape) * self.numpy_dtype_map[str(tensor_list[0].dtype)]) / (
+                1024**3
+            )
+            if self.merge_config.tensor_type == "pd" and tensor_mem > self.merge_config.max_tensor_mem:
+                tensor_split_list = [
+                    np.array_split(tensor, self.merge_config.split_pieces, axis=0) for tensor in tensor_list
+                ]
+                merge_split = []
+                for sp in range(self.merge_config.split_pieces):
+                    tensor_list = [tensor_split[sp] for tensor_split in tensor_split_list]
+                    if is_bf16:
+                        tensor_list = [
+                            paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list
+                        ]
                     else:
-                        state_dict[key] = state_dict[key].numpy()
-                elif str(state_dict[key].dtype) == "uint16":
-                    is_bf16 = True
-                    state_dict[key] = paddle.to_tensor(state_dict[key], dtype="bfloat16").astype("float32").numpy()
-                tensor_list.append(state_dict[key])
-            if self.merge_config.base_model_path is not None:
-                if base_file_type[0] == "pdparams":
-                    if str(base_state_dict[key].dtype) == "paddle.bfloat16":
-                        base_state_dict[key] = base_state_dict[key].astype("float32").numpy()
+                        tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list]
+                    if self.merge_config.base_model_path is not None:
+                        base_tensor = tensor_list.pop()
+                        tensor_list = [tensor - base_tensor for tensor in tensor_list]
+                    merge_tensor = self.merge_method.merge(tensor_list)
+                    if self.merge_config.base_model_path is not None:
+                        merge_tensor += base_tensor
+                    if is_bf16:
+                        merge_split.append(merge_tensor.astype("bfloat16").numpy())
                     else:
-                        base_state_dict[key] = base_state_dict[key].numpy()
-                elif str(base_state_dict[key].dtype) == "uint16":
-                    base_state_dict[key] = (
-                        paddle.to_tensor(base_state_dict[key], dtype="bfloat16").astype("float32").numpy()
-                    )
-                tensor_list = [tensor - base_state_dict[key] for tensor in tensor_list]
-            merge_state_dict[key] = self.merge_method.merge(tensor_list)
-            if self.merge_config.base_model_path is not None:
-                merge_state_dict[key] += base_state_dict[key]
-            # dtype==bfloat16: numpy(float32) -> paddle(float32) -> paddle(bfloat16) -> numpy(uint16)
-            if is_bf16:
-                merge_state_dict[key] = (
-                    paddle.to_tensor(merge_state_dict[key], dtype="float32").astype("bfloat16").numpy()
-                )
-            total_size += np.prod(merge_state_dict[key].shape) * self.numpy_dtype_map[str(merge_state_dict[key].dtype)]
-            if self.merge_config.merge_prefix == "model" and file_type_list[0][1] is True:
-                weight_map[key] = "peft_model-00001-of-00001.safetensors"
+                        merge_split.append(merge_tensor.numpy())
+                merge_state_dict[key] = np.concatenate(merge_split, axis=0)
             else:
-                weight_map[key] = f"{self.merge_config.merge_prefix}-00001-of-00001.safetensors"
-        # save safetensor file
-        if self.merge_config.merge_prefix == "model" and file_type_list[0][1] is True:
-            save_file(
-                merge_state_dict,
-                os.path.join(self.merge_config.output_path, "peft_model-00001-of-00001.safetensors"),
-                metadata={"format": "np"},
-            )
-        else:
-            save_file(
-                merge_state_dict,
-                os.path.join(
-                    self.merge_config.output_path, f"{self.merge_config.merge_prefix}-00001-of-00001.safetensors"
-                ),
-                metadata={"format": "np"},
-            )
-        # save safe index file
-        index = {"metadata": {"total_size": int(total_size)}, "weight_map": weight_map}
-        if self.merge_config.merge_prefix == "model" and file_type_list[0][1] is True:
-            save_index_file = os.path.join(self.merge_config.output_path, SAFE_PEFT_WEIGHTS_INDEX_NAME)
-        else:
+                if self.merge_config.tensor_type == "pd":
+                    if is_bf16:
+                        tensor_list = [
+                            paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list
+                        ]
+                    else:
+                        tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list]
+                elif self.merge_config.tensor_type == "np" and is_bf16:
+                    tensor_list = [
+                        paddle.Tensor(tensor, zero_copy=True).astype("float32").numpy() for tensor in tensor_list
+                    ]
+
+                if self.merge_config.base_model_path is not None:
+                    base_tensor = tensor_list.pop()
+                    tensor_list = [tensor - base_tensor for tensor in tensor_list]
+                merge_tensor = self.merge_method.merge(tensor_list)
+                if self.merge_config.base_model_path is not None:
+                    merge_tensor += base_tensor
+                if self.merge_config.tensor_type == "pd":
+                    if is_bf16:
+                        merge_state_dict[key] = merge_tensor.astype("bfloat16").numpy()
+                    else:
+                        merge_state_dict[key] = merge_tensor.numpy()
+                elif self.merge_config.tensor_type == "np" and is_bf16:
+                    # dtype==bfloat16: numpy(float32) -> paddle(float32) -> paddle(bfloat16) -> numpy(uint16)
+                    merge_state_dict[key] = paddle.Tensor(merge_tensor, zero_copy=True).astype("bfloat16").numpy()
+
+        # Save safetensor file
+        save_file(
+            merge_state_dict,
+            os.path.join(
+                self.merge_config.output_path,
+                f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors",
+            ),
+            metadata={"format": "np"},
+        )
+        # Save index file & merge config file
+        if paddle.distributed.get_rank() == 0:
             save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name())
-        with open(save_index_file, "w", encoding="utf-8") as f:
-            content = json.dumps(index, indent=2) + "\n"
-            f.write(content)
-        # save merge config file
-        self.merge_config.save_pretrained(self.merge_config.output_path)
-        del state_dict_list
-        del merge_state_dict
-        if self.merge_config.base_model_path is not None:
-            del base_state_dict
-        gc.collect()
+            with open(save_index_file, "w", encoding="utf-8") as f:
+                f.write(json.dumps(index, indent=2) + "\n")
+            self.merge_config.save_pretrained(self.merge_config.output_path)
 
-    def get_model_state_dict(self, model_path, file_type):
+    def get_model_state_dict(self, model_path, file_type, key_list=None, file=None):
         if file_type == "safetensors":
             state_dict = {}
             with open(os.path.join(model_path, self.safe_index_name()), "r", encoding="utf-8") as f:
                 index = json.load(f)
-            for key in index["weight_map"].keys():
-                with fast_safe_open(
-                    os.path.join(model_path, index["weight_map"][key]),
-                    framework="np",
-                ) as f:
-                    state_dict[key] = f.get_tensor(key)
+            if file is not None:
+                with fast_safe_open(os.path.join(model_path, file), framework="np") as f:
+                    for k in f.keys():
+                        state_dict[k] = f.get_tensor(k)
+            elif key_list is None:
+                files = set(index["weight_map"].values())
+                for file in files:
+                    with fast_safe_open(os.path.join(model_path, file), framework="np") as f:
+                        for k in f.keys():
+                            state_dict[k] = f.get_tensor(k)
+            else:
+                file_map = {}
+                for key in key_list:
+                    if index["weight_map"][key] not in file_map:
+                        file_map[index["weight_map"][key]] = [key]
+                    else:
+                        file_map[index["weight_map"][key]].append(key)
+                for file in file_map.keys():
+                    with fast_safe_open(os.path.join(model_path, file), framework="np") as f:
+                        for k in file_map[file]:
+                            state_dict[k] = f.get_tensor(k)
         elif file_type == "safetensors_without_index":
             state_dict = {}
             with fast_safe_open(os.path.join(model_path, self.safe_weight_name()), framework="numpy") as f:
-                for k in f.keys():
+                tgt_key_list = f.keys() if key_list is None else key_list
+                for k in tgt_key_list:
                     state_dict[k] = f.get_tensor(k)
         elif file_type == "pdparams":
-            state_dict = paddle.load(os.path.join(model_path, self.weight_name()))
+            state_dict = np.load(os.path.join(model_path, self.weight_name()), allow_pickle=True)
+            if "StructuredToParameterName@@" in state_dict.keys():
+                state_dict.pop("StructuredToParameterName@@")
+        elif file_type == "lora_pdparams":
+            state_dict = np.load(os.path.join(model_path, LORA_WEIGHTS_NAME), allow_pickle=True)
+        elif file_type == "lora_safetensors":
+            state_dict = {}
+            with open(os.path.join(model_path, SAFE_PEFT_WEIGHTS_INDEX_NAME), "r", encoding="utf-8") as f:
+                index = json.load(f)
+            files = set(index["weight_map"].values())
+            for file in files:
+                with fast_safe_open(os.path.join(model_path, file), framework="np") as f:
+                    for k in f.keys():
+                        state_dict[k] = f.get_tensor(k)
         else:
             raise ValueError(f"Unsupported file_type: {file_type}")
         return state_dict
 
-    def create_safetensor_index(self, model_path):
-        weight_map = {}
-        total_size = 0
-
-        with safe_open(os.path.join(model_path, self.safe_weight_name()), framework="numpy") as f:
-            for key in f.keys():
-                tensor = f.get_tensor(key)
-                total_size += np.prod(tensor.shape) * self.numpy_dtype_map[str(tensor.dtype)]
-                weight_map[key] = self.safe_weight_name()
-        index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
+    def get_safetensor_index(self, model_path, file_type):
+        if file_type == "safetensors":
+            with open(os.path.join(model_path, self.safe_index_name()), "r", encoding="utf-8") as f:
+                index = json.load(f)
+        elif file_type == "safetensors_without_index":
+            weight_map = {}
+            total_size = 0
+            with safe_open(os.path.join(model_path, self.safe_weight_name()), framework="numpy") as f:
+                for key in f.keys():
+                    tensor = f.get_tensor(key)
+                    total_size += int(np.prod(tensor.shape) * self.numpy_dtype_map[str(tensor.dtype)])
+                    weight_map[key] = self.safe_weight_name()
+            index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
         return index
 
     def merge_safetensor_model(self, file_type_list):
-        use_gpu = self.merge_config.device == "gpu"
-
-        if use_gpu:
-            rank = dist.get_rank()
-        if dist.get_world_size() > 1:
-            dist.init_parallel_env()
-
         # Load index
         index_list = []
-        model_path_list = self.merge_config.model_path_list
+        model_path_list = self.merge_config.model_path_list.copy()
         if self.merge_config.base_model_path is not None:
             model_path_list += [self.merge_config.base_model_path]
 
         for model_path, file_type in zip(model_path_list, file_type_list):
-            if file_type[0] == "safetensors":
-                if file_type[1] is False:
-                    with open(os.path.join(model_path, self.safe_index_name()), "r", encoding="utf-8") as f:
-                        index_list.append(json.load(f))
-                else:
-                    with open(os.path.join(model_path, SAFE_PEFT_WEIGHTS_INDEX_NAME), "r", encoding="utf-8") as f:
-                        index_list.append(json.load(f))
-            else:
-                index = self.create_safetensor_index(model_path)
-                index_list.append(index)
+            index_list.append(self.get_safetensor_index(model_path, file_type))
 
         # Check index
         if not all(index_list[0]["metadata"]["total_size"] == index["metadata"]["total_size"] for index in index_list):
             raise ValueError("Weights total_size mismatch. Please make sure you load the correct weight file")
         if not all(index_list[0]["weight_map"].keys() == index["weight_map"].keys() for index in index_list):
             raise ValueError("Weights weight_map mismatch. Please make sure you load the correct weight file")
-
         # Initialize new index
         index = {}
         index["metadata"] = index_list[0]["metadata"]
         index["metadata"]["total_size"] = int(index["metadata"]["total_size"])
         index["weight_map"] = {}
-
+        num = self.merge_config.n_process if self.is_cpu else dist.get_world_size()
         key_list = list(index_list[0]["weight_map"].keys())
-        if use_gpu:
-            positions = divide_positions(len(key_list), dist.get_world_size())
-        else:
-            positions = divide_positions(len(key_list), self.merge_config.n_process)
-
-        if use_gpu:
-            start_idx = positions[rank]
-            end_idx = positions[rank + 1] if rank + 1 < len(positions) else len(key_list)
-            local_keys = key_list[start_idx:end_idx]
-            # use index_gpu to preserve index
-            index_gpu = index
-            for i in range(len(positions) - 1):
+        positions = divide_positions(len(key_list), num)
+        if not self.is_cpu:
+            rank = dist.get_rank()
+            file_list = sorted(list(set(index_list[0]["weight_map"].values())))
+            if file_type_list[0] == "safetensors" and len(file_list) >= num:
+                positions = divide_positions(len(file_list), num)
+                index["weight_map"] = index_list[0]["weight_map"]
+                file_map = {}
+                for key in key_list:
+                    if index["weight_map"][key] not in file_map:
+                        file_map[index["weight_map"][key]] = [key]
+                    else:
+                        file_map[index["weight_map"][key]].append(key)
+                for shard_file in file_list[positions[rank] : positions[rank + 1]]:
+                    if self.merge_config.tensor_type == "np":
+                        self.shard_merge_np(file_map[shard_file], index_list, shard_file)
+                    else:
+                        self.shard_merge_pd(file_map[shard_file], index_list, shard_file)
+            else:
+                local_keys = key_list[positions[rank] : positions[rank + 1]]
                 shard_file = (
-                    f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{self.merge_config.n_process:05d}.safetensors"
+                    f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors"
                 )
-                for k in key_list[positions[i] : positions[i + 1]]:
-                    index_gpu["weight_map"][k] = shard_file
-            shard_file = f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors"
-            for k in local_keys:
-                index["weight_map"][k] = shard_file
+                if self.merge_config.tensor_type == "np":
+                    self.shard_merge_np(local_keys, index_list, shard_file)
+                else:
+                    self.shard_merge_pd(local_keys, index_list, shard_file)
 
-            if self.merge_config.tensor_type == "np":
-                ValueError(f"Tensor type '{self.merge_config.tensor_type}' should be 'pd' when using GPU.")
-            else:
-                self.shard_merge_pd(local_keys, index_list, shard_file)
-            if dist.get_world_size() > 1:
-                dist.barrier()
+                for i in range(len(positions) - 1):
+                    shard_file = (
+                        f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{dist.get_world_size():05d}.safetensors"
+                    )
+                    for k in key_list[positions[i] : positions[i + 1]]:
+                        index["weight_map"][k] = shard_file
         else:
             threads = []
             for i in range(len(positions) - 1):
@@ -299,17 +368,11 @@ def merge_safetensor_model(self, file_type_list):
                 t.start()
             for t in threads:
                 t.join()
-
         # Save safe index file
-        if not use_gpu or (use_gpu and rank == 0):
+        if paddle.distributed.get_rank() == 0:
             save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name())
             with open(save_index_file, "w", encoding="utf-8") as f:
-                if not use_gpu:
-                    content = json.dumps(index, indent=2) + "\n"
-                else:
-                    content = json.dumps(index_gpu, indent=2) + "\n"
-                f.write(content)
-            self.merge_config.save_pretrained(self.merge_config.output_path)
+                f.write(json.dumps(index, indent=2) + "\n")
 
     def shard_merge_np(
         self,
@@ -320,14 +383,13 @@ def shard_merge_np(
         merge_state_dict = {}
         for k in key_list:
             tensor_list = []
-
             for i, model_path in enumerate(self.merge_config.model_path_list):
                 with fast_safe_open(os.path.join(model_path, index_list[i]["weight_map"][k]), framework="np") as w:
                     tensor = w.get_tensor(k)
                     dtype = tensor.dtype
                     # dtype==bfloat16: numpy(uint16) -> paddle(bfloat16) -> paddle(float32) -> numpy(float32)
                     if tensor.dtype == np.uint16:
-                        tensor = paddle.to_tensor(tensor, dtype="bfloat16").astype("float32").numpy()
+                        tensor = paddle.Tensor(tensor, zero_copy=True).astype("float32").numpy()
                     tensor_list.append(tensor)
             if self.merge_config.base_model_path is not None:
                 with fast_safe_open(
@@ -336,25 +398,19 @@ def shard_merge_np(
                 ) as w:
                     base_tensor = w.get_tensor(k)
                     if base_tensor.dtype == np.uint16:
-                        base_tensor = paddle.to_tensor(base_tensor, dtype="bfloat16").astype("float32").numpy()
+                        base_tensor = paddle.Tensor(base_tensor, zero_copy=True).astype("float32").numpy()
                 tensor_list = [tensor - base_tensor for tensor in tensor_list]
             merge_state_dict[k] = self.merge_method.merge(tensor_list)
             if self.merge_config.base_model_path is not None:
                 merge_state_dict[k] += base_tensor
             # dtype==bfloat16: numpy(float32) -> paddle(float32) -> paddle(bfloat16) -> numpy(uint16)
             if dtype == np.uint16:
-                merge_state_dict[k] = paddle.to_tensor(merge_state_dict[k], dtype="float32").astype("bfloat16").numpy()
-            del tensor_list
-            if self.merge_config.base_model_path is not None:
-
-                del base_tensor
+                merge_state_dict[k] = paddle.Tensor(merge_state_dict[k], zero_copy=True).astype("bfloat16").numpy()
         save_file(
             merge_state_dict,
             os.path.join(self.merge_config.output_path, shard_file),
             metadata={"format": "np"},
         )
-        del merge_state_dict
-        gc.collect()
 
     def shard_merge_pd(
         self,
@@ -365,81 +421,85 @@ def shard_merge_pd(
         merge_state_dict = {}
         for k in key_list:
             tensor_list = []
-            is_bf16 = False
             for i, model_path in enumerate(self.merge_config.model_path_list):
                 with fast_safe_open(os.path.join(model_path, index_list[i]["weight_map"][k]), framework="np") as w:
-                    tensor = w.get_tensor(k)
-                    tensor = paddle.to_tensor(tensor)
-                    if tensor.dtype == paddle.bfloat16 and self.merge_config.device == "cpu":
-                        is_bf16 = True
-                        tensor = tensor.astype("float32")
-                    tensor_list.append(tensor)
+                    tensor_list.append(w.get_tensor(k))
             if self.merge_config.base_model_path is not None:
                 with fast_safe_open(
                     os.path.join(self.merge_config.base_model_path, index_list[-1]["weight_map"][k]),
                     framework="np",
                 ) as w:
-                    base_tensor = w.get_tensor(k)
-                    base_tensor = paddle.to_tensor(base_tensor)
-                tensor_list = [tensor - base_tensor for tensor in tensor_list]
-
-            merge_tensor = self.merge_method.merge(tensor_list)
-
-            if self.merge_config.base_model_path is not None:
-                merge_tensor += base_tensor
-            if is_bf16:
-                merge_tensor = merge_tensor.astype("bfloat16")
-            merge_state_dict[k] = merge_tensor.numpy()
-
-            del tensor_list
-            paddle.device.cuda.empty_cache()
-            if self.merge_config.base_model_path is not None:
-                del base_tensor
-                paddle.device.cuda.empty_cache()
-
+                    tensor_list.append(w.get_tensor(k))
+            is_bf16 = str(tensor_list[0].dtype) == "uint16"
+            tensor_mem = int(np.prod(tensor_list[0].shape) * self.numpy_dtype_map[str(tensor_list[0].dtype)]) / (
+                1024**3
+            )
+            if tensor_mem > self.merge_config.max_tensor_mem:
+                tensor_split_list = [
+                    np.array_split(tensor, self.merge_config.split_pieces, axis=0) for tensor in tensor_list
+                ]
+                merge_split = []
+                for sp in range(self.merge_config.split_pieces):
+                    tensor_list = [tensor_split[sp] for tensor_split in tensor_split_list]
+                    if is_bf16:
+                        tensor_list = [
+                            paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list
+                        ]
+                    else:
+                        tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list]
+                    if self.merge_config.base_model_path is not None:
+                        base_tensor = tensor_list.pop()
+                        tensor_list = [tensor - base_tensor for tensor in tensor_list]
+                    merge_tensor = self.merge_method.merge(tensor_list)
+                    if self.merge_config.base_model_path is not None:
+                        merge_tensor += base_tensor
+                    if is_bf16:
+                        merge_split.append(merge_tensor.astype("bfloat16").numpy())
+                    else:
+                        merge_split.append(merge_tensor.numpy())
+                merge_state_dict[k] = np.concatenate(merge_split, axis=0)
+            else:
+                if is_bf16:
+                    tensor_list = [paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list]
+                else:
+                    tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list]
+                if self.merge_config.base_model_path is not None:
+                    base_tensor = tensor_list.pop()
+                    tensor_list = [tensor - base_tensor for tensor in tensor_list]
+                merge_tensor = self.merge_method.merge(tensor_list)
+                if self.merge_config.base_model_path is not None:
+                    merge_tensor += base_tensor
+                if is_bf16:
+                    merge_state_dict[k] = merge_tensor.astype("bfloat16").numpy()
+                else:
+                    merge_state_dict[k] = merge_tensor.numpy()
         save_file(
             merge_state_dict,
             os.path.join(self.merge_config.output_path, shard_file),
             metadata={"format": "np"},
         )
-        del merge_state_dict
-        paddle.device.cuda.empty_cache()
-        gc.collect()
 
-    def check_model_path(self, model_path):
+    def check_model_path(self, model_path, lora_merge=False):
         if os.path.exists(os.path.join(model_path, self.safe_index_name())):
-            with open(os.path.join(model_path, self.safe_index_name()), "r", encoding="utf-8") as f:
-                index = json.load(f)
-                safe_file_list = list(set(index["weight_map"][k] for k in index["weight_map"]))
-                for i in range(len(safe_file_list)):
-                    if os.path.exists(os.path.join(model_path, safe_file_list[i])):
-                        continue
-                    else:
-                        ValueError(f"Not found {os.path.join(model_path, safe_file_list[i])}.")
-            file_type = ["safetensors", False]
+            file_type = "safetensors"
         elif os.path.exists(os.path.join(model_path, self.safe_weight_name())):
-            file_type = ["safetensors_without_index", False]
+            file_type = "safetensors_without_index"
         elif os.path.exists(os.path.join(model_path, self.weight_name())):
-            file_type = ["pdparams", False]
+            file_type = "pdparams"
+        else:
+            raise ValueError(
+                f"Please check path {model_path} is correct. Support safetensors and pdparams only in complete parameter format (not TP or PP format) only."
+            )
+        return file_type
 
-        # lora
-        elif os.path.exists(os.path.join(model_path, SAFE_PEFT_WEIGHTS_INDEX_NAME)):
-            with open(os.path.join(model_path, SAFE_PEFT_WEIGHTS_INDEX_NAME), "r", encoding="utf-8") as f:
-                index = json.load(f)
-                safe_file_list = list(set(index["weight_map"][k] for k in index["weight_map"]))
-                for i in range(len(safe_file_list)):
-                    if os.path.exists(os.path.join(model_path, safe_file_list[i])):
-                        continue
-                    else:
-                        ValueError(f"Not found {os.path.join(model_path, safe_file_list[i])}.")
-            file_type = ["safetensors", True]
-        elif os.path.exists(os.path.join(model_path, SAFE_PEFT_WEIGHTS_NAME)):
-            file_type = ["safetensors_without_index", True]
+    def check_lora_model_path(self, model_path):
+        if os.path.exists(os.path.join(model_path, SAFE_PEFT_WEIGHTS_INDEX_NAME)):
+            file_type = "lora_safetensors"
         elif os.path.exists(os.path.join(model_path, LORA_WEIGHTS_NAME)):
-            file_type = ["pdparams", True]
+            file_type = "lora_pdparams"
         else:
             raise ValueError(
-                f"Please check path {model_path} is correct. Support safetensors and pdparams in complete parameter format (not TP or PP format) only."
+                f"Please check lora path {model_path} is correct. Support safetensors and pdparams only in complete parameter format (not TP or PP format) only."
             )
         return file_type
 
@@ -460,3 +520,198 @@ def safe_index_name(self):
             return SAFE_WEIGHTS_INDEX_NAME
         else:
             return SAFE_MASTER_WEIGHTS_INDEX_NAME
+
+    def merge_lora_model(self):
+        # Check model file type
+        file_type_list = []
+        file_type_list.append(self.check_lora_model_path(self.merge_config.lora_model_path))
+        file_type_list.append(self.check_model_path(self.merge_config.base_model_path))
+        # Merge model (distinguish between safetensors and pdparams)
+        if "safetensors" in file_type_list[-1]:
+            self.merge_safetensor_lora_model(file_type_list)
+        else:
+            self.merge_pdparams_lora_model(file_type_list)
+
+    def shard_lora_merge(self, base_index, shard_file, lora_config, file_type_list, key_list=None, file=None):
+        merge_state_dict = {}
+        base_state_dict = self.get_model_state_dict(
+            self.merge_config.base_model_path, file_type_list[1], key_list=key_list, file=file
+        )
+        lora_state_dict = self.get_model_state_dict(self.merge_config.lora_model_path, file_type_list[0])
+        if not lora_config.rslora:
+            scaling = lora_config.lora_alpha / lora_config.r
+        else:
+            scaling = lora_config.lora_alpha / math.sqrt(lora_config.r)
+
+        model_key_list = list(base_state_dict.keys())
+        for k in model_key_list:
+            if lora_state_dict is not None and k in lora_state_dict.keys():
+                tensor = lora_state_dict.pop(k)
+            else:
+                tensor = base_state_dict.pop(k)
+            if "weight" in k:
+                lora_A_key, lora_B_key = k.replace("weight", "lora_A"), k.replace("weight", "lora_B")
+                lora_A_tensor = None
+                if lora_state_dict is not None and lora_A_key in lora_state_dict.keys():
+                    lora_A_tensor, lora_B_tensor = lora_state_dict.pop(lora_A_key), lora_state_dict.pop(lora_B_key)
+                    is_bf16 = tensor.dtype == np.uint16
+                    tensor = paddle.Tensor(tensor, zero_copy=True)
+                    lora_A_tensor = paddle.Tensor(lora_A_tensor, zero_copy=True)
+                    lora_B_tensor = paddle.Tensor(lora_B_tensor, zero_copy=True)
+                    if self.is_cpu and is_bf16:
+                        tensor = tensor.astype("float32")
+                        lora_A_tensor = lora_A_tensor.astype("float32")
+                        lora_B_tensor = lora_B_tensor.astype("float32")
+                        tensor += lora_A_tensor @ lora_B_tensor * scaling
+                        tensor = tensor.astype("bfloat16").numpy()
+                    else:
+                        tensor += lora_A_tensor @ lora_B_tensor * scaling
+                        tensor = tensor.numpy()
+            merge_state_dict[k] = tensor
+        save_file(
+            merge_state_dict,
+            os.path.join(self.merge_config.output_path, shard_file),
+            metadata={"format": "np"},
+        )
+
+    def merge_safetensor_lora_model(self, file_type_list):
+        # Load index
+        base_index = self.get_safetensor_index(self.merge_config.base_model_path, file_type_list[-1])
+        lora_config = LoRAConfig.from_pretrained(self.merge_config.lora_model_path)
+
+        # Initialize new index
+        index = {}
+        index["metadata"] = base_index["metadata"]
+        index["metadata"]["total_size"] = int(index["metadata"]["total_size"])
+        index["weight_map"] = {}
+
+        # LoRA Merge
+        key_list = list(base_index["weight_map"].keys())
+        if not self.is_cpu:
+            rank = dist.get_rank()
+            file_list = sorted(list(set(base_index["weight_map"].values())))
+            if file_type_list[-1] == "safetensors" and len(file_list) >= dist.get_world_size():
+                positions = divide_positions(len(file_list), dist.get_world_size())
+                for shard_file in file_list[positions[rank] : positions[rank + 1]]:
+                    self.shard_lora_merge(base_index, shard_file, lora_config, file_type_list, file=shard_file)
+                index["weight_map"] = base_index["weight_map"]
+            else:
+                divided_key_list = divide_lora_key_list(key_list, dist.get_world_size(), lora_config)
+                local_keys = divided_key_list[rank]
+                shard_file = (
+                    f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors"
+                )
+                self.shard_lora_merge(base_index, shard_file, lora_config, file_type_list, key_list=local_keys)
+                for i in range(len(divided_key_list)):
+                    shard_file = (
+                        f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{dist.get_world_size():05d}.safetensors"
+                    )
+                    for k in divided_key_list[i]:
+                        index["weight_map"][k] = shard_file
+        else:
+            divided_key_list = divide_lora_key_list(key_list, self.merge_config.n_process, lora_config)
+            threads = []
+            for i in range(len(divided_key_list)):
+                shard_file = (
+                    f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{self.merge_config.n_process:05d}.safetensors"
+                )
+                t = Process(
+                    target=self.shard_lora_merge,
+                    args=(
+                        base_index,  # base index
+                        shard_file,  # shard_file name
+                        lora_config,
+                        file_type_list,
+                        divided_key_list[i],  # key_list
+                    ),
+                )
+                threads.append(t)
+                for k in divided_key_list[i]:
+                    index["weight_map"][k] = shard_file
+
+            for t in threads:
+                t.start()
+            for t in threads:
+                t.join()
+
+        # Save safe index file
+        if paddle.distributed.get_rank() == 0:
+            save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name())
+            with open(save_index_file, "w", encoding="utf-8") as f:
+                f.write(json.dumps(index, indent=2) + "\n")
+            self.merge_config.save_pretrained(self.merge_config.output_path)
+
+    def merge_pdparams_lora_model(self, file_type_list):
+        # Load & check state dict
+        lora_state_dict = self.get_model_state_dict(self.merge_config.lora_model_path, file_type_list[0])
+        base_state_dict = self.get_model_state_dict(self.merge_config.base_model_path, file_type_list[1])
+        for key in lora_state_dict.keys():
+            if "lora_A" in key:
+                if key.replace("lora_A", "lora_B") not in lora_state_dict.keys():
+                    raise ValueError(f"{key} is not paired with {key.replace('lora_A', 'lora_B')}")
+                if key.replace("lora_A", "weight") not in base_state_dict.keys():
+                    raise ValueError(f'{key.replace("lora_A", "weight")} does not exist in base model.')
+
+        # Load lora config
+        lora_config = LoRAConfig.from_pretrained(self.merge_config.lora_model_path)
+        if not lora_config.rslora:
+            scaling = lora_config.lora_alpha / lora_config.r
+        else:
+            scaling = lora_config.lora_alpha / math.sqrt(lora_config.r)
+
+        # Create index
+        merge_state_dict = {}
+        index = {"metadata": {"total_size": 0}, "weight_map": {}}
+        key_list = list(base_state_dict.keys())
+        positions = divide_positions(len(key_list), dist.get_world_size())
+        for ii in range(len(positions) - 1):
+            shard_file = f"{self.merge_config.merge_prefix}-{ii+1:05d}-of-{dist.get_world_size():05d}.safetensors"
+            for key in key_list[positions[ii] : positions[ii + 1]]:
+                index["weight_map"][key] = shard_file
+                index["metadata"]["total_size"] += int(
+                    np.prod(base_state_dict[key].shape) * self.numpy_dtype_map[str(base_state_dict[key].dtype)]
+                )
+
+        # Merge state dict
+        rank = dist.get_rank()
+        local_keys = key_list[positions[rank] : positions[rank + 1]]
+        for k in local_keys:
+            if k in lora_state_dict.keys():
+                tensor = lora_state_dict[k]
+            else:
+                tensor = base_state_dict[k]
+            if "weight" in k:
+                lora_A_key, lora_B_key = k.replace("weight", "lora_A"), k.replace("weight", "lora_B")
+                if lora_A_key in lora_state_dict.keys():
+                    lora_A_tensor = lora_state_dict[lora_A_key]
+                    lora_B_tensor = lora_state_dict[lora_B_key]
+                    is_bf16 = tensor.dtype == np.uint16
+                    tensor = paddle.Tensor(tensor, zero_copy=True)
+                    lora_A_tensor = paddle.Tensor(lora_A_tensor, zero_copy=True)
+                    lora_B_tensor = paddle.Tensor(lora_B_tensor, zero_copy=True)
+                    if self.is_cpu and is_bf16:
+                        tensor = tensor.astype("float32")
+                        lora_A_tensor = lora_A_tensor.astype("float32")
+                        lora_B_tensor = lora_B_tensor.astype("float32")
+                        tensor += lora_A_tensor @ lora_B_tensor * scaling
+                        tensor = tensor.astype("bfloat16")
+                    else:
+                        tensor += lora_A_tensor @ lora_B_tensor * scaling
+                    tensor = tensor.numpy()
+            merge_state_dict[k] = tensor
+
+        # Save safetensor file
+        save_file(
+            merge_state_dict,
+            os.path.join(
+                self.merge_config.output_path,
+                f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors",
+            ),
+            metadata={"format": "np"},
+        )
+        # Save index file & merge config file
+        if paddle.distributed.get_rank() == 0:
+            save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name())
+            with open(save_index_file, "w", encoding="utf-8") as f:
+                f.write(json.dumps(index, indent=2) + "\n")
+            self.merge_config.save_pretrained(self.merge_config.output_path)
diff --git a/paddlenlp/mergekit/merge_utils.py b/paddlenlp/mergekit/merge_utils.py
index c96a9ad2fe76..5e0fcf80741b 100644
--- a/paddlenlp/mergekit/merge_utils.py
+++ b/paddlenlp/mergekit/merge_utils.py
@@ -11,6 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import re
 
 
 def divide_positions(m, n):
@@ -29,3 +30,44 @@ def divide_positions(m, n):
             positions.append(positions[-1] + base_value)
     positions.append(m)
     return positions
+
+
+def divide_lora_key_list(key_list, n, lora_config):
+    lora_key = []
+    other_key = []
+    for module_name in key_list:
+        if (
+            any(re.fullmatch(target_module, module_name) for target_module in lora_config.target_modules)
+            and "weight" in module_name
+        ):
+            lora_key.append(module_name)
+        else:
+            other_key.append(module_name)
+    lora_positions = divide_positions(len(lora_key), n)
+    other_positions = divide_positions(len(other_key), n)
+    divided_key_list = []
+    for i in range(len(lora_positions) - 1):
+        divided_key = (
+            lora_key[lora_positions[i] : lora_positions[i + 1]]
+            + other_key[other_positions[i] : other_positions[i + 1]]
+        )
+        divided_key_list.append(divided_key)
+    return divided_key_list
+
+
+def divide_safetensor_key_list(weight_map, n):
+    file_map = {}
+    for key in weight_map:
+        if weight_map[key] in file_map:
+            file_map[weight_map[key]].append(key)
+        else:
+            file_map[weight_map[key]] = [key]
+    file_list = list(file_map.keys())
+    p = divide_positions(len(file_list), n)
+    key_list = []
+    positions = [0]
+    for i in range(n):
+        for file in file_list[p[i] : p[i + 1]]:
+            key_list += file_map[file]
+        positions.append(len(key_list))
+    return key_list, positions
diff --git a/paddlenlp/mergekit/sparsify_method.py b/paddlenlp/mergekit/sparsify_method.py
index 8d711b4573f2..c13e6b97b8d6 100644
--- a/paddlenlp/mergekit/sparsify_method.py
+++ b/paddlenlp/mergekit/sparsify_method.py
@@ -33,24 +33,20 @@ def sparsify(self, tensor):
 
     def dare(self, tensor):
         if self.merge_config.tensor_type == "np":
-            mask = np.random.binomial(1, self.merge_config.reserve_p, size=tensor.shape).astype(tensor.dtype)
-            tensor *= mask
+            tensor *= (np.random.rand(*tensor.shape) < self.merge_config.reserve_p).astype(tensor.dtype)
             if self.merge_config.rescale:
                 tensor /= self.merge_config.reserve_p
             return tensor
         elif self.merge_config.tensor_type == "pd":
-            mask = paddle.bernoulli(paddle.full(tensor.shape, self.merge_config.reserve_p, dtype=tensor.dtype))
-
-            tensor *= mask
-            if self.merge_config.rescale:
-                tensor /= self.merge_config.reserve_p
+            mode = "upscale_in_train" if self.merge_config.rescale else "downscale_in_infer"
+            tensor = paddle.nn.functional.dropout(tensor, p=1 - self.merge_config.reserve_p, mode=mode, training=True)
             return tensor
         else:
             raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}")
 
     def magprune(self, tensor):
         if self.merge_config.tensor_type == "np":
-            if np.all(tensor == 0):
+            if not np.any(tensor != 0):
                 return tensor
             drop_p = 1 - self.merge_config.reserve_p
             # 1: ranking(descending)
@@ -72,7 +68,7 @@ def magprune(self, tensor):
                 tensor /= 1 - probs
             return tensor
         elif self.merge_config.tensor_type == "pd":
-            if paddle.all(tensor == 0):
+            if not paddle.any(tensor != 0):
                 return tensor
             drop_p = 1 - self.merge_config.reserve_p
             abs_tensor = paddle.abs(tensor)
@@ -84,8 +80,7 @@ def magprune(self, tensor):
             probs = probs * self.merge_config.epsilon / tensor.numel()
             p_min = drop_p - self.merge_config.epsilon / 2
             probs += p_min
-
-            mask = paddle.cast(paddle.bernoulli(1 - probs), tensor.dtype)
+            mask = paddle.bernoulli(1 - probs).astype(tensor.dtype)
             tensor *= mask
             if self.merge_config.rescale:
                 tensor /= 1 - probs
@@ -96,7 +91,6 @@ def magprune(self, tensor):
     def trim(self, tensor):
         if self.merge_config.tensor_type == "np":
             shape = tensor.shape
-            org_sum = np.sum(np.abs(tensor))
             tensor = tensor.flatten()
             abs_tensor = np.abs(tensor)
             threshold = np.quantile(abs_tensor, 1 - self.merge_config.reserve_p)
@@ -111,14 +105,11 @@ def trim(self, tensor):
                 tensor[abs_tensor < threshold] = 0
             return tensor.reshape(shape)
         elif self.merge_config.tensor_type == "pd":
-            shape = tensor.shape
-            org_sum = paddle.sum(paddle.abs(tensor))
             abs_tensor = paddle.abs(tensor)
             threshold = paddle.quantile(abs_tensor, 1 - self.merge_config.reserve_p)
-            mask = paddle.cast(abs_tensor >= threshold, tensor.dtype)
-            tensor = tensor * mask
-
+            tensor = paddle.where(abs_tensor < threshold, paddle.zeros_like(tensor), tensor)
             if self.merge_config.rescale:
+                org_sum = paddle.sum(abs_tensor)
                 new_sum = paddle.sum(paddle.abs(tensor))
                 if org_sum >= 1e-8 and new_sum >= 1e-8:
                     tensor *= org_sum / new_sum
diff --git a/paddlenlp/server/predictor.py b/paddlenlp/server/predictor.py
index 45d803e4b13b..226c15b54fba 100644
--- a/paddlenlp/server/predictor.py
+++ b/paddlenlp/server/predictor.py
@@ -21,6 +21,7 @@
 
 import paddle
 
+from ..utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX
 from ..utils.log import logger
 
 
@@ -40,13 +41,15 @@ def __init__(self, model_path, precision, device):
 
     def _get_default_static_model_path(self):
         # The model path had the static_model_path
-        static_model_path = os.path.join(self._model_path, self._default_static_model_path, "inference.pdmodel")
+        static_model_path = os.path.join(
+            self._model_path, self._default_static_model_path, f"inference{PADDLE_INFERENCE_MODEL_SUFFIX}"
+        )
         if os.path.exists(static_model_path):
             return os.path.join(self._model_path, self._default_static_model_path, "inference")
         for file_name in os.listdir(self._model_path):
             # FIXME(wawltor) The path maybe not correct
-            if file_name.count(".pdmodel"):
-                return os.path.join(self._model_path, file_name[:-8])
+            if file_name.count(PADDLE_INFERENCE_MODEL_SUFFIX):
+                return os.path.join(self._model_path, file_name[: -len(PADDLE_INFERENCE_MODEL_SUFFIX)])
         return None
 
     def _is_int8_model(self, model_path):
@@ -110,7 +113,10 @@ def _prepare_paddle_mode(self, static_model_path):
         """
         Construct the input data and predictor in the PaddlePaddele static mode.
         """
-        self._config = paddle.inference.Config(static_model_path + ".pdmodel", static_model_path + ".pdiparams")
+        self._config = paddle.inference.Config(
+            static_model_path + PADDLE_INFERENCE_MODEL_SUFFIX,
+            static_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX,
+        )
         self._config.disable_glog_info()
         if paddle.get_device() == "cpu":
             self._config.disable_gpu()
@@ -146,7 +152,7 @@ def _prepare_onnx_mode(self, static_model_path):
             os.mkdir(onnx_dir)
         float_onnx_file = os.path.join(onnx_dir, "model.onnx")
         if not os.path.exists(float_onnx_file):
-            model_path = static_model_path + ".pdmodel"
+            model_path = static_model_path + PADDLE_INFERENCE_MODEL_SUFFIX
             params_file = static_model_path + ".pdiparams"
             onnx_model = paddle2onnx.command.c_paddle_to_onnx(
                 model_file=model_path, params_file=params_file, opset_version=13, enable_onnx_checker=True
diff --git a/paddlenlp/taskflow/information_extraction.py b/paddlenlp/taskflow/information_extraction.py
index fac8d7231395..19459187a3b8 100644
--- a/paddlenlp/taskflow/information_extraction.py
+++ b/paddlenlp/taskflow/information_extraction.py
@@ -12,7 +12,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
 import base64
 import json
 import os
@@ -25,7 +24,14 @@
 
 from ..datasets import load_dataset
 from ..layers import GlobalPointerForEntityExtraction, GPLinkerForRelationExtraction
-from ..transformers import UIE, UIEM, UIEX, AutoModel, AutoTokenizer
+from ..transformers import (
+    UIE,
+    UIEM,
+    UIEX,
+    AutoModel,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+)
 from ..utils.doc_parser import DocParser
 from ..utils.env import CONFIG_NAME, LEGACY_CONFIG_NAME
 from ..utils.ie_utils import map_offset, pad_image_data
@@ -115,6 +121,300 @@ def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length
     return max_length
 
 
+LLM_IE_PROMPT = """你是一个阅读理解专家，请提取所给句子与问题，提取实体。请注意，如果存在实体，则一定在原句中逐字出现，请输出对应实体的原文，不要进行额外修改；如果无法提取，请输出“无相应实体”。
+**句子开始**
+{sentence}
+**句子结束**
+**问题开始**
+{prompt}
+**问题结束**
+**回答开始**
+"""
+
+
+class UIELLMTask(Task):
+    def __init__(self, task, model, schema, **kwargs):
+        super().__init__(task=task, model=model, **kwargs)
+        self._dtype = kwargs.get("dtype", "float16")
+        self.kwargs["generation_task"] = task
+        self._tgt_length = kwargs.get("tgt_length", 50)
+        # Token max length
+        self._max_seq_length = kwargs.get("max_seq_length", 512)
+        self._top_k = kwargs.get("top_k", 1)
+        self._top_p = kwargs.get("top_p", 1.0)
+        self._temperature = kwargs.get("temperature", 1.0)
+        self._decode_strategy = kwargs.get("decode_strategy", "greedy_search")
+        self._num_return_sequences = kwargs.get("num_return_sequences", 1)
+        self._prompt = LLM_IE_PROMPT
+
+        self._construct_tokenizer(model)
+        self.set_schema(schema)
+        self._construct_model(model)
+        self._construct_input_spec()
+
+        if not schema:
+            logger.warning(
+                "The schema has not been set yet, please set a schema via set_schema(). "
+                "More details about the setting of schema please refer to https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/taskflow_text.md"
+            )
+            self._schema_tree = None
+        else:
+            self.set_schema(schema)
+
+        self._is_en = False
+
+    def _construct_model(self, model):
+        """
+        Construct the inference model for the predictor.
+        """
+        model_instance = AutoModelForCausalLM.from_pretrained(model, dtype=self._infer_precision)
+        self._model = model_instance
+        self._model.eval()
+
+    def _construct_tokenizer(self, model):
+        """
+        Construct the tokenizer for the predictor.
+        """
+        self._tokenizer = AutoTokenizer.from_pretrained(model)
+
+    def _batchify(self, data, batch_size):
+        """
+        Generate input batches.
+        """
+        # Separates data into some batches.
+        one_batch = []
+        for example in data:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                yield one_batch
+                one_batch = []
+        if one_batch:
+            yield one_batch
+
+    def _preprocess(self, inputs, padding=True, add_special_tokens=True):
+        """
+        Transform the raw text to the model inputs, two steps involved:
+           1) Transform the raw text to token ids.
+           2) Generate the other model inputs from the raw text and token ids.
+        """
+        inputs = self._check_input_text(inputs)
+        return inputs
+
+    def _run_model(self, inputs):
+        """
+        Run the task model from the outputs of the `_tokenize` function.
+        """
+        results = self._multi_stage_predict(inputs)
+        return results
+
+    def _postprocess(self, inputs):
+        """
+        The model output is tag ids, this function will convert the model output to raw text.
+        """
+        return inputs
+
+    def _construct_input_spec(self):
+        """
+        Construct the input spec for the convert dygraph model to static model.
+        """
+        if paddle.get_device().split(":", 1)[0] == "npu":
+            input_spec_dtype = "int32"
+        else:
+            input_spec_dtype = "int64"
+        self._input_spec = [
+            paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"),
+            paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"),
+            paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="attention_mask"),
+        ]
+
+    def _single_stage_predict(self, inputs):
+        inputs = [self._prompt.format(sentence=dic["text"], prompt=dic["prompt"]) for dic in inputs]
+        batch_size = self.kwargs["batch_size"] if "batch_size" in self.kwargs else 1
+        batches = self._batchify(inputs, batch_size)
+        examples = []
+        for input_text in batches:
+            if self._tokenizer.chat_template is not None:
+                input_text = [input_text] if isinstance(input_text, str) else input_text
+                input_text = [self._tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in input_text]
+            tokenized_output = self._tokenizer(
+                input_text,
+                return_tensors="pd",
+                return_position_ids=True,
+                padding_side="left",
+                padding=True,
+                max_new_tokens=self._max_seq_length,
+                truncation=True,
+                truncation_side="left",
+                add_special_tokens=self._tokenizer.chat_template is None,
+            )
+            examples.append(tokenized_output)
+
+        outputs = {}
+        outputs["text"] = inputs
+        outputs["data_loader"] = examples
+
+        batch_size = self.kwargs["batch_size"] if "batch_size" in self.kwargs else 1
+        results = []
+        for batch_inputs in outputs["data_loader"]:
+            result = self._model.generate(
+                **batch_inputs,
+                decode_strategy=self._decode_strategy,
+                top_k=self._top_k,
+                top_p=self._top_p,
+                temperature=self._temperature,
+                max_new_tokens=self._tgt_length,
+                bos_token_id=self._tokenizer.bos_token_id,
+                eos_token_id=self._tokenizer.eos_token_id,
+                pad_token_id=self._tokenizer.pad_token_id,
+                num_return_sequences=self._num_return_sequences,
+                use_cache=True,
+            )
+            results.extend(result[0])
+        out_list = []
+        for x in results:
+            res = self._tokenizer.decode(x.numpy().tolist(), skip_special_tokens=True)
+            res = res.strip("\n")
+            end_idx = res.find("\n**回答结束**")
+            if end_idx != -1:
+                res = res[:end_idx]
+            out_list.append([{"text": res}])
+
+        return out_list
+
+    def _multi_stage_predict(self, data):
+        """
+        Traversal the schema tree and do multi-stage prediction.
+
+        Args:
+            data (list): a list of strings
+
+        Returns:
+            list: a list of predictions, where the list's length
+                equals to the length of `data`
+        """
+        results = [{} for _ in range(len(data))]
+        # Input check to early return
+        if len(data) < 1 or self._schema_tree is None:
+            return results
+
+        # Copy to stay `self._schema_tree` unchanged
+        schema_list = self._schema_tree.children[:]
+        while len(schema_list) > 0:
+            node = schema_list.pop(0)
+            examples = []
+            input_map = {}
+            cnt = 0
+            idx = 0
+            if not node.prefix:
+                for one_data in data:
+                    examples.append(
+                        {
+                            "text": one_data,
+                            "prompt": dbc2sbc(node.name),
+                        }
+                    )
+                    input_map[cnt] = [idx]
+                    idx += 1
+                    cnt += 1
+            else:
+                for pre, one_data in zip(node.prefix, data):
+                    if len(pre) == 0:
+                        input_map[cnt] = []
+                    else:
+                        for p in pre:
+                            prompt = p + node.name
+                            examples.append(
+                                {
+                                    "text": one_data,
+                                    "prompt": dbc2sbc(prompt),
+                                }
+                            )
+                        input_map[cnt] = [i + idx for i in range(len(pre))]
+                        idx += len(pre)
+                    cnt += 1
+            if len(examples) == 0:
+                result_list = []
+            else:
+                result_list = self._single_stage_predict(examples)
+
+            if not node.parent_relations:
+                relations = [[] for i in range(len(data))]
+                for k, v in input_map.items():
+                    for idx in v:
+                        if len(result_list[idx]) == 0:
+                            continue
+                        if node.name not in results[k].keys():
+                            results[k][node.name] = result_list[idx]
+                        else:
+                            results[k][node.name].extend(result_list[idx])
+                    if node.name in results[k].keys():
+                        relations[k].extend(results[k][node.name])
+            else:
+                relations = node.parent_relations
+                for k, v in input_map.items():
+                    for i in range(len(v)):
+                        if len(result_list[v[i]]) == 0:
+                            continue
+                        if "relations" not in relations[k][i].keys():
+                            relations[k][i]["relations"] = {node.name: result_list[v[i]]}
+                        elif node.name not in relations[k][i]["relations"].keys():
+                            relations[k][i]["relations"][node.name] = result_list[v[i]]
+                        else:
+                            relations[k][i]["relations"][node.name].extend(result_list[v[i]])
+                new_relations = [[] for i in range(len(data))]
+                for i in range(len(relations)):
+                    for j in range(len(relations[i])):
+                        if "relations" in relations[i][j].keys() and node.name in relations[i][j]["relations"].keys():
+                            for k in range(len(relations[i][j]["relations"][node.name])):
+                                new_relations[i].append(relations[i][j]["relations"][node.name][k])
+                relations = new_relations
+
+            prefix = [[] for _ in range(len(data))]
+            for k, v in input_map.items():
+                for idx in v:
+                    for i in range(len(result_list[idx])):
+                        if self._is_en:
+                            prefix[k].append(" of " + result_list[idx][i]["text"])
+                        else:
+                            prefix[k].append(result_list[idx][i]["text"] + "的")
+
+            for child in node.children:
+                child.prefix = prefix
+                child.parent_relations = relations
+                schema_list.append(child)
+        return results
+
+    def set_schema(self, schema):
+        if isinstance(schema, dict) or isinstance(schema, str):
+            schema = [schema]
+        self._schema_tree = self._build_tree(schema)
+
+    @classmethod
+    def _build_tree(cls, schema, name="root"):
+        """
+        Build the schema tree.
+        """
+        schema_tree = SchemaTree(name)
+        for s in schema:
+            if isinstance(s, str):
+                schema_tree.add_child(SchemaTree(s))
+            elif isinstance(s, dict):
+                for k, v in s.items():
+                    if isinstance(v, str):
+                        child = [v]
+                    elif isinstance(v, list):
+                        child = v
+                    else:
+                        raise TypeError(
+                            "Invalid schema, value for each key:value pairs should be list or string"
+                            "but {} received".format(type(v))
+                        )
+                    schema_tree.add_child(cls._build_tree(child, name=k))
+            else:
+                raise TypeError("Invalid schema, element should be string or dict, " "but {} received".format(type(s)))
+        return schema_tree
+
+
 class UIETask(Task):
     """
     Universal Information Extraction Task.
@@ -510,7 +810,6 @@ def __init__(self, task, model, schema=None, **kwargs):
                 self._check_task_files()
                 with open(os.path.join(self._task_path, CONFIG_NAME)) as f:
                     self._init_class = json.load(f)["architectures"].pop()
-
         self._is_en = True if model in ["uie-base-en"] or self._schema_lang == "en" else False
 
         if self._init_class in ["UIEX"]:
@@ -583,7 +882,9 @@ def _construct_model(self, model):
         Construct the inference model for the predictor.
         """
         model_instance = MODEL_MAP[self._init_class].from_pretrained(
-            self._task_path, from_hf_hub=self.from_hf_hub, convert_from_torch=self._convert_from_torch
+            self._task_path,
+            from_hf_hub=self.from_hf_hub,
+            convert_from_torch=self._convert_from_torch,
         )
         self._model = model_instance
         self._model.eval()
@@ -621,16 +922,20 @@ def _check_input_text(self, inputs):
                     if "doc" in example.keys():
                         if not self._parser_map[self._ocr_lang_choice]:
                             self._parser_map[self._ocr_lang_choice] = DocParser(
-                                ocr_lang=self._ocr_lang, layout_analysis=self._layout_analysis
+                                ocr_lang=self._ocr_lang,
+                                layout_analysis=self._layout_analysis,
                             )
                         if "layout" in example.keys():
                             data = self._parser_map[self._ocr_lang_choice].parse(
-                                {"doc": example["doc"]}, do_ocr=False, expand_to_a4_size=self._expand_to_a4_size
+                                {"doc": example["doc"]},
+                                do_ocr=False,
+                                expand_to_a4_size=self._expand_to_a4_size,
                             )
                             data["layout"] = example["layout"]
                         else:
                             data = self._parser_map[self._ocr_lang_choice].parse(
-                                {"doc": example["doc"]}, expand_to_a4_size=self._expand_to_a4_size
+                                {"doc": example["doc"]},
+                                expand_to_a4_size=self._expand_to_a4_size,
                             )
                     elif "text" in example.keys():
                         if not isinstance(example["text"], str):
@@ -658,14 +963,16 @@ def _check_input_text(self, inputs):
     def _single_stage_predict(self, inputs):
         input_texts = [d["text"] for d in inputs]
         prompts = [d["prompt"] for d in inputs]
-
         # max predict length should exclude the length of prompt and summary tokens
         max_predict_len = self._max_seq_len - len(max(prompts)) - self._summary_token_num
 
         if self._init_class in ["UIEX"]:
             bbox_list = [d["bbox"] for d in inputs]
             short_input_texts, short_bbox_list, input_mapping = self._auto_splitter(
-                input_texts, max_predict_len, bbox_list=bbox_list, split_sentence=self._split_sentence
+                input_texts,
+                max_predict_len,
+                bbox_list=bbox_list,
+                split_sentence=self._split_sentence,
             )
         else:
             short_input_texts, input_mapping = self._auto_splitter(
@@ -761,7 +1068,14 @@ def _process_bbox(tokens, bbox_lines, offset_mapping, offset_bias):
                 return bbox_list
 
             def _encode_doc(
-                tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len
+                tokenizer,
+                offset_mapping,
+                last_offset,
+                prompt,
+                this_text_line,
+                inputs_ids,
+                q_sep_index,
+                max_seq_len,
             ):
                 if len(offset_mapping) == 0:
                     content_encoded_inputs = tokenizer(
@@ -795,7 +1109,10 @@ def _encode_doc(
                     last_offset = offset_mapping[-1][-1]
                 else:
                     content_encoded_inputs = tokenizer(
-                        text=this_text_line, max_seq_len=max_seq_len, return_dict=False, return_offsets_mapping=True
+                        text=this_text_line,
+                        max_seq_len=max_seq_len,
+                        return_dict=False,
+                        return_offsets_mapping=True,
                     )
                     inputs_ids += content_encoded_inputs["input_ids"][1:-1]
                     sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]]
@@ -842,7 +1159,7 @@ def _encode_doc(
 
                     bbox_list = [[0, 0, 0, 0] for x in range(len(inputs_ids))]
                     token_type_ids = [
-                        1 if token_index <= q_sep_index or token_index > c_sep_index else 0
+                        (1 if token_index <= q_sep_index or token_index > c_sep_index else 0)
                         for token_index in range(self._max_seq_len)
                     ]
                     padded_image = np.zeros([3, 224, 224])
@@ -930,7 +1247,13 @@ def _encode_doc(
                     padded_image,
                     offset_mapping,
                 ]
-                input_list = [inputs_ids, token_type_ids, position_ids, attention_mask, bbox_list]
+                input_list = [
+                    inputs_ids,
+                    token_type_ids,
+                    position_ids,
+                    attention_mask,
+                    bbox_list,
+                ]
                 return_list = [np.array(x, dtype="int64") for x in input_list]
                 return_list.append(np.array(padded_image, dtype="float32"))
                 return_list.append(np.array(offset_mapping, dtype="int64"))
@@ -946,14 +1269,25 @@ def _encode_doc(
         batch_sampler = paddle.io.BatchSampler(dataset=infer_ds, batch_size=self._batch_size, shuffle=False)
 
         infer_data_loader = paddle.io.DataLoader(
-            dataset=infer_ds, batch_sampler=batch_sampler, num_workers=self._num_workers, return_list=True
+            dataset=infer_ds,
+            batch_sampler=batch_sampler,
+            num_workers=self._num_workers,
+            return_list=True,
         )
 
         sentence_ids = []
         probs = []
         for batch in infer_data_loader:
             if self._init_class in ["UIEX"]:
-                input_ids, token_type_ids, pos_ids, att_mask, bbox, image, offset_maps = batch
+                (
+                    input_ids,
+                    token_type_ids,
+                    pos_ids,
+                    att_mask,
+                    bbox,
+                    image,
+                    offset_maps,
+                ) = batch
             elif self._init_class in ["UIEM"]:
                 input_ids, pos_ids, offset_maps = batch
             else:
@@ -1033,7 +1367,10 @@ def _auto_joiner(self, short_results, short_inputs, input_mapping):
                     if len(short_results[v]) == 0:
                         continue
                     if short_results[v][0]["text"] not in cls_options.keys():
-                        cls_options[short_results[v][0]["text"]] = [1, short_results[v][0]["probability"]]
+                        cls_options[short_results[v][0]["text"]] = [
+                            1,
+                            short_results[v][0]["probability"],
+                        ]
                     else:
                         cls_options[short_results[v][0]["text"]][0] += 1
                         cls_options[short_results[v][0]["text"]][1] += short_results[v][0]["probability"]
@@ -1087,7 +1424,14 @@ def _parse_inputs(self, inputs):
                         box = self._parser_map[self._ocr_lang_choice]._normalize_box(box, [img_w, img_h], [1000, 1000])
                         text += segment[1]
                         bbox.extend([box] * len(segment[1]))
-                    _inputs.append({"text": text, "bbox": bbox, "image": d["image"], "layout": d["layout"]})
+                    _inputs.append(
+                        {
+                            "text": text,
+                            "bbox": bbox,
+                            "image": d["image"],
+                            "layout": d["layout"],
+                        }
+                    )
                 else:
                     _inputs.append({"text": d["text"], "bbox": None, "image": None})
             else:
@@ -1162,7 +1506,6 @@ def _multi_stage_predict(self, data):
                 result_list = []
             else:
                 result_list = self._single_stage_predict(examples)
-
             if not node.parent_relations:
                 relations = [[] for i in range(len(data))]
                 for k, v in input_map.items():
@@ -1249,7 +1592,12 @@ def _add_bbox(result, char_boxes):
                     if len(segment) == 2 or (len(segment) == 3 and segment[2] != "table"):
                         char_w = (sbox[2] - sbox[0]) * 1.0 / text_len
                         for i in range(text_len):
-                            cbox = [sbox[0] + i * char_w, sbox[1], sbox[0] + (i + 1) * char_w, sbox[3]]
+                            cbox = [
+                                sbox[0] + i * char_w,
+                                sbox[1],
+                                sbox[0] + (i + 1) * char_w,
+                                sbox[3],
+                            ]
                             char_boxes.append((segment[1][i], cbox))
                     else:
                         cell_bbox = [(segment[1][i], sbox) for i in range(text_len)]
@@ -1281,7 +1629,12 @@ def _convert_ids_to_results(self, examples, sentence_ids, probs):
                     result = {"text": prompt[start:end], "probability": prob[i]}
                     result_list.append(result)
                 else:
-                    result = {"text": text[start:end], "start": start, "end": end, "probability": prob[i]}
+                    result = {
+                        "text": text[start:end],
+                        "start": start,
+                        "end": end,
+                        "probability": prob[i],
+                    }
                     result_list.append(result)
             results.append(result_list)
         return results
@@ -1507,7 +1860,10 @@ def _postprocess_opinion_extraction(self, inputs):
             for rel in all_rel_preds[i]:
                 r = aspect_maps[(rel["aspect"], rel["aspect_start_index"])]
                 r["relations"] = {}
-                sentiment = {"probability": rel["probability"], "text": rel["sentiment"]}
+                sentiment = {
+                    "probability": rel["probability"],
+                    "text": rel["sentiment"],
+                }
                 opinion = {
                     "text": rel["opinion"],
                     "start": rel["opinion_start_index"],
diff --git a/paddlenlp/taskflow/multimodal_feature_extraction.py b/paddlenlp/taskflow/multimodal_feature_extraction.py
index 3e6050081643..7671ba3c3991 100644
--- a/paddlenlp/taskflow/multimodal_feature_extraction.py
+++ b/paddlenlp/taskflow/multimodal_feature_extraction.py
@@ -19,6 +19,7 @@
 from PIL import Image
 
 from ..transformers import AutoModel, AutoProcessor
+from ..utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX
 from ..utils.log import logger
 from .task import Task
 from .utils import dygraph_mode_guard, static_mode_guard
@@ -411,9 +412,9 @@ def _get_inference_model(self):
         self.inference_image_model_path = os.path.join(_base_path, "static", "get_image_features")
         self.inference_text_model_path = os.path.join(_base_path, "static", "get_text_features")
         if (
-            not os.path.exists(self.inference_image_model_path + ".pdiparams")
+            not os.path.exists(self.inference_image_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX)
             or self._param_updated
-            or not os.path.exists(self.inference_text_model_path + ".pdiparams")
+            or not os.path.exists(self.inference_text_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX)
         ):
             with dygraph_mode_guard():
                 self._construct_model(self.model)
@@ -422,8 +423,8 @@ def _get_inference_model(self):
         if self._predictor_type == "paddle-inference":
             # Get text inference model
             self.inference_model_path = self.inference_text_model_path
-            self._static_model_file = self.inference_model_path + ".pdmodel"
-            self._static_params_file = self.inference_model_path + ".pdiparams"
+            self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX
+            self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX
             self._config = paddle.inference.Config(self._static_model_file, self._static_params_file)
             self._prepare_static_mode()
 
@@ -435,8 +436,8 @@ def _get_inference_model(self):
 
             # Get image inference model
             self.inference_model_path = self.inference_image_model_path
-            self._static_model_file = self.inference_model_path + ".pdmodel"
-            self._static_params_file = self.inference_model_path + ".pdiparams"
+            self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX
+            self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX
             self._config = paddle.inference.Config(self._static_model_file, self._static_params_file)
             self._prepare_static_mode()
 
@@ -449,15 +450,15 @@ def _get_inference_model(self):
             # Get text onnx model
             self.export_type = "text"
             self.inference_model_path = self.inference_text_model_path
-            self._static_model_file = self.inference_model_path + ".pdmodel"
-            self._static_params_file = self.inference_model_path + ".pdiparams"
+            self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX
+            self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX
             self._prepare_onnx_mode()
             self.predictor_map["text"] = self.predictor
 
             # Get image onnx model
             self.export_type = "image"
             self.inference_model_path = self.inference_image_model_path
-            self._static_model_file = self.inference_model_path + ".pdmodel"
-            self._static_params_file = self.inference_model_path + ".pdiparams"
+            self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX
+            self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX
             self._prepare_onnx_mode()
             self.predictor_map["image"] = self.predictor
diff --git a/paddlenlp/taskflow/task.py b/paddlenlp/taskflow/task.py
index 22b178b61d35..f7cdd87cec74 100644
--- a/paddlenlp/taskflow/task.py
+++ b/paddlenlp/taskflow/task.py
@@ -20,9 +20,14 @@
 from multiprocessing import cpu_count
 
 import paddle
+from paddle.base.framework import use_pir_api
 from paddle.dataset.common import md5file
 
-from ..utils.env import PPNLP_HOME
+from ..utils.env import (
+    PADDLE_INFERENCE_MODEL_SUFFIX,
+    PADDLE_INFERENCE_WEIGHTS_SUFFIX,
+    PPNLP_HOME,
+)
 from ..utils.log import logger
 from .utils import cut_chinese_sent, download_check, download_file, dygraph_mode_guard
 
@@ -54,7 +59,15 @@ def __init__(self, model, task, priority_path=None, **kwargs):
         self._param_updated = False
 
         self._num_threads = self.kwargs["num_threads"] if "num_threads" in self.kwargs else math.ceil(cpu_count() / 2)
-        self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"
+        if (
+            self.task == "paddlenlp/PP-UIE-0.5B"
+            or self.task == "paddlenlp/PP-UIE-1.5B"
+            or self.task == "paddlenlp/PP-UIE-7B"
+            or self.task == "paddlenlp/PP-UIE-14B"
+        ):
+            self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "float16"
+        else:
+            self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"
         # Default to use Paddle Inference
         self._predictor_type = "paddle-inference"
         # The root directory for storing Taskflow related files, default to ~/.paddlenlp.
@@ -118,12 +131,12 @@ def _construct_input_spec(self):
     def _get_static_model_name(self):
         names = []
         for file_name in os.listdir(self._task_path):
-            if ".pdmodel" in file_name:
-                names.append(file_name[:-8])
+            if PADDLE_INFERENCE_MODEL_SUFFIX in file_name:
+                names.append(file_name[: -len(PADDLE_INFERENCE_MODEL_SUFFIX)])
         if len(names) == 0:
-            raise IOError(f"{self._task_path} should include '.pdmodel' file.")
+            raise IOError(f"{self._task_path} should include '{PADDLE_INFERENCE_MODEL_SUFFIX}' file.")
         if len(names) > 1:
-            logger.warning(f"{self._task_path} includes more than one '.pdmodel' file.")
+            logger.warning(f"{self._task_path} includes more than one '{PADDLE_INFERENCE_MODEL_SUFFIX}' file.")
         return names[0]
 
     def _check_task_files(self):
@@ -212,18 +225,25 @@ def _prepare_static_mode(self):
             # TODO(linjieccc): enable after fixed
             self._config.delete_pass("embedding_eltwise_layernorm_fuse_pass")
             self._config.delete_pass("fused_multi_transformer_encoder_pass")
+            self._config.delete_pass("fused_rotary_position_embedding_pass")
+
+        self._config.switch_ir_optim(True)
+        self._config.enable_new_executor()
+
         self._config.set_cpu_math_library_num_threads(self._num_threads)
         self._config.switch_use_feed_fetch_ops(False)
         self._config.disable_glog_info()
         self._config.enable_memory_optim()
-
         # TODO(linjieccc): some temporary settings and will be remove in future
         # after fixed
-        if self.task in ["document_intelligence", "knowledge_mining", "zero_shot_text_classification"]:
+        if self.task in [
+            "document_intelligence",
+            "knowledge_mining",
+            "zero_shot_text_classification",
+        ]:
             self._config.switch_ir_optim(False)
         if self.model == "uie-data-distill-gp":
             self._config.enable_memory_optim(False)
-
         self.predictor = paddle.inference.create_predictor(self._config)
         self.input_names = [name for name in self.predictor.get_input_names()]
         self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()]
@@ -281,12 +301,14 @@ def _get_inference_model(self):
         """
         if self._custom_model:
             param_path = os.path.join(self._task_path, "model_state.pdparams")
-
             if os.path.exists(param_path):
                 cache_info_path = os.path.join(self._task_path, ".cache_info")
                 md5 = md5file(param_path)
                 self._param_updated = True
-                if os.path.exists(cache_info_path) and open(cache_info_path).read()[:-8] == md5:
+                if (
+                    os.path.exists(cache_info_path)
+                    and open(cache_info_path).read()[: -len(PADDLE_INFERENCE_MODEL_SUFFIX)] == md5
+                ):
                     self._param_updated = False
                 elif self.task == "information_extraction" and self.model != "uie-data-distill-gp":
                     # UIE related models are moved to paddlenlp.transformers after v2.4.5
@@ -296,13 +318,20 @@ def _get_inference_model(self):
                     fp.write(md5 + "taskflow")
                     fp.close()
                     model_state = paddle.load(param_path)
-                    prefix_map = {"UIE": "ernie", "UIEM": "ernie_m", "UIEX": "ernie_layout"}
+                    prefix_map = {
+                        "UIE": "ernie",
+                        "UIEM": "ernie_m",
+                        "UIEX": "ernie_layout",
+                    }
                     new_state_dict = {}
                     for name, param in model_state.items():
                         if "ernie" in name:
                             new_state_dict[name] = param
                         elif "encoder.encoder" in name:
-                            trans_name = name.replace("encoder.encoder", prefix_map[self._init_class] + ".encoder")
+                            trans_name = name.replace(
+                                "encoder.encoder",
+                                prefix_map[self._init_class] + ".encoder",
+                            )
                             new_state_dict[trans_name] = param
                         elif "encoder" in name:
                             trans_name = name.replace("encoder", prefix_map[self._init_class])
@@ -318,11 +347,11 @@ def _get_inference_model(self):
         # When the user-provided model path is already a static model, skip to_static conversion
         if self.is_static_model:
             self.inference_model_path = os.path.join(self._task_path, self._static_model_name)
-            if not os.path.exists(self.inference_model_path + ".pdmodel") or not os.path.exists(
-                self.inference_model_path + ".pdiparams"
+            if not os.path.exists(self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX) or not os.path.exists(
+                self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX
             ):
                 raise IOError(
-                    f"{self._task_path} should include {self._static_model_name + '.pdmodel'} and {self._static_model_name + '.pdiparams'} while is_static_model is True"
+                    f"{self._task_path} should include {self._static_model_name + PADDLE_INFERENCE_MODEL_SUFFIX} and {self._static_model_name + PADDLE_INFERENCE_WEIGHTS_SUFFIX} while is_static_model is True"
                 )
             if self.paddle_quantize_model(self.inference_model_path):
                 self._infer_precision = "int8"
@@ -336,19 +365,20 @@ def _get_inference_model(self):
                 else os.path.join(self._home_path, "taskflow", self.task, self._task_path)
             )
             self.inference_model_path = os.path.join(_base_path, "static", "inference")
-            if not os.path.exists(self.inference_model_path + ".pdiparams") or self._param_updated:
+            if not os.path.exists(self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX) or self._param_updated:
                 with dygraph_mode_guard():
                     self._construct_model(self.model)
                     self._construct_input_spec()
                     self._convert_dygraph_to_static()
 
-        self._static_model_file = self.inference_model_path + ".pdmodel"
-        self._static_params_file = self.inference_model_path + ".pdiparams"
+
+        self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX
+        self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX
 
         if paddle.get_device().split(":", 1)[0] == "npu" and self._infer_precision == "fp16":
             # transform fp32 model tp fp16 model
-            self._static_fp16_model_file = self.inference_model_path + "-fp16.pdmodel"
-            self._static_fp16_params_file = self.inference_model_path + "-fp16.pdiparams"
+            self._static_fp16_model_file = self.inference_model_path + f"-fp16{PADDLE_INFERENCE_MODEL_SUFFIX}"
+            self._static_fp16_params_file = self.inference_model_path + f"-fp16{PADDLE_INFERENCE_WEIGHTS_SUFFIX}"
             if not os.path.exists(self._static_fp16_model_file) and not os.path.exists(self._static_fp16_params_file):
                 logger.info("Converting to the inference model from fp32 to fp16.")
                 paddle.inference.convert_to_mixed_precision(
@@ -368,7 +398,10 @@ def _get_inference_model(self):
             self._static_model_file = self._static_fp16_model_file
             self._static_params_file = self._static_fp16_params_file
         if self._predictor_type == "paddle-inference":
-            self._config = paddle.inference.Config(self._static_model_file, self._static_params_file)
+            if use_pir_api():
+                self._config = paddle.inference.Config(self._static_json_file, self._static_params_file)
+            else:
+                self._config = paddle.inference.Config(self._static_model_file, self._static_params_file)
             self._prepare_static_mode()
         else:
             self._prepare_onnx_mode()
@@ -384,7 +417,8 @@ def _convert_dygraph_to_static(self):
             self._input_spec is not None
         ), "The input spec must be created before converting the dygraph model to static model."
         logger.info("Converting to the inference model cost a little time.")
-        static_model = paddle.jit.to_static(self._model, input_spec=self._input_spec)
+
+        static_model = paddle.jit.to_static(self._model, input_spec=self._input_spec, full_graph=True)
 
         paddle.jit.save(static_model, self.inference_model_path)
         logger.info("The inference model save in the path:{}".format(self.inference_model_path))
@@ -512,7 +546,7 @@ def paddle_quantize_model(self, model_path):
         program = model.program()
         for block in program.blocks:
             for op in block.ops:
-                if op.type.count("quantize"):
+                if "quantize" in op.name():
                     return True
         return False
 
diff --git a/paddlenlp/taskflow/taskflow.py b/paddlenlp/taskflow/taskflow.py
index 520ad4cf5886..fdba3ee9ad3d 100644
--- a/paddlenlp/taskflow/taskflow.py
+++ b/paddlenlp/taskflow/taskflow.py
@@ -23,7 +23,7 @@
 from .dialogue import DialogueTask
 from .document_intelligence import DocPromptTask
 from .fill_mask import FillMaskTask
-from .information_extraction import GPTask, UIETask
+from .information_extraction import GPTask, UIELLMTask, UIETask
 from .knowledge_mining import NPTagTask, WordTagTask
 from .lexical_analysis import LacTask
 from .multimodal_feature_extraction import MultimodalFeatureExtractionTask
@@ -67,7 +67,10 @@
     },
     "dialogue": {
         "models": {
-            "plato-mini": {"task_class": DialogueTask, "task_flag": "dialogue-plato-mini"},
+            "plato-mini": {
+                "task_class": DialogueTask,
+                "task_flag": "dialogue-plato-mini",
+            },
             "__internal_testing__/tiny-random-plato": {
                 "task_class": DialogueTask,
                 "task_flag": "dialogue-tiny-random-plato",
@@ -79,7 +82,10 @@
     },
     "fill_mask": {
         "models": {
-            "fill_mask": {"task_class": FillMaskTask, "task_flag": "fill_mask-fill_mask"},
+            "fill_mask": {
+                "task_class": FillMaskTask,
+                "task_flag": "fill_mask-fill_mask",
+            },
         },
         "default": {
             "model": "fill_mask",
@@ -206,7 +212,10 @@
     },
     "text_correction": {
         "models": {
-            "ernie-csc": {"task_class": CSCTask, "task_flag": "text_correction-ernie-csc"},
+            "ernie-csc": {
+                "task_class": CSCTask,
+                "task_flag": "text_correction-ernie-csc",
+            },
         },
         "default": {"model": "ernie-csc"},
     },
@@ -314,16 +323,56 @@
     },
     "information_extraction": {
         "models": {
-            "uie-base": {"task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-base"},
+            "paddlenlp/PP-UIE-0.5B": {
+                "task_class": UIELLMTask,
+                "hidden_size": 896,
+                "task_flag": "information_extraction-pp-uie-0.5b",
+            },
+            "paddlenlp/PP-UIE-1.5B": {
+                "task_class": UIELLMTask,
+                "hidden_size": 1536,
+                "task_flag": "information_extraction-pp-uie-1.5b",
+            },
+            "paddlenlp/PP-UIE-7B": {
+                "task_class": UIELLMTask,
+                "hidden_size": 3584,
+                "task_flag": "information_extraction-pp-uie-7b",
+            },
+            "paddlenlp/PP-UIE-14B": {
+                "task_class": UIELLMTask,
+                "hidden_size": 5120,
+                "task_flag": "information_extraction-pp-uie-14b",
+            },
+            "uie-base": {
+                "task_class": UIETask,
+                "hidden_size": 768,
+                "task_flag": "information_extraction-uie-base",
+            },
             "uie-medium": {
                 "task_class": UIETask,
                 "hidden_size": 768,
                 "task_flag": "information_extraction-uie-medium",
             },
-            "uie-mini": {"task_class": UIETask, "hidden_size": 384, "task_flag": "information_extraction-uie-mini"},
-            "uie-micro": {"task_class": UIETask, "hidden_size": 384, "task_flag": "information_extraction-uie-micro"},
-            "uie-nano": {"task_class": UIETask, "hidden_size": 312, "task_flag": "information_extraction-uie-nano"},
-            "uie-tiny": {"task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-tiny"},
+            "uie-mini": {
+                "task_class": UIETask,
+                "hidden_size": 384,
+                "task_flag": "information_extraction-uie-mini",
+            },
+            "uie-micro": {
+                "task_class": UIETask,
+                "hidden_size": 384,
+                "task_flag": "information_extraction-uie-micro",
+            },
+            "uie-nano": {
+                "task_class": UIETask,
+                "hidden_size": 312,
+                "task_flag": "information_extraction-uie-nano",
+            },
+            "uie-tiny": {
+                "task_class": UIETask,
+                "hidden_size": 768,
+                "task_flag": "information_extraction-uie-tiny",
+            },
             "uie-medical-base": {
                 "task_class": UIETask,
                 "hidden_size": 768,
@@ -349,7 +398,10 @@
                 "hidden_size": 768,
                 "task_flag": "information_extraction-uie-x-base",
             },
-            "uie-data-distill-gp": {"task_class": GPTask, "task_flag": "information_extraction-uie-data-distill-gp"},
+            "uie-data-distill-gp": {
+                "task_class": GPTask,
+                "task_flag": "information_extraction-uie-data-distill-gp",
+            },
             "__internal_testing__/tiny-random-uie": {
                 "task_class": UIETask,
                 "hidden_size": 8,
@@ -693,6 +745,10 @@
 }
 
 support_schema_list = [
+    "paddlenlp/PP-UIE-0.5B",
+    "paddlenlp/PP-UIE-1.5B",
+    "paddlenlp/PP-UIE-7B",
+    "paddlenlp/PP-UIE-14B",
     "uie-base",
     "uie-medium",
     "uie-mini",
@@ -736,6 +792,10 @@
     "openai/disco-diffusion-clip-rn50",
     "openai/disco-diffusion-clip-rn101",
     "PaddlePaddle/disco_diffusion_ernie_vil-2.0-base-zh",
+    "paddlenlp/PP-UIE-0.5B",
+    "paddlenlp/PP-UIE-1.5B",
+    "paddlenlp/PP-UIE-7B",
+    "paddlenlp/PP-UIE-14B",
     "uie-base",
     "uie-medium",
     "uie-mini",
@@ -807,7 +867,11 @@ def __init__(self, task, model=None, mode=None, device_id=0, from_hf_hub=False,
         self.kwargs = kwargs
         task_class = TASKS[self.task][tag][self.model]["task_class"]
         self.task_instance = task_class(
-            model=self.model, task=self.task, priority_path=self.priority_path, from_hf_hub=from_hf_hub, **self.kwargs
+            model=self.model,
+            task=self.task,
+            priority_path=self.priority_path,
+            from_hf_hub=from_hf_hub,
+            **self.kwargs,
         )
         task_list = TASKS.keys()
         Taskflow.task_list = task_list
diff --git a/paddlenlp/taskflow/text_similarity.py b/paddlenlp/taskflow/text_similarity.py
index 5792125218f0..cb3a296db343 100644
--- a/paddlenlp/taskflow/text_similarity.py
+++ b/paddlenlp/taskflow/text_similarity.py
@@ -196,7 +196,7 @@ def _construct_model(self, model):
         """
 
         if "rocketqav2-en" in model or "ernie-search" in model:
-            self._model = ErnieCrossEncoder(self._task_path, num_classes=1, reinitialize=True)
+            self._model = ErnieCrossEncoder(self._task_path, num_classes=2, reinitialize=True)
         elif "rocketqa" in model:
             self._model = ErnieCrossEncoder(self._task_path, num_classes=2)
         else:
@@ -274,7 +274,6 @@ def _run_model(self, inputs):
         if "rocketqa" in self.model_name or "ernie-search" in self.model_name:
             with static_mode_guard():
                 for batch in inputs["data_loader"]:
-
                     if self._predictor_type == "paddle-inference":
                         input_ids, segment_ids = self._batchify_fn(batch)
                         self.input_handles[0].copy_from_cpu(input_ids)
diff --git a/paddlenlp/taskflow/zero_shot_text_classification.py b/paddlenlp/taskflow/zero_shot_text_classification.py
index 43d9f8ff5756..9bd5251d6048 100644
--- a/paddlenlp/taskflow/zero_shot_text_classification.py
+++ b/paddlenlp/taskflow/zero_shot_text_classification.py
@@ -259,6 +259,7 @@ class ZeroShotTextClassificationTask(Task):
     def __init__(self, task: str, model: str, schema: list = None, **kwargs):
         super().__init__(task=task, model=model, **kwargs)
 
+        self._static_mode = False
         self._set_utc_schema(schema)
         self._max_seq_len = kwargs.get("max_seq_len", 512)
         self._batch_size = kwargs.get("batch_size", 1)
@@ -269,7 +270,10 @@ def __init__(self, task: str, model: str, schema: list = None, **kwargs):
         self._check_task_files()
         self._construct_tokenizer()
         self._check_predictor_type()
-        self._get_inference_model()
+        if self._static_mode:
+            self._get_inference_model()
+        else:
+            self._construct_model(model)
 
     def _set_utc_schema(self, schema):
         if schema is None:
@@ -293,7 +297,7 @@ def _construct_input_spec(self):
             InputSpec(shape=[None, None], dtype="int64", name="input_ids"),
             InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"),
             InputSpec(shape=[None, None], dtype="int64", name="position_ids"),
-            InputSpec(shape=[None, None, None, None], dtype="float32", name="attention_mask"),
+            InputSpec(shape=[None, None], dtype="float32", name="attention_mask"),
             InputSpec(shape=[None, None], dtype="int64", name="omask_positions"),
             InputSpec(shape=[None], dtype="int64", name="cls_positions"),
         ]
@@ -311,7 +315,10 @@ def _construct_tokenizer(self):
         Construct the tokenizer for the predictor.
         """
         self._tokenizer = AutoTokenizer.from_pretrained(self._task_path, from_hf_hub=self.from_hf_hub)
-        self._collator = PromptDataCollatorWithPadding(self._tokenizer, return_tensors="np")
+        if self._static_mode:
+            self._collator = PromptDataCollatorWithPadding(self._tokenizer, return_tensors="np")
+        else:
+            self._collator = PromptDataCollatorWithPadding(self._tokenizer, return_tensors="pd")
         self._template = UTCTemplate(self._tokenizer, self._max_seq_len)
 
     def _check_input_text(self, inputs):
@@ -381,19 +388,26 @@ def _run_model(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
             "omask_positions": "int64",
             "cls_positions": "int64",
         }
-        with static_mode_guard():
+        if self._static_mode:
+            with static_mode_guard():
+                for batch in inputs["batches"]:
+                    if self._predictor_type == "paddle-inference":
+                        for i, input_name in enumerate(self.input_names):
+                            self.input_handles[i].copy_from_cpu(batch[input_name].astype(dtype_dict[input_name]))
+                        self.predictor.run()
+                        logits = self.output_handle[0].copy_to_cpu().tolist()
+                    else:
+                        input_dict = {}
+                        for input_name in dtype_dict:
+                            input_dict[input_name] = batch[input_name].astype(dtype_dict[input_name])
+                        logits = self.predictor.run(None, input_dict)[0].tolist()
+                    outputs["batch_logits"].append(logits)
+        else:
             for batch in inputs["batches"]:
-                if self._predictor_type == "paddle-inference":
-                    for i, input_name in enumerate(self.input_names):
-                        self.input_handles[i].copy_from_cpu(batch[input_name].astype(dtype_dict[input_name]))
-                    self.predictor.run()
-                    logits = self.output_handle[0].copy_to_cpu().tolist()
-                else:
-                    input_dict = {}
-                    for input_name in dtype_dict:
-                        input_dict[input_name] = batch[input_name].astype(dtype_dict[input_name])
-                    logits = self.predictor.run(None, input_dict)[0].tolist()
-                outputs["batch_logits"].append(logits)
+                if batch["soft_token_ids"] is not None:
+                    del batch["soft_token_ids"]
+                logits = self._model(**batch)
+                outputs["batch_logits"].append(np.array(logits))
 
         return outputs
 
diff --git a/paddlenlp/trainer/trainer.py b/paddlenlp/trainer/trainer.py
index d1ce35093084..347b89a36752 100644
--- a/paddlenlp/trainer/trainer.py
+++ b/paddlenlp/trainer/trainer.py
@@ -466,6 +466,9 @@ def fn(layer):
 
         # very last
         self._memory_tracker.stop_and_update_metrics()
+        if self.args.count_trained_tokens:
+            self.trained_effective_tokens = 0
+            self.trained_tokens = 0
 
     def _wrap_amp_model(self, args, model):
         logger.info("Using half precision")
@@ -1122,6 +1125,9 @@ def _inner_training_loop(
                     is_no_sync = True
 
                 sync_context = model.no_sync() if is_no_sync else contextlib.nullcontext()
+                if self.args.count_trained_tokens:
+                    self.trained_effective_tokens += (inputs["input_ids"] != self.args.pad_token_id).sum()
+                    self.trained_tokens += inputs["input_ids"].numel()
                 with sync_context:
                     if "step_control" in inspect.signature(self.training_step).parameters:
                         tr_loss_step = self.training_step(model, inputs, step_control=step_control)
@@ -1505,13 +1511,13 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval,
             )
 
             seq_length = None
-            model_flops = None
+            model_flops_per_token = None
             if getattr(self, "is_pretraining", False) and hasattr(self.model, "config"):
                 seq_length = getattr(self.model.config, "seq_length", None)
                 try:
-                    model_flops = self.model.get_hardware_flops(seq_length=seq_length, recompute=self.args.recompute)
+                    model_flops_per_token = self.model.get_hardware_flops()
                 except NotImplementedError:
-                    model_flops = None
+                    model_flops_per_token = None
 
             # Do not log speed metrics if all steps are skipped since last log.
             if num_steps > 0:
@@ -1522,7 +1528,7 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval,
                         num_samples=total_train_batch_size * num_steps,
                         num_steps=num_steps,
                         seq_length=seq_length,
-                        model_flops=model_flops,
+                        model_flops_per_token=model_flops_per_token,
                     )
                 )
 
@@ -1570,6 +1576,27 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval,
             self._save_checkpoint(model, metrics=metrics)
             logger.info(f"{self.runtime_timer.log()}")
             self.control = self.callback_handler.on_save(self.args, self.state, self.control)
+            self.log_trained_tokens()
+
+    def log_trained_tokens(self):
+        if self.args.count_trained_tokens:
+            token_list = []
+            for token_num in [self.trained_effective_tokens, self.trained_tokens]:
+                tensors = token_num.reshape([1])
+                if self.hcg._sharding_degree > 1:
+                    output_tensors = []
+                    paddle.distributed.all_gather(output_tensors, tensors, group=self.hcg._sharding_comm_group)
+                    tensors = paddle.concat(output_tensors).sum().reshape([1])
+                if self.hcg._dp_degree > 1:
+                    output_tensors = []
+                    paddle.distributed.all_gather(output_tensors, tensors, group=self.hcg._dp_comm_group)
+                    tensors = paddle.concat(output_tensors).sum().reshape([1])
+                token_list.append(tensors.item())
+            if self.is_local_process_zero():
+
+                logger.info(
+                    f"Update to now, trained_effective_tokens: {token_list[0]}, trained_tokens: {token_list[1]}."
+                )
 
     def _get_learning_rate(self):
         return self.optimizer.get_lr()
@@ -2546,7 +2573,49 @@ def _save_checkpoint(self, model, metrics=None):
         else:
             self.save_model(output_dir)
 
-        # only save model state dict, ignore optimizer and scheduler
+        # Determine the new best metric / best model checkpoint
+        if metrics is not None and self.args.metric_for_best_model is not None:
+            metric_to_check = self.args.metric_for_best_model
+            if not metric_to_check.startswith("eval_"):
+                metric_to_check = f"eval_{metric_to_check}"
+            metric_value = metrics[metric_to_check]
+
+            operator = np.greater if self.args.greater_is_better else np.less
+            if (
+                self.state.best_metric is None
+                or self.state.best_model_checkpoint is None
+                or operator(metric_value, self.state.best_metric)
+            ):
+                self.state.best_metric = metric_value
+                self.state.best_model_checkpoint = output_dir
+
+        # Save the Trainer state
+        if self.args.should_save:
+            self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))
+
+        # Save RNG state in non-distributed training
+        rng_states = {
+            "python": random.getstate(),
+            "numpy": np.random.get_state(),
+            "cuda": paddle.get_rng_state(),
+            "cpu": paddle.framework.core.default_cpu_generator().get_state(),
+        }
+        if self.args.use_hybrid_parallel:
+            rng_states[
+                "hybrid_parallel_rng_state_tracker"
+            ] = fleet.meta_parallel.get_rng_state_tracker().get_states_tracker()
+
+        if self.args.world_size > 1:
+            rng_states_list = []
+            paddle.distributed.all_gather_object(rng_states_list, rng_states)
+            if self.args.should_save:
+                os.makedirs(output_dir, exist_ok=True)
+                paddle.save(rng_states_list, os.path.join(output_dir, f"rng_state_{self.args.world_size}.pth"))
+        else:
+            os.makedirs(output_dir, exist_ok=True)
+            paddle.save(rng_states, os.path.join(output_dir, "rng_state.pth"))
+
+            # only save model state dict, ignore optimizer and scheduler
         if not self.args.ignore_save_lr_and_optim:
             optimizer_name = _add_variant(OPTIMIZER_NAME, self.args.optimizer_name_suffix)
             saved_signal_path = os.path.join(output_dir, f"saved_signal_{dist.get_rank()}")
@@ -2632,47 +2701,6 @@ def _save_checkpoint(self, model, metrics=None):
                             paddle.save(global_rank, os.path.join(signal_dir, f".master_weight.done.{global_rank}"))
 
         self.runtime_timer.stop()
-        # Determine the new best metric / best model checkpoint
-        if metrics is not None and self.args.metric_for_best_model is not None:
-            metric_to_check = self.args.metric_for_best_model
-            if not metric_to_check.startswith("eval_"):
-                metric_to_check = f"eval_{metric_to_check}"
-            metric_value = metrics[metric_to_check]
-
-            operator = np.greater if self.args.greater_is_better else np.less
-            if (
-                self.state.best_metric is None
-                or self.state.best_model_checkpoint is None
-                or operator(metric_value, self.state.best_metric)
-            ):
-                self.state.best_metric = metric_value
-                self.state.best_model_checkpoint = output_dir
-
-        # Save the Trainer state
-        if self.args.should_save:
-            self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))
-
-        # Save RNG state in non-distributed training
-        rng_states = {
-            "python": random.getstate(),
-            "numpy": np.random.get_state(),
-            "cuda": paddle.get_rng_state(),
-            "cpu": paddle.framework.core.default_cpu_generator().get_state(),
-        }
-        if self.args.use_hybrid_parallel:
-            rng_states[
-                "hybrid_parallel_rng_state_tracker"
-            ] = fleet.meta_parallel.get_rng_state_tracker().get_states_tracker()
-
-        if self.args.world_size > 1:
-            rng_states_list = []
-            paddle.distributed.all_gather_object(rng_states_list, rng_states)
-            if self.args.should_save:
-                os.makedirs(output_dir, exist_ok=True)
-                paddle.save(rng_states_list, os.path.join(output_dir, f"rng_state_{self.args.world_size}.pth"))
-        else:
-            os.makedirs(output_dir, exist_ok=True)
-            paddle.save(rng_states, os.path.join(output_dir, "rng_state.pth"))
 
         # Maybe delete some older checkpoints.
         # For hybrid parallel training, the checkpoint files maybe on different node.
diff --git a/paddlenlp/trainer/trainer_compress.py b/paddlenlp/trainer/trainer_compress.py
index f2f945cd128f..44420ca9de02 100644
--- a/paddlenlp/trainer/trainer_compress.py
+++ b/paddlenlp/trainer/trainer_compress.py
@@ -38,6 +38,7 @@
     prepare_qkv_ofa,
     reorder_neuron_head,
 )
+from ..utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX
 from ..utils.log import logger
 from .trainer import Trainer
 
@@ -651,8 +652,8 @@ def _batch_generator_func():
             executor=exe,
             batch_generator=_batch_generator_func,
             model_dir=model_dir,
-            model_filename=args.input_filename_prefix + ".pdmodel",
-            params_filename=args.input_filename_prefix + ".pdiparams",
+            model_filename=args.input_filename_prefix + PADDLE_INFERENCE_MODEL_SUFFIX,
+            params_filename=args.input_filename_prefix + PADDLE_INFERENCE_WEIGHTS_SUFFIX,
             batch_size=batch_size,
             batch_nums=batch_nums,
             scope=None,
@@ -675,8 +676,8 @@ def _batch_generator_func():
         save_model_path = os.path.join(model_dir, algo + "_".join([str(batch_size), str(batch_nums)]))
         post_training_quantization.save_quantized_model(
             save_model_path=save_model_path,
-            model_filename=args.output_filename_prefix + ".pdmodel",
-            params_filename=args.output_filename_prefix + ".pdiparams",
+            model_filename=args.output_filename_prefix + PADDLE_INFERENCE_MODEL_SUFFIX,
+            params_filename=args.output_filename_prefix + PADDLE_INFERENCE_WEIGHTS_SUFFIX,
         )
         output_dir_list.append(save_model_path)
 
diff --git a/paddlenlp/trainer/trainer_utils.py b/paddlenlp/trainer/trainer_utils.py
index 538f4c8ec32d..6a3f9754712e 100644
--- a/paddlenlp/trainer/trainer_utils.py
+++ b/paddlenlp/trainer/trainer_utils.py
@@ -359,7 +359,7 @@ def total_processes_number(local_rank):
     return 1
 
 
-def speed_metrics(split, start_time, num_samples=None, num_steps=None, seq_length=None, model_flops=None):
+def speed_metrics(split, start_time, num_samples=None, num_steps=None, seq_length=None, model_flops_per_token=None):
     """
     Measure and return speed performance metrics.
 
@@ -380,9 +380,9 @@ def speed_metrics(split, start_time, num_samples=None, num_steps=None, seq_lengt
         if seq_length is not None:
             tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size()
             result[f"{split}_tokens_per_second_per_device"] = round(tokens_per_second_per_device, 4)
-        if model_flops is not None:
+        if model_flops_per_token is not None:
             result[f"{split}_hardware_tflops_per_device"] = round(
-                tokens_per_second_per_device * model_flops / seq_length / 2**40, 2
+                tokens_per_second_per_device * model_flops_per_token / 2**40, 2
             )
 
     if num_steps is not None:
diff --git a/paddlenlp/trainer/training_args.py b/paddlenlp/trainer/training_args.py
index 1a6e7ea41362..b52e5137dd5e 100644
--- a/paddlenlp/trainer/training_args.py
+++ b/paddlenlp/trainer/training_args.py
@@ -978,6 +978,14 @@ class TrainingArguments:
         default=300,
         metadata={"help": "Timeout seconds for downloading checkpoint from remote cluster."},
     )
+    count_trained_tokens: bool = field(
+        default=False,
+        metadata={"help": "Whether to count trained tokens."},
+    )
+    pad_token_id: int = field(
+        default=0,
+        metadata={"help": "The id of the padding token."},
+    )
 
     def __post_init__(self):
         if in_auto_parallel_align_mode():
@@ -1632,13 +1640,12 @@ def is_segment_parallel_supported():
                             "enable_mp_async_allreduce",  # allreduce_matmul_grad_overlapping in auto_parallel
                             "enable_delay_scale_loss",
                             "replace_with_c_embedding",
-                            # "enable_mp_skip_c_identity",
                             # "enable_mp_fused_linear_param_grad_add",
                             "replace_with_parallel_cross_entropy",
                         ]:
                             raise ValueError(
                                 f"Found unknown tensor parallell config {x}, "
-                                f"accept config is enable_mp_async_allreduce, replace_with_c_embedding, enable_mp_skip_c_identity and enable_mp_fused_linear_param_grad_add"
+                                f"accept config is enable_mp_async_allreduce, replace_with_c_embedding, and enable_mp_fused_linear_param_grad_add"
                             )
                 try:
                     if "enable_mp_async_allreduce" in mp_config:
diff --git a/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py b/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py
index 19e88b402bcf..9b162d4a88c1 100644
--- a/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py
+++ b/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py
@@ -305,7 +305,11 @@ def load_resolved_archive_file(
                     )
                 )
         if has_master_weights:
-            key_name = "_".join([static_name, FP32_MASTER, key_name[1]])
+            if model_state_dict[key_name[0]].dtype != paddle.float32:
+                key_name = "_".join([static_name, FP32_MASTER, key_name[1]])
+            else:
+                # for parameters with float32 dtype, no need to have fp32 master weights.
+                key_name = "_".join([static_name, key_name[1]])
         else:
             key_name = "_".join([static_name, key_name[1]])
 
diff --git a/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py b/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py
index cd999f1dba46..41ba54972efb 100644
--- a/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py
+++ b/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py
@@ -67,6 +67,7 @@
     FP32_MASTER,
     UnifiedCheckpointOption,
     filter_params,
+    filter_sync_parameters,
     gather_sharded_object,
     generate_base_static_name,
     get_expected_state_dict,
@@ -218,25 +219,9 @@ def save_non_merge_optimizer(self, model, optim_state_dict, master_weights, outp
             for key in list(master_weights.keys()):
                 master_weights[static2struct_name_mappings[key]] = master_weights.pop(key)
 
-        no_sync_kname = []
-        model_state_dict = get_expected_state_dict(model)
-        for k, v in model_state_dict.items():
-            if getattr(v, "no_sync", False):
-                no_sync_kname.append(k)
-
-        hcg = fleet.get_hybrid_communicate_group()
-        dp_group = hcg.get_data_parallel_group()
-        dp_rank = dp_group.rank if dp_group.nranks > 1 else 0
         if self.args.use_expert_parallel:
-            for k in list(optim_state_dict.keys()):
-                model_k = k.split("/")[0]
-                if dp_rank > 0 and model_k not in no_sync_kname:
-                    optim_state_dict.pop(k)
-            if master_weights is not None:
-                for k in list(master_weights.keys()):
-                    model_k = k.split("/")[0]
-                    if dp_rank > 0 and model_k not in no_sync_kname:
-                        master_weights.pop(k)
+            model_state_dict = get_expected_state_dict(model)
+            filter_sync_parameters(model_state_dict, optim_state_dict, master_weights, is_model_weight=False)
 
         optimizer_name = _add_variant(SAFE_OPTIMIZER_NAME, self.args.optimizer_name_suffix)
         master_weights_name = _add_variant(SAFE_MASTER_WEIGHTS_NAME, self.args.optimizer_name_suffix)
@@ -516,6 +501,10 @@ def unified_checkpoint_into_shards(
 
     config_to_save = copy.deepcopy(model_to_save.config)
 
+    if args.use_expert_parallel:
+        # ignore saving `no_sync=False` tensors when using expert_parallel under dp_rank > 0.
+        filter_sync_parameters(state_dict, is_model_weight=True)
+
     if config_to_save.tensor_parallel_degree > 1:
         if isinstance(model_to_save, LoRAModel) or isinstance(model_to_save, PrefixModelForCausalLM):
             tp_actions = model_to_save._get_tensor_parallel_convert_actions(
@@ -625,6 +614,9 @@ def unified_optimizer_into_shards(
     tp_group = fleet.get_hybrid_communicate_group().get_model_parallel_group()
     tp_size = tp_group.nranks
 
+    if args.use_expert_parallel:
+        filter_sync_parameters(state_dict, optim_state_dict, master_weights, is_model_weight=False)
+
     if tp_size > 1:
         # get tp_actions
         model_keys = []
@@ -643,7 +635,6 @@ def unified_optimizer_into_shards(
             optim_state_dict,
             tp_actions,
             filter_optim_keys,
-            state_dict if args.use_expert_parallel else None,
         )
         empty_device_cache()
 
@@ -653,7 +644,6 @@ def unified_optimizer_into_shards(
                 master_weights,
                 tp_actions,
                 filter_master_keys,
-                state_dict if args.use_expert_parallel else None,
             )
             empty_device_cache()
 
diff --git a/paddlenlp/trainer/unified_checkpoint/utils.py b/paddlenlp/trainer/unified_checkpoint/utils.py
index bbb49ae14820..413ca7c47210 100644
--- a/paddlenlp/trainer/unified_checkpoint/utils.py
+++ b/paddlenlp/trainer/unified_checkpoint/utils.py
@@ -354,9 +354,7 @@ def merge_tensor_parallel_with_shard(state_dict, tp_actions, all_filter_keys):
     """
     hcg = fleet.get_hybrid_communicate_group()
     tp_group = hcg.get_model_parallel_group()
-    dp_group = hcg.get_data_parallel_group()
     tp_rank = tp_group.rank
-    dp_rank = dp_group.rank if dp_group.nranks > 1 else 0
 
     # filter actions for pipeline mode
     if hcg.get_pipe_parallel_group().nranks > 1:
@@ -373,10 +371,9 @@ def merge_tensor_parallel_with_shard(state_dict, tp_actions, all_filter_keys):
             if i > len(filter_keys) - 1:
                 continue
             key = filter_keys[i]
-            tensor = state_dict[key]
-            # When using expert parallel, there's no need to save tensors with `no_sync=False` when dp_rank > 0.
-            if dp_rank > 0 and not getattr(tensor, "no_sync", False):
+            if key not in state_dict:
                 continue
+            tensor = state_dict[key]
             if key in tp_actions:
                 # Get tensor size
                 tensor_bytes = tensor.numel().item() * dtype_byte_size(tensor.dtype) * tp_group.nranks
@@ -405,21 +402,13 @@ def merge_tensor_parallel_with_shard(state_dict, tp_actions, all_filter_keys):
     return state_dict_to_save
 
 
-def merge_tensor_parallel_for_optimizer(state_dict, tp_actions, all_filter_keys, model_state_dict=None):
+def merge_tensor_parallel_for_optimizer(state_dict, tp_actions, all_filter_keys):
     """
     Merge tensor parallel according to tp_actions, used for master_weight and optimizer weight.
     """
     hcg = fleet.get_hybrid_communicate_group()
     tp_group = hcg.get_model_parallel_group()
-    dp_group = hcg.get_data_parallel_group()
     tp_rank = tp_group.rank
-    dp_rank = dp_group.rank if dp_group.nranks > 1 else 0
-
-    no_sync_kname = []
-    if model_state_dict is not None:
-        for k, v in model_state_dict.items():
-            if getattr(v, "no_sync", False):
-                no_sync_kname.append(k)
 
     state_dict_to_save = {}
     max_key_len = max([len(_) for _ in all_filter_keys])
@@ -430,10 +419,9 @@ def merge_tensor_parallel_for_optimizer(state_dict, tp_actions, all_filter_keys,
                 continue
             # get base model key
             model_key = filter_keys[i].split("/")[0]
-            tensor = state_dict[filter_keys[i]]
-            # When using expert parallel, there's no need to save tensors with `no_sync=False` when dp_rank > 0.
-            if dp_rank > 0 and model_key not in no_sync_kname:
+            if filter_keys[i] not in state_dict:
                 continue
+            tensor = state_dict[filter_keys[i]]
             if model_key in tp_actions:
                 # for example: beta1, beta2
                 if tensor.numel().item() == 1:
@@ -770,3 +758,31 @@ def save_config(model_to_save):
     # save generation config
     if model_to_save.can_generate():
         model_to_save.generation_config.save_pretrained(save_directory)
+
+
+def filter_sync_parameters(model_state_dict, optim_state_dict=None, master_weights=None, is_model_weight=True):
+    """Filter sync parameters under expert parallel mode."""
+
+    hcg = fleet.get_hybrid_communicate_group()
+    dp_group = hcg.get_data_parallel_group()
+    dp_rank = dp_group.rank if dp_group.nranks > 1 else 0
+
+    if is_model_weight:
+        for key in list(model_state_dict.keys()):
+            if dp_rank > 0 and not getattr(model_state_dict[key], "no_sync", False):
+                model_state_dict.pop(key)
+    else:
+        no_sync_kname = []
+        for k, v in model_state_dict.items():
+            if getattr(v, "no_sync", False):
+                no_sync_kname.append(k)
+
+        for key in list(optim_state_dict.keys()):
+            model_key = key.split("/")[0]
+            if dp_rank > 0 and model_key not in no_sync_kname:
+                optim_state_dict.pop(key)
+
+        if master_weights is not None:
+            for key in list(master_weights.keys()):
+                if dp_rank > 0 and key not in no_sync_kname:
+                    master_weights.pop(key)
diff --git a/paddlenlp/trainer/utils/ckpt_converter.py b/paddlenlp/trainer/utils/ckpt_converter.py
index 63fbe4250875..23f085e18f44 100644
--- a/paddlenlp/trainer/utils/ckpt_converter.py
+++ b/paddlenlp/trainer/utils/ckpt_converter.py
@@ -16,6 +16,7 @@
 import os
 import re
 from functools import reduce
+from typing import List, Union
 
 import paddle
 from paddle.distributed.checkpoint.load_state_dict import (
@@ -47,7 +48,7 @@ def __init__(
         parameter_to_structured_name,
         trainging_args=None,
         patch_dict=None,
-        local_view_pattern: list | bool = None,
+        local_view_pattern: Union[List, bool] = None,
     ):
         self.use_dist = True if paddle.distributed.get_world_size() > 1 else False
         self.path = hybrid_parallel_ckpt_path
diff --git a/paddlenlp/transformers/__init__.py b/paddlenlp/transformers/__init__.py
index e420babc3142..0e99466f8e4b 100644
--- a/paddlenlp/transformers/__init__.py
+++ b/paddlenlp/transformers/__init__.py
@@ -143,11 +143,8 @@
 from .deberta_v2.configuration import *
 from .deberta_v2.modeling import *
 from .deberta_v2.tokenizer import *
-from .deepseek_v2.configuration import *
-from .deepseek_v2.modeling import *
-from .deepseek_v2.tokenizer_fast import *
-from .deepseek_v3.configuration import *
-from .deepseek_v3.modeling import *
+from .deepseek_v2 import *
+from .deepseek_v3 import *
 from .distilbert.configuration import *
 from .distilbert.modeling import *
 from .distilbert.tokenizer import *
@@ -215,6 +212,7 @@
 from .layoutxlm.modeling import *
 from .layoutxlm.tokenizer import *
 from .llama import *
+from .llm_embed.modeling import *
 from .luke.configuration import *
 from .luke.modeling import *
 from .luke.tokenizer import *
@@ -246,6 +244,7 @@
 from .nezha.configuration import *
 from .nezha.modeling import *
 from .nezha.tokenizer import *
+from .nv_embed.modeling import *
 from .nystromformer.configuration import *
 from .nystromformer.modeling import *
 from .nystromformer.tokenizer import *
diff --git a/paddlenlp/transformers/auto/configuration.py b/paddlenlp/transformers/auto/configuration.py
index d800252a7a5e..ebc33c3cbd40 100644
--- a/paddlenlp/transformers/auto/configuration.py
+++ b/paddlenlp/transformers/auto/configuration.py
@@ -213,6 +213,14 @@
     ]
 )
 
+MULTI_MODELS_MAPPING = OrderedDict(
+    # multi models mapping
+    [
+        ("qwen2_vl", "qwen2"),
+        ("qwen2_5_vl", "qwen2"),
+    ]
+)
+
 
 def config_class_to_model_type(config):
     """Converts a config class name to the corresponding model type"""
@@ -238,8 +246,9 @@ def __init__(self, mapping):
 
     def __getitem__(self, key):
         # NOTE: (changwenbin) This is to enable the qwen2_vl language model to use qwen2 reasoning optimization
-        if key == "qwen2_vl":
-            key = "qwen2"
+        for model_type, model_key in MULTI_MODELS_MAPPING.items():
+            if key == model_type:
+                key = model_key
         if key in self._extra_content:
             return self._extra_content[key]
         if key not in self._mapping:
diff --git a/paddlenlp/transformers/auto/modeling.py b/paddlenlp/transformers/auto/modeling.py
index 27103eada754..38e773f56bb4 100644
--- a/paddlenlp/transformers/auto/modeling.py
+++ b/paddlenlp/transformers/auto/modeling.py
@@ -822,12 +822,16 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
         tensor_parallel_degree = kwargs.pop("tensor_parallel_degree", 1)
         tensor_parallel_rank = kwargs.pop("tensor_parallel_rank", 0)
         model_arg = kwargs.pop("model_args", None)
-        is_eagle = kwargs.pop("is_eagle", False)
-        eagle_flag = ""
+        spec_model_type = kwargs.pop("spec_model_type", "None")
+        spec_flag = ""
 
         # Check whether the model_type is img2txt in inference mode
-        if is_eagle:
-            eagle_flag = "Eagle"
+        if spec_model_type == "eagle":
+            spec_flag = "Eagle"
+            attn_type = "Block"
+            model_name = f"{config.architectures[0]}{attn_type}"
+        elif spec_model_type == "mtp":
+            spec_flag = "MTP"
             attn_type = "Block"
             model_name = f"{config.architectures[0]}{attn_type}"
         else:
@@ -849,7 +853,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
         # Import the InferenceModel
         import_class = importlib.import_module(f"paddlenlp.experimental.transformers.{config.model_type}.modeling")
 
-        model_class_name = f"{eagle_flag}{model_name}InferenceModel"
+        model_class_name = f"{spec_flag}{model_name}InferenceModel"
         model_class = getattr(import_class, model_class_name)
 
         # It may return a new model class, like LlamaForCausalLMAvxInferenceModel
diff --git a/paddlenlp/transformers/chatglm/modeling.py b/paddlenlp/transformers/chatglm/modeling.py
index 5e3d8e493896..708b068f5fd8 100755
--- a/paddlenlp/transformers/chatglm/modeling.py
+++ b/paddlenlp/transformers/chatglm/modeling.py
@@ -132,14 +132,7 @@ def forward(self, position_ids):
             cos_cached = emb.cos().unsqueeze(1).cast(self.default_dtype)
             sin_cached = emb.sin().unsqueeze(1).cast(self.default_dtype)
 
-            if hasattr(paddle.framework, "_no_check_dy2st_diff"):
-                # TODO(daisiming): _no_check_dy2st_diff is used to turn off the checking of behavior
-                # inconsistency between dynamic graph and static graph. _no_check_dy2st_diff should be
-                # removed after static graphs support inplace and stride.
-                with paddle.framework._no_check_dy2st_diff():
-                    self.cos_cached, self.sin_cached = cos_cached, sin_cached
-            else:
-                self.cos_cached, self.sin_cached = cos_cached, sin_cached
+            self.cos_cached, self.sin_cached = cos_cached, sin_cached
 
         cos, sin = self.cos_cached[:seq_len, ...], self.sin_cached[:seq_len, ...]
         if self.position_encoding_2d:
diff --git a/paddlenlp/transformers/configuration_utils.py b/paddlenlp/transformers/configuration_utils.py
index 5ecd7f907db6..40db64a003bf 100644
--- a/paddlenlp/transformers/configuration_utils.py
+++ b/paddlenlp/transformers/configuration_utils.py
@@ -235,6 +235,7 @@ class LlmMetaConfig:
         ("use_fused_rope", bool, False, "Enable rope fusion or not."),
         ("use_fused_linear", bool, False, "GPT3 model, use fused linear layer"),
         ("use_fused_dropout_add", bool, False, "GPT3 model, use fused `dropout + residual add` op."),
+        ("use_fused_linear_cross_entropy", bool, False, "use fused `linear + cross_entropy` fuse op."),
     ]
 
     hybrid_parallel_attributes = [
diff --git a/paddlenlp/transformers/conversion_utils.py b/paddlenlp/transformers/conversion_utils.py
index d4258baa1c34..e95d94f8a3ed 100644
--- a/paddlenlp/transformers/conversion_utils.py
+++ b/paddlenlp/transformers/conversion_utils.py
@@ -1311,6 +1311,7 @@ def _get_tensor_parallel_mappings(cls, config: PretrainedConfig, is_split=True)
     def _resolve_prefix_keys(state_keys_base, state_keys_real, ignore_error=False):
         # state_keys_map base to real
         state_keys_map = {}
+
         # sorted by length，match from long to short for A.key B.key ...
         state_keys_base = sorted(state_keys_base, key=lambda x: len(x), reverse=True)
         state_keys_real = set(state_keys_real)
diff --git a/paddlenlp/transformers/deepseek_v2/__init__.py b/paddlenlp/transformers/deepseek_v2/__init__.py
index 5144d20699db..f68a341b4fbc 100644
--- a/paddlenlp/transformers/deepseek_v2/__init__.py
+++ b/paddlenlp/transformers/deepseek_v2/__init__.py
@@ -14,4 +14,6 @@
 
 from .configuration import *
 from .modeling import *
+from .modeling_auto import *
+from .modeling_pp import *
 from .tokenizer_fast import *
diff --git a/paddlenlp/transformers/deepseek_v2/configuration.py b/paddlenlp/transformers/deepseek_v2/configuration.py
index 6883b5cf7802..221e732b3f47 100644
--- a/paddlenlp/transformers/deepseek_v2/configuration.py
+++ b/paddlenlp/transformers/deepseek_v2/configuration.py
@@ -116,6 +116,8 @@ class DeepseekV2Config(PretrainedConfig):
             Whether to use a bias in the query, key, value and output projection layers during self-attention.
         attention_dropout (`float`, *optional*, defaults to 0.0):
             The dropout ratio for the attention probabilities.
+        speculate_model_type (`str`, defaults to `None`, *optional*, defaults to `False`):
+            The model type for speculate. Support ['eagle', 'mtp'] Now.
 
     ```python
     >>> from paddlenlp.transformers import DeepseekV2Model, DeepseekV2Config
@@ -174,6 +176,7 @@ def __init__(
         rope_scaling=None,
         attention_bias=False,
         attention_dropout=0.0,
+        speculate_model_type=False,
         **kwargs,
     ):
         self.vocab_size = vocab_size
@@ -218,6 +221,7 @@ def __init__(
         self.rope_scaling = rope_scaling
         self.attention_bias = attention_bias
         self.attention_dropout = attention_dropout
+        self.speculate_model_type = speculate_model_type
 
         super().__init__(
             pad_token_id=pad_token_id,
diff --git a/paddlenlp/transformers/deepseek_v2/mfu_utils.py b/paddlenlp/transformers/deepseek_v2/mfu_utils.py
new file mode 100644
index 000000000000..3574ec16e5f8
--- /dev/null
+++ b/paddlenlp/transformers/deepseek_v2/mfu_utils.py
@@ -0,0 +1,206 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# https://github.com/GHGmc2/deepseek-projection/blob/af62687fba22e3362469a343d048a1235047388c/projection/deepseek_proj.py#L1
+
+
+class DeepSeekProjection:
+    def __init__(self, model_config, train_options=None):
+        self._model_config = model_config
+        self._train_options = train_options
+
+        # for internal usage
+        (
+            self._vocab_size,
+            self._max_seq_len,
+            self._dim,
+            self._intermediate_size,
+            self._moe_intermediate_size,
+            self._n_layers,
+            self._n_dense_layers,
+            self._n_heads,
+            self._qk_nope_head_dim,
+            self._q_lora_rank,
+            self._kv_lora_rank,
+            self._qk_rope_head_dim,
+            self._n_experts_shared,
+            self._n_experts_routed,
+            self._router_top_k,
+        ) = (
+            model_config.vocab_size,
+            model_config.seq_length,
+            model_config.hidden_size,
+            model_config.intermediate_size,
+            model_config.moe_intermediate_size,
+            model_config.num_hidden_layers,  #
+            model_config.first_k_dense_replace,  #
+            model_config.num_attention_heads,  #
+            model_config.qk_nope_head_dim,  #
+            model_config.q_lora_rank,  #
+            model_config.kv_lora_rank,  #
+            model_config.qk_rope_head_dim,  #
+            model_config.n_shared_experts,  #
+            model_config.n_routed_experts,  #
+            model_config.num_experts_per_tok,
+        )
+
+        if train_options is not None:
+            self._causal_mask = train_options.causal_mask
+            self._fused_atten = train_options.fused_atten
+            # self._bytes_of_dtype = train_options.use_dtype.bytes_of_dtype()
+        else:
+            self._causal_mask = True
+            self._fused_atten = True
+
+    def get_num_params(self, include_embedding: bool = True) -> tuple[int, int]:
+        num_params_embedding = 0
+        if include_embedding:
+            num_params_embedding = (
+                self._vocab_size
+                * self._dim  # Word Token Embedding(WTE)
+                # + self._max_seq_len * self._dim  # Word Position Embedding (WPE)
+            )
+
+        # MLA projection for Q, K and V
+        if self._q_lora_rank is None:
+            num_params_proj_q = self._dim * self._n_heads * (self._qk_nope_head_dim + self._qk_rope_head_dim)
+        else:
+            num_params_down_q = self._dim * self._q_lora_rank
+            num_params_up_q = self._q_lora_rank * self._n_heads * self._qk_nope_head_dim
+            num_params_rope_q = self._q_lora_rank * self._n_heads * self._qk_rope_head_dim
+            num_params_proj_q = num_params_down_q + num_params_up_q + num_params_rope_q
+        num_params_down_kv = self._dim * self._kv_lora_rank
+        num_params_up_k = self._kv_lora_rank * self._n_heads * self._qk_nope_head_dim
+        num_params_rope_k = self._dim * self._qk_rope_head_dim
+        num_params_up_v = self._kv_lora_rank * self._n_heads * self._qk_nope_head_dim
+        # out proj
+        num_params_o = self._n_heads * self._qk_nope_head_dim * self._dim  # v_head_dim = qk_nope_head_dim
+        num_params_atten = (
+            num_params_proj_q
+            + num_params_down_kv
+            + num_params_up_k
+            + num_params_rope_k
+            + num_params_up_v
+            + num_params_o
+        )
+
+        num_params_ffn = self._dim * self._moe_intermediate_size * 3
+        num_params_ffn_dense = self._dim * self._intermediate_size * 3
+        # MoE, the sparse param count
+        num_params_gate = 0
+        n_experts = self._n_experts_routed + self._n_experts_shared
+        num_params_ffn_activated = num_params_ffn
+        if n_experts > 1:
+            num_params_gate = self._dim * self._n_experts_routed
+            num_params_ffn *= n_experts
+            num_params_ffn_activated *= self._n_experts_shared + self._router_top_k
+
+        num_params_norm = 2 * self._dim
+        # additional RMSNorm after the compressed latent vectors
+        num_params_norm += self._kv_lora_rank + 0 if self._q_lora_rank is None else self._q_lora_rank
+
+        num_params_final_norm = self._dim
+
+        num_params = (
+            num_params_embedding
+            + self._n_dense_layers * (num_params_atten + num_params_norm + num_params_ffn_dense)
+            + (self._n_layers - self._n_dense_layers)
+            * (num_params_atten + num_params_norm + num_params_ffn + num_params_gate)
+            + num_params_final_norm
+        )
+
+        num_params_activated = (
+            num_params_embedding
+            + self._n_dense_layers * (num_params_atten + num_params_norm + num_params_ffn_dense)
+            + (self._n_layers - self._n_dense_layers)
+            * (num_params_atten + num_params_norm + num_params_ffn_activated + num_params_gate)
+            + num_params_final_norm
+        )
+        return num_params, num_params_activated
+
+    def get_num_flop_fwd(self, batch_size: int) -> int:
+        # MLA projection of Q, K and V
+        if self._q_lora_rank is None:
+            num_flop_proj_q = (
+                2
+                * batch_size
+                * self._max_seq_len
+                * self._dim
+                * self._n_heads
+                * (self._qk_nope_head_dim + self._qk_rope_head_dim)
+            )
+        else:
+            num_flop_down_q = 2 * batch_size * self._max_seq_len * self._dim * self._q_lora_rank
+            num_flop_up_q = (
+                2 * batch_size * self._max_seq_len * self._q_lora_rank * self._qk_nope_head_dim * self._n_heads
+            )
+            num_flop_rope_q = (
+                2 * batch_size * self._max_seq_len * self._q_lora_rank * self._qk_rope_head_dim * self._n_heads
+            )
+            num_flop_proj_q = num_flop_down_q + num_flop_up_q + num_flop_rope_q
+        num_flop_down_k = 2 * batch_size * self._max_seq_len * self._dim * self._kv_lora_rank
+        num_flop_up_k = (
+            2 * batch_size * self._max_seq_len * self._kv_lora_rank * self._qk_nope_head_dim * self._n_heads
+        )
+        num_flop_rope_k = 2 * batch_size * self._max_seq_len * self._dim * self._qk_rope_head_dim
+        num_flop_proj_k = num_flop_down_k + num_flop_up_k + num_flop_rope_k
+        num_flop_proj_v = 2 * batch_size * self._max_seq_len * self._qk_nope_head_dim * self._n_heads * self._dim
+        num_flop_qkv_proj = num_flop_proj_q + num_flop_proj_k + num_flop_proj_v
+
+        # see the discussion: https://github.com/pytorch/torchtitan/pull/280
+        num_flop_sdpa = 4 * batch_size * self._max_seq_len**2 * self._dim
+        num_flop_sdpa //= 2 if self._causal_mask else 1
+        num_flop_out_proj = 2 * batch_size * self._max_seq_len * self._dim**2
+        num_flop_fwd_atten = num_flop_qkv_proj + num_flop_sdpa + num_flop_out_proj
+
+        num_flop_fwd_ffn = (2 * batch_size * self._max_seq_len * self._dim * self._moe_intermediate_size) * 3
+        num_flop_fwd_ffn_dense = (2 * batch_size * self._max_seq_len * self._dim * self._intermediate_size) * 3
+        # MoE, the active param
+        n_experts = self._n_experts_shared + self._n_experts_routed
+        if n_experts > 1:
+            num_flop_fwd_ffn *= self._n_experts_shared + self._router_top_k  # num of activated experts
+            num_flop_gate = 2 * batch_size * self._max_seq_len * self._dim * self._n_experts_routed
+            num_flop_fwd_ffn += num_flop_gate
+
+        num_flop_fwd_logits = 2 * batch_size * self._max_seq_len * self._dim * self._vocab_size
+
+        return (
+            self._n_dense_layers * (num_flop_fwd_atten + num_flop_fwd_ffn_dense)
+            + (self._n_layers - self._n_dense_layers) * (num_flop_fwd_atten + num_flop_fwd_ffn)
+            + num_flop_fwd_logits
+        )
+
+    def get_num_flop_per_token(self):
+        batch_size = 1  # dummy
+        num_flop_per_token = self.get_num_flop_fwd(batch_size) / batch_size / self._max_seq_len * 3  # bwd = 2 * fwd
+        print("num_flop_per_token:\t", num_flop_per_token)
+        return num_flop_per_token
+
+    def _get_num_flop_QK_fwd(self, batch_size: int) -> int:
+        """
+        Forward FLOPs for QK^T of all chunked transformer blocks, which is re-computed on backward by Flash attention
+        """
+        num_flop_qk = self._n_layers * (2 * batch_size * self._max_seq_len**2 * self._dim)
+        num_flop_qk //= 2 if self._causal_mask else 1
+        return num_flop_qk
+
+    def get_num_flop_bwd(self, batch_size: int) -> int:
+        num_flop_fwd = self.get_num_flop_fwd(batch_size)
+        num_flop_bwd = num_flop_fwd * 2
+        # Flash-attention uses re-computation for QK^T
+        if self._fused_atten:
+            qk_fwd_flop = self._get_num_flop_QK_fwd(batch_size)
+            num_flop_bwd += qk_fwd_flop
+
+        return num_flop_bwd
diff --git a/paddlenlp/transformers/deepseek_v2/modeling.py b/paddlenlp/transformers/deepseek_v2/modeling.py
index 8d57239324a7..ee58e1b638d9 100644
--- a/paddlenlp/transformers/deepseek_v2/modeling.py
+++ b/paddlenlp/transformers/deepseek_v2/modeling.py
@@ -17,7 +17,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Paddle DeepSeek model."""
+"""Paddle DeepSeek model."""
+
 from __future__ import annotations
 
 import math
@@ -60,15 +61,22 @@
 from ..activations import ACT2FN
 from ..conversion_utils import StateDictNameMapping, init_name_mappings
 from ..linear_utils import Linear
+from ..llama import fusion_ops
+from ..llama.modeling import get_use_casual_mask
 from ..model_outputs import (
     BaseModelOutputWithPast,
     CausalLMOutputWithPast,
     SequenceClassifierOutputWithPast,
 )
 from ..model_utils import PretrainedModel, register_base_model
+from ..moe_gate import PretrainedMoEGate
+from ..moe_layer import MoELayer
+from ..utils import device_guard
 from .configuration import DeepseekV2Config
 
 __all__ = [
+    "DeepseekV2LMHead",
+    "DeepseekV2PretrainingCriterion",
     "DeepseekV2ForCausalLM",
     "DeepseekV2ForSequenceClassification",
     "DeepseekV2Model",
@@ -155,34 +163,49 @@ def scaled_dot_product_attention(
     value_states,
     attention_mask,
     output_attentions,
+    attn_mask_startend_row_indices=None,
     softmax_scale=1.0,
     training=True,
     sequence_parallel=False,
 ):
     bsz, q_len, num_heads, head_dim = query_states.shape
-    _, kv_seq_len, _, v_head_dim = value_states.shape
+    _, kv_seq_len, v_num_heads, v_head_dim = value_states.shape
 
     if config.use_flash_attention and flash_attention:
         # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim]
         # Torch Flash Attention input [ bz, nhead, seqlen, head_dim]
 
-        attn_output = F.scaled_dot_product_attention(
+        # Note: Flash Attention does not support softmax_scale, so we need to scale the query_states
+        q_head_dim = query_states.shape[-1]
+        softmax_scale = softmax_scale * (q_head_dim**0.5)
+        query_states = query_states * softmax_scale
+        value_padding = paddle.zeros(
+            [bsz, kv_seq_len, v_num_heads, head_dim - v_head_dim],
+            dtype=value_states.dtype,
+        )
+        value_states = paddle.concat([value_states, value_padding], axis=-1)
+
+        outputs = fusion_ops.fusion_flash_attention(
             query_states,
+            config,
             key_states,
             value_states,
-            attn_mask=attention_mask,
-            is_causal=attention_mask is None,
-            dropout_p=config.attention_dropout if training else 0.0,
-            training=training,
+            attention_mask,
+            output_attentions,
+            attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+            sequence_parallel=sequence_parallel,
         )
-        attn_output *= (head_dim ** (0.5)) * softmax_scale
-        attn_weights = None
 
-        if sequence_parallel:
-            attn_output = attn_output.reshape([bsz * q_len, v_head_dim * num_heads])
+        if isinstance(outputs, tuple):
+            outputs[0] = outputs[0].reshape([bsz, q_len, v_num_heads, head_dim])
+            outputs[0] = outputs[0][..., :v_head_dim]
+            outputs[0] = outputs[0].reshape([bsz, q_len, -1])
         else:
-            attn_output = attn_output.reshape([bsz, q_len, v_head_dim * num_heads])
-        return (attn_output, attn_weights) if output_attentions else attn_output
+            outputs = outputs.reshape([bsz, q_len, v_num_heads, head_dim])
+            outputs = outputs[..., :v_head_dim]
+            outputs = outputs.reshape([bsz, q_len, -1])
+        return outputs
+
     else:
         #  [ bz, seqlen, nhead, head_dim] -> [bs, nhead, seq_len, head_dim]
         query_states = paddle.transpose(query_states, [0, 2, 1, 3])
@@ -228,7 +251,7 @@ def scaled_dot_product_attention(
 
 def masked_fill(x, mask, value):
     y = paddle.full(x.shape, value, x.dtype)
-    return paddle.where(mask, y, x)
+    return paddle.where(mask.to("bool"), y, x)
 
 
 def is_casual_mask(attention_mask):
@@ -300,6 +323,18 @@ def __init__(self, config: DeepseekV2Config, hidden_size=None, eps=1e-6, use_seq
             mark_as_sequence_parallel_parameter(self.weight)
 
     def forward(self, hidden_states):
+        if self.config.use_fused_rms_norm and get_env_device() == "xpu":
+            if self.weight.dtype != hidden_states.dtype:
+                hidden_states = paddle.cast(hidden_states, self.weight.dtype)
+            try:
+                import paddle_xpu_nn  # noqa: F821
+
+                return paddle_xpu_nn.xpu_rms_norm(hidden_states, self.weight, self.variance_epsilon)[0]
+            except ImportError:
+                raise NotImplementedError(
+                    f"Implementation of fused_rms_norm is not available on {get_env_device()}. Please install paddle_xpu to use this feature"
+                )
+
         if paddle.in_dynamic_mode():
             with paddle.amp.auto_cast(False):
                 hidden_states = hidden_states.astype("float32")
@@ -323,8 +358,11 @@ def __init__(self, dim, max_position_embeddings=2048, base=10000):
         self.max_position_embeddings = max_position_embeddings
         self.base = base
         # [dim / 2]
-        self.inv_freq = 1.0 / (self.base ** (paddle.cast(paddle.arange(0, self.dim, 2), dtype="float32") / self.dim))
-        self._set_cos_sin_cache(seq_len=max_position_embeddings)
+        with device_guard("cpu"):
+            self.inv_freq = 1.0 / (
+                self.base ** (paddle.cast(paddle.arange(0, self.dim, 2), dtype="float32") / self.dim)
+            )
+            self._set_cos_sin_cache(seq_len=max_position_embeddings)
 
         self.max_seq_len_cached = None
 
@@ -524,7 +562,7 @@ def rotate_half(x):
     return paddle.concat([-x2, x1], axis=-1)  # shape is the same as x
 
 
-def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids, fuse_rope=False):
     """Applies Rotary Position Embedding to the query and key tensors.
 
     Args:
@@ -545,6 +583,24 @@ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
     Returns:
         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
     """
+    b, s, h, d = q.shape
+    q = q.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])
+
+    b, s, h, d = k.shape
+    k = k.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])
+
+    if get_env_device() == "xpu" and fuse_rope:
+        q_embed, k_embed, _ = fused_rotary_position_embedding(
+            q,
+            k,
+            None,
+            sin=sin,
+            cos=cos,
+            position_ids=position_ids,
+            use_neox_rotary_style=False,
+        )
+        return q_embed, k_embed
+
     if position_ids is None:
         # Note: Only for MixtralForCausalLMPipe model pretraining
         cos = cos[:, : q.shape[1], :, :]  # [bs, seq_len, 1, axis]
@@ -555,12 +611,6 @@ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
         cos = cos[position_ids].unsqueeze(2)  # [bs, seq_len, 1, axis]
         sin = sin[position_ids].unsqueeze(2)  # [bs, seq_len, 1, axis]
 
-    b, s, h, d = q.shape
-    q = q.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])
-
-    b, s, h, d = k.shape
-    k = k.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])
-
     q_embed = (q * cos) + (rotate_half(q) * sin)
     k_embed = (k * cos) + (rotate_half(k) * sin)
     return q_embed, k_embed
@@ -611,110 +661,44 @@ def forward(self, x):
         return down_proj
 
 
-class MoEGate(nn.Layer):
-    def __init__(self, config: DeepseekV2Config):
-        super().__init__()
-        self.config = config
-        self.top_k = config.num_experts_per_tok
-        self.n_routed_experts = config.n_routed_experts
-        self.routed_scaling_factor = config.routed_scaling_factor
+class MoEGate(PretrainedMoEGate):
+    def __init__(self, config, num_experts, expert_hidden_size, **kwargs):
+        super().__init__(config, num_experts, expert_hidden_size, **kwargs)
+        # [hidden_size, n_expert]
+
         self.scoring_func = config.scoring_func
-        self.alpha = config.aux_loss_alpha
-        self.seq_aux = config.seq_aux
         self.topk_method = config.topk_method
-        self.n_group = config.n_group
-        self.topk_group = config.topk_group
 
-        # topk selection algorithm
-        self.norm_topk_prob = config.norm_topk_prob
-        self.gating_dim = config.hidden_size
         self.weight = paddle.create_parameter(
-            shape=[self.gating_dim, self.n_routed_experts],
+            shape=[expert_hidden_size, num_experts],
             dtype=paddle.get_default_dtype(),
+            is_bias=False,
             default_initializer=nn.initializer.Constant(1.0),
         )
 
-        if self.topk_method == "noaux_tc":
+        if config.topk_method == "noaux_tc":
             self.e_score_correction_bias = paddle.create_parameter(
-                shape=[self.n_routed_experts],
+                shape=[num_experts],
                 dtype=paddle.get_default_dtype(),
                 default_initializer=nn.initializer.Constant(0.0),
             )
 
     def forward(self, hidden_states):
-        bsz, seq_len, h = hidden_states.shape
+        """
+        Args:
+            hidden_states (_type_): [batch_size * seq_len, hidden_size]
+        """
+        _, h_dim = hidden_states.shape
+
         # compute gating score
-        hidden_states = hidden_states.reshape([-1, h])
+        logits = F.linear(hidden_states, self.weight, None)
+
         with paddle.amp.auto_cast(False):
-            logits = F.linear(
-                paddle.cast(hidden_states, paddle.float32), paddle.cast(self.weight, paddle.float32), None
-            )
+            scores = self.gate_score_func(logits=logits)
+            scores = scores.cast(paddle.get_default_dtype())
 
-        if self.scoring_func == "softmax":
-            with paddle.amp.auto_cast(False):
-                scores = F.softmax(logits.astype("float32"), axis=-1)
-        elif self.scoring_func == "sigmoid":
-            with paddle.amp.auto_cast(False):
-                scores = F.sigmoid(logits.astype("float32"))
-        else:
-            raise NotImplementedError(f"insupportable scoring function for MoE gating: {self.scoring_func}")
-
-        # select top-k experts
-        if self.topk_method == "greedy":
-            topk_weight, topk_idx = paddle.topk(scores, k=self.top_k, axis=-1, sorted=False)
-        elif self.topk_method in ["group_limited_greedy", "noaux_tc"]:
-            if self.topk_method == "group_limited_greedy":
-                group_scores = scores.reshape([bsz * seq_len, self.n_group, -1]).max(axis=-1).values  # [n, n_group]
-            elif self.topk_method == "noaux_tc":
-                assert not self.training
-                scores = scores.reshape([bsz * seq_len, -1]) + self.e_score_correction_bias.unsqueeze(0)
-                group_scores = (
-                    scores.reshape([bsz * seq_len, self.n_group, -1]).topk(2, axis=-1)[0].sum(axis=-1)
-                )  # [n, n_group]
-            group_idx = paddle.topk(group_scores, k=self.topk_group, axis=-1, sorted=False)[1]  # [n, top_k_group]
-            group_mask = paddle.zeros_like(group_scores)  # [n, n_group]
-            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
-            score_mask = (
-                group_mask.unsqueeze(-1)
-                .expand(bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group)
-                .reshape(bsz * seq_len, -1)
-            )  # [n, e]
-            tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
-            topk_weight, topk_idx = paddle.topk(tmp_scores, k=self.top_k, axis=-1, sorted=False)
-            topk_weight = scores.gather(topk_idx, axis=1) if self.topk_method == "noaux_tc" else topk_weight
-
-        # norm gate to sum 1
-        if self.top_k > 1 and self.norm_topk_prob:
-            denominator = topk_weight.sum(axis=-1, keepdim=True) + 1e-20
-            topk_weight = topk_weight / denominator
-        else:
-            topk_weight = topk_weight * self.routed_scaling_factor
-        # expert-level computation auxiliary loss
-        if self.training and self.alpha > 0.0:
-            scores_for_aux = scores
-            aux_topk = self.top_k
-            # always compute aux loss based on the naive greedy topk method
-            topk_idx_for_aux_loss = topk_idx.reshape([bsz, -1])  # [bsz, top_k*seq_len]
-            if self.seq_aux:
-                scores_for_seq_aux = scores_for_aux.reshape([bsz, seq_len, -1])
-                ce = paddle.zeros([bsz, self.n_routed_experts])
-                ce.put_along_axis_(
-                    axis=1,
-                    indices=topk_idx_for_aux_loss,
-                    values=paddle.ones([bsz, seq_len * aux_topk]),
-                    reduce="add",
-                )
-                ce /= seq_len * aux_topk / self.n_routed_experts
-                aux_loss = (ce * scores_for_seq_aux.mean(axis=1)).sum(axis=1).mean() * self.alpha
-            else:
-                mask_ce = F.one_hot(topk_idx_for_aux_loss.reshape([-1]), num_classes=self.n_routed_experts)
-                ce = mask_ce.float().mean(0)
-                Pi = scores_for_aux.mean(0)
-                fi = ce * self.n_routed_experts
-                aux_loss = (Pi * fi).sum() * self.alpha
-        else:
-            aux_loss = None
-        return topk_idx, topk_weight, aux_loss
+        capacity, combine_weights, dispatch_mask, exp_counts, l_aux, l_zloss = self.topkgating(scores)
+        return capacity, combine_weights, dispatch_mask, exp_counts, l_aux, l_zloss
 
 
 class AddAuxiliaryLoss(paddle.autograd.PyLayer):
@@ -738,49 +722,47 @@ def backward(ctx, grad_output):
         return grad_output, grad_loss
 
 
-class DeepseekV2MoE(nn.Layer):
+class DeepseekV2MoE(MoELayer):
     """
     A mixed expert module containing shared experts.
     """
 
-    def __init__(self, config):
-        super().__init__()
-        self.config = config
-        self.num_experts_per_tok = config.num_experts_per_tok
+    def __init__(self, config: DeepseekV2Config):
+        gate = MoEGate(
+            config=config,
+            num_experts=config.n_routed_experts,
+            expert_hidden_size=config.hidden_size,
+            top_k=config.num_experts_per_tok,
+            topk_method=config.topk_method,
+            n_group=config.n_group,
+            topk_group=config.topk_group,
+            norm_topk_prob=config.norm_topk_prob,
+            routed_scaling_factor=config.routed_scaling_factor,
+            drop_tokens=False,
+        )
 
-        self.ep_size = 1
-        self.experts_per_rank = config.n_routed_experts
-        self.ep_rank = 0
-        self.experts = nn.LayerList(
-            [
-                DeepseekV2MLP(config, intermediate_size=config.moe_intermediate_size, is_moe=True)
-                for i in range(config.n_routed_experts)
-            ]
+        super().__init__(
+            config=config,
+            moe_num_experts=config.n_routed_experts,
+            expert_class=DeepseekV2MLP,
+            expert_kwargs={"config": config, "intermediate_size": config.moe_intermediate_size, "is_moe": True},
+            gate=gate,
+            capacity=2.0,
         )
-        self.gate = MoEGate(config)
+        self.alpha = config.aux_loss_alpha
         if config.n_shared_experts is not None:
             intermediate_size = config.moe_intermediate_size * config.n_shared_experts
-            self.shared_experts = DeepseekV2MLP(config=config, intermediate_size=intermediate_size, is_moe=True)
+            self.shared_experts = DeepseekV2MLP(config=config, intermediate_size=intermediate_size, is_moe=False)
 
     def forward(self, hidden_states):
-        identity = hidden_states
-        orig_shape = hidden_states.shape
-        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
-        hidden_states = hidden_states.reshape([-1, hidden_states.shape[-1]])
-        flat_topk_idx = topk_idx.reshape([-1])
-        # remove the infer method
-        hidden_states = hidden_states.repeat_interleave(self.num_experts_per_tok, axis=0)
-        y = paddle.empty_like(hidden_states)
-        for i, expert in enumerate(self.experts):
-            if paddle.any(flat_topk_idx == i):
-                y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
-        y = (y.reshape([*topk_weight.shape, -1]) * topk_weight.unsqueeze(-1)).sum(axis=1)
-        y = paddle.cast(y, hidden_states.dtype).reshape([*orig_shape])
-        if self.training and self.gate.alpha > 0.0:
-            y = AddAuxiliaryLoss.apply(y, aux_loss)
+        final_hidden_states, l_aux, l_zloss = super().forward(hidden_states)
+        if self.training and self.alpha > 0.0:
+            final_hidden_states = AddAuxiliaryLoss.apply(final_hidden_states, l_aux)
+
         if self.config.n_shared_experts is not None:
-            y = y + self.shared_experts(identity)
-        return y
+            shared_expert_output = self.shared_experts(hidden_states)
+            final_hidden_states = final_hidden_states + shared_expert_output
+        return final_hidden_states
 
 
 def repeat_kv(hidden_states: paddle.Tensor, n_rep: int) -> paddle.Tensor:
@@ -818,6 +800,7 @@ def __init__(self, config: DeepseekV2Config, layerwise_recompute: bool = False):
         self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim
 
         self.is_causal = True
+        self.fuse_rope = config.use_fused_rope
 
         self.seq_length = config.seq_length
         self.sequence_parallel = config.sequence_parallel
@@ -939,11 +922,12 @@ def _shape(self, tensor: paddle.Tensor, seq_len: int, bsz: int):
     def forward(
         self,
         hidden_states: paddle.Tensor,
-        attention_mask: Optional[paddle.Tensor] = None,
-        position_ids: Optional[paddle.Tensor] = None,
+        position_ids: Optional[Tuple[paddle.Tensor]] = None,
         past_key_value: Optional[Tuple[paddle.Tensor]] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
         output_attentions: bool = False,
         use_cache: bool = False,
+        attn_mask_startend_row_indices: Optional[paddle.Tensor] = None,
         **kwargs,
     ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]:
         if "padding_mask" in kwargs:
@@ -979,7 +963,7 @@ def forward(
         cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
         cos = cos[None, :, None, :]
         sin = sin[None, :, None, :]
-        q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
+        q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids, self.fuse_rope)
 
         query_states = paddle.empty([bsz, q_len, self.num_heads, self.q_head_dim], dtype=self.config.dtype)
         query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
@@ -1011,6 +995,7 @@ def forward(
                 value_states,
                 attention_mask,
                 output_attentions,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 softmax_scale=self.softmax_scale,
                 training=self.training,
                 sequence_parallel=self.sequence_parallel,
@@ -1024,6 +1009,7 @@ def forward(
                 value_states,
                 attention_mask,
                 output_attentions,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 softmax_scale=self.softmax_scale,
                 training=self.training,
                 sequence_parallel=self.sequence_parallel,
@@ -1046,6 +1032,7 @@ def forward(
 class DeepseekV2DecoderLayer(nn.Layer):
     def __init__(self, config: DeepseekV2Config, layer_idx: int, layerwise_recompute: bool = False):
         super().__init__()
+        self.config = config
 
         self.enable_recompute = False
         self.layerwise_recompute = layerwise_recompute
@@ -1070,11 +1057,12 @@ def __init__(self, config: DeepseekV2Config, layer_idx: int, layerwise_recompute
     def forward(
         self,
         hidden_states: paddle.Tensor,
-        attention_mask: Optional[paddle.Tensor] = None,
         position_ids: Optional[paddle.Tensor] = None,
-        past_key_value: Optional[Tuple[paddle.Tensor]] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
         output_attentions: Optional[bool] = False,
+        past_key_value: Optional[Tuple[paddle.Tensor]] = None,
         use_cache: Optional[bool] = False,
+        attn_mask_startend_row_indices: Optional[paddle.Tensor] = None,
         **kwargs,
     ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]:
         """
@@ -1107,24 +1095,26 @@ def forward(
             and has_gradient
             and self.recompute_granularity == "full_attn"
         ):
-            recompute()
-            hidden_states, self_attn_weights, present_key_value = self.self_attn(
+            hidden_states, self_attn_weights, present_key_value = recompute(
+                self.self_attn,
                 hidden_states=hidden_states,
-                attention_mask=attention_mask,
                 position_ids=position_ids,
-                past_key_value=past_key_value,
+                attention_mask=attention_mask,
                 output_attentions=output_attentions,
+                past_key_value=past_key_value,
                 use_cache=use_cache,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 **kwargs,
             )
         else:
             hidden_states, self_attn_weights, present_key_value = self.self_attn(
                 hidden_states=hidden_states,
-                attention_mask=attention_mask,
                 position_ids=position_ids,
-                past_key_value=past_key_value,
+                attention_mask=attention_mask,
                 output_attentions=output_attentions,
+                past_key_value=past_key_value,
                 use_cache=use_cache,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 **kwargs,
             )
         hidden_states = residual + hidden_states
@@ -1143,6 +1133,9 @@ def forward(
         if use_cache:
             outputs += (present_key_value,)
 
+        if type(outputs) is tuple and len(outputs) == 1:
+            outputs = outputs[0]
+
         return outputs
 
 
@@ -1151,6 +1144,22 @@ class DeepseekV2PretrainedModel(PretrainedModel):
     base_model_prefix = "deepseek_v2"
     _no_split_modules = ["DeepseekV2DecoderLayer"]
 
+    def _get_model_flops(self, batch_size=1, seq_length=None, **kwargs):
+        from .mfu_utils import DeepSeekProjection
+
+        # self._
+        mfu_cal_proj = DeepSeekProjection(self.config)
+        if seq_length is None:
+            if hasattr(self.config, "seq_length"):
+                seq_length = self.config.seq_length
+            else:
+                seq_length = 2048
+
+        return mfu_cal_proj.get_num_flop_per_token()
+
+    def _get_hardware_flops(self, *args, **kwargs):
+        return self._get_model_flops(*args, **kwargs)
+
     @classmethod
     def _get_name_mappings(cls, config: DeepseekV2Config) -> list[StateDictNameMapping]:
         mappings: list[StateDictNameMapping] = []
@@ -1256,6 +1265,10 @@ def get_tensor_parallel_split_mappings(num_layers):
             base_actions["layers.0.mlp.gate_proj.weight"] = partial(fn, is_column=True)
             base_actions["layers.0.mlp.down_proj.weight"] = partial(fn, is_column=False)
 
+            base_actions["layers.0.mlp.shared_experts.gate_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.mlp.shared_experts.up_proj.weight"] = partial(fn, is_column=True)
+            base_actions["layers.0.mlp.shared_experts.down_proj.weight"] = partial(fn, is_column=False)
+
             for key, action in base_actions.items():
                 if "layers.0." in key:
                     for i in range(num_layers):
@@ -1299,7 +1312,6 @@ def _init_weights(self, layer):
                 linear_utils.ColumnSequenceParallelLinear,
             ),
         ):
-
             # In the dygraph mode, use the `set_value` to reset the parameter directly,
             # and reset the `state_dict` to update parameter in static mode.
             if isinstance(layer.weight, paddle.Tensor):
@@ -1421,11 +1433,12 @@ def recompute_training_full(
         self,
         layer_module: nn.Layer,
         hidden_states: Tensor,
-        attention_mask: Tensor,
         position_ids: Optional[Tensor],
-        past_key_value: Tensor,
+        attention_mask: Tensor,
         output_attentions: bool,
+        past_key_value: Tensor,
         use_cache: bool,
+        attn_mask_startend_row_indices: Optional[Tensor] = None,
     ):
         def create_custom_forward(module):
             def custom_forward(*inputs):
@@ -1436,11 +1449,12 @@ def custom_forward(*inputs):
         hidden_states = recompute(
             create_custom_forward(layer_module),
             hidden_states,
-            attention_mask,
             position_ids,
-            past_key_value,
+            attention_mask,
             output_attentions,
+            past_key_value,
             use_cache,
+            attn_mask_startend_row_indices,
             use_reentrant=self.config.recompute_use_reentrant,
         )
 
@@ -1449,14 +1463,16 @@ def custom_forward(*inputs):
     def forward(
         self,
         input_ids: paddle.Tensor = None,
-        attention_mask: Optional[paddle.Tensor] = None,
         position_ids: Optional[paddle.Tensor] = None,
-        past_key_values: Optional[List[paddle.Tensor]] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
         inputs_embeds: Optional[paddle.Tensor] = None,
         use_cache: Optional[bool] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
+        attn_mask_startend_row_indices: Optional[Tensor] = None,
+        **kwargs,
     ) -> Union[Tuple, BaseModelOutputWithPast]:
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
@@ -1505,17 +1521,20 @@ def forward(
             inputs_embeds = self.embed_tokens(input_ids)
 
         # embed positions
-        if attention_mask is None:
+        if attn_mask_startend_row_indices is not None or get_use_casual_mask():
+            attention_mask = None
+        else:
             # [bs, seq_len]
-            attention_mask = paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool)
-
-        # 4d mask is passed through the layers
-        attention_mask = self._prepare_decoder_attention_mask(
-            attention_mask,
-            (batch_size, seq_length),
-            past_key_values_length,
-            inputs_embeds.dtype,
-        )
+            attention_mask = (
+                paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool)
+                if attention_mask is None
+                else attention_mask
+            )
+            attention_mask = self._prepare_decoder_attention_mask(
+                attention_mask, (batch_size, seq_length), past_key_values_length, inputs_embeds.dtype
+            )  # [bs, 1, seq_len, seq_len]
+            if self.config.use_flash_attention:
+                attention_mask = None if is_casual_mask(attention_mask) else attention_mask
 
         if self.config.sequence_parallel:
             # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim]
@@ -1547,21 +1566,23 @@ def forward(
             ):
                 layer_outputs = self.recompute_training_full(
                     decoder_layer,
-                    hidden_states,
-                    attention_mask,
-                    position_ids,
-                    past_key_value,
-                    output_attentions,
-                    use_cache,
+                    hidden_states=hidden_states,
+                    position_ids=position_ids,
+                    attention_mask=attention_mask,
+                    output_attentions=output_attentions,
+                    past_key_value=past_key_value,
+                    use_cache=use_cache,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 )
             else:
                 layer_outputs = decoder_layer(
-                    hidden_states,
-                    attention_mask=attention_mask,
+                    hidden_states=hidden_states,
                     position_ids=position_ids,
-                    past_key_value=past_key_value,
+                    attention_mask=attention_mask,
                     output_attentions=output_attentions,
+                    past_key_value=past_key_value,
                     use_cache=use_cache,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 )
 
             # NOTE: clear outdate cache after it has been used for memory saving
@@ -1595,14 +1616,14 @@ def forward(
         )
 
 
-class DeepSeekV2PretrainingCriterion(nn.Layer):
+class DeepseekV2PretrainingCriterion(nn.Layer):
     """
     Criterion for Mixtral.
     It calculates the final loss.
     """
 
     def __init__(self, config: DeepseekV2Config):
-        super(DeepSeekV2PretrainingCriterion, self).__init__()
+        super(DeepseekV2PretrainingCriterion, self).__init__()
         self.ignore_index = getattr(config, "ignore_index", -100)
         self.config = config
         self.enable_parallel_cross_entropy = config.tensor_parallel_degree > 1 and config.tensor_parallel_output
@@ -1624,15 +1645,23 @@ def forward(self, prediction_scores, masked_lm_labels):
             masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2))
 
             # skip ignore_index which loss == 0
-            masked_lm_loss = masked_lm_loss[masked_lm_loss > 0]
-            loss = paddle.mean(masked_lm_loss)
+            # masked_lm_loss = masked_lm_loss[masked_lm_loss > 0]
+            # loss = paddle.mean(masked_lm_loss)
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
 
         return loss
 
 
-class DeepSeekV2LMHead(nn.Layer):
+class DeepseekV2LMHead(nn.Layer):
     def __init__(self, config: DeepseekV2Config):
-        super().__init__()
+        super(DeepseekV2LMHead, self).__init__()
 
         self.config = config
         if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0:
@@ -1669,8 +1698,8 @@ def __init__(self, config: DeepseekV2Config):
         self.config = config
         self.deepseek_v2 = DeepseekV2Model(config)
         self.vocab_size = config.vocab_size
-        self.lm_head = DeepSeekV2LMHead(config)
-        self.criterion = DeepSeekV2PretrainingCriterion(config)
+        self.lm_head = DeepseekV2LMHead(config)
+        self.criterion = DeepseekV2PretrainingCriterion(config)
 
     def get_input_embeddings(self):
         return self.deepseek_v2.embed_tokens
@@ -1693,15 +1722,16 @@ def get_decoder(self):
     def forward(
         self,
         input_ids: paddle.Tensor = None,
-        attention_mask: Optional[paddle.Tensor] = None,
         position_ids: Optional[paddle.Tensor] = None,
-        past_key_values: Optional[List[paddle.Tensor]] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
         inputs_embeds: Optional[paddle.Tensor] = None,
         labels: Optional[paddle.Tensor] = None,
         use_cache: Optional[bool] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
+        attn_mask_startend_row_indices=None,
     ) -> Union[Tuple, CausalLMOutputWithPast]:
         r"""
         Args:
@@ -1734,26 +1764,57 @@ def forward(
         )
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
+        if attn_mask_startend_row_indices is not None and attention_mask is not None:
+            logger.warning(
+                "You have provided both attn_mask_startend_row_indices and attention_mask. "
+                "The attn_mask_startend_row_indices will be used."
+            )
+            attention_mask = None
+
         # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
         outputs = self.deepseek_v2(
             input_ids=input_ids,
-            attention_mask=attention_mask,
             position_ids=position_ids,
-            past_key_values=past_key_values,
+            attention_mask=attention_mask,
             inputs_embeds=inputs_embeds,
             use_cache=use_cache,
+            past_key_values=past_key_values,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
             return_dict=return_dict,
+            attn_mask_startend_row_indices=attn_mask_startend_row_indices,
         )
 
         hidden_states = outputs[0]
-        logits = self.lm_head(hidden_states)
 
-        loss = None
-        # TODO@DrownFish19: shift labels
-        if labels is not None:
-            loss = self.criterion(logits, labels)
+        if labels is not None and self.config.use_fused_linear_cross_entropy:
+            from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy
+
+            assert (
+                self.config.tensor_parallel_degree <= 1
+            ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel "
+
+            masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels)
+
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
+            logits = None
+        else:
+            # if labels is None，means we need full output, instead of tensor_parallel_output
+            # tensor_parallel_output is together with ParallelCrossEntropy
+            tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1
+
+            logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+
+            loss = None
+            if labels is not None:
+                loss = self.criterion(logits, labels)
 
         if not return_dict:
             output = (logits,) + outputs[1:]
diff --git a/paddlenlp/transformers/deepseek_v2/modeling_auto.py b/paddlenlp/transformers/deepseek_v2/modeling_auto.py
new file mode 100644
index 000000000000..284b12a29cb8
--- /dev/null
+++ b/paddlenlp/transformers/deepseek_v2/modeling_auto.py
@@ -0,0 +1,994 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+# Copyright 2023 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Paddle DeepSeek_V2 model."""
+
+from __future__ import annotations
+
+import warnings
+from typing import List, Optional, Tuple, Union
+
+import paddle
+import paddle.nn.functional as F
+from paddle import Tensor, nn
+from paddle.distributed.fleet.utils import recompute
+from paddle.nn import Linear
+
+try:
+    from paddle.incubate.nn.functional import fused_rotary_position_embedding
+except ImportError:
+    fused_rotary_position_embedding = None
+
+try:
+    from paddle.nn.functional.flash_attention import flash_attention
+except:
+    flash_attention = None
+
+import paddle.distributed as dist
+
+from ...utils.log import logger
+from ...utils.tools import get_env_device
+from ..activations import ACT2FN
+from ..llama import fusion_ops
+from ..llama.modeling import get_use_casual_mask
+from ..model_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from ..model_utils import PretrainedModel, register_base_model
+from ..moe_layer import MoELayer
+from .configuration import DeepseekV2Config
+from .modeling import (
+    AddAuxiliaryLoss,
+    DeepseekV2DynamicNTKScalingRotaryEmbedding,
+    DeepseekV2LinearScalingRotaryEmbedding,
+    DeepseekV2PretrainingCriterion,
+    DeepseekV2RMSNorm,
+    DeepseekV2RotaryEmbedding,
+    DeepseekV2YarnRotaryEmbedding,
+    MoEGate,
+    _expand_2d_mask,
+    _make_causal_mask,
+    apply_rotary_pos_emb,
+    get_triangle_upper_mask,
+    is_casual_mask,
+    yarn_get_mscale,
+)
+
+__all__ = [
+    "DeepseekV2LMHeadAuto",
+    "DeepseekV2ForCausalLMAuto",
+    "DeepseekV2ModelAuto",
+    "DeepseekV2PretrainedModelAuto",
+]
+
+
+def scaled_dot_product_attention(
+    query_states,
+    config,
+    key_states,
+    value_states,
+    attention_mask,
+    output_attentions,
+    attn_mask_startend_row_indices=None,
+    softmax_scale=1.0,
+    training=True,
+    sequence_parallel=False,
+):
+    bsz, q_len, num_heads, head_dim = query_states.shape
+    _, kv_seq_len, v_num_heads, v_head_dim = value_states.shape
+
+    if config.use_flash_attention and flash_attention:
+        # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim]
+        # Torch Flash Attention input [ bz, nhead, seqlen, head_dim]
+
+        # Note: Flash Attention does not support softmax_scale, so we need to scale the query_states
+        q_head_dim = query_states.shape[-1]
+        softmax_scale = softmax_scale * (q_head_dim**0.5)
+        query_states = query_states * softmax_scale
+        value_padding = paddle.zeros(
+            [bsz, kv_seq_len, v_num_heads, head_dim - v_head_dim],
+            dtype=value_states.dtype,
+        )
+        value_states = paddle.concat([value_states, value_padding], axis=-1)
+
+        outputs = fusion_ops.fusion_flash_attention(
+            query_states,
+            config,
+            key_states,
+            value_states,
+            attention_mask,
+            output_attentions,
+            attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+            sequence_parallel=False,
+        )
+
+        if isinstance(outputs, tuple):
+            outputs[0] = outputs[0].reshape([bsz, q_len, v_num_heads, head_dim])
+            outputs[0] = outputs[0][..., :v_head_dim]
+            outputs[0] = outputs[0].reshape([bsz, q_len, -1])
+        else:
+            outputs = outputs.reshape([bsz, q_len, v_num_heads, head_dim])
+            outputs = outputs[..., :v_head_dim]
+            outputs = outputs.reshape([bsz, q_len, -1])
+        return outputs
+
+    else:
+        #  [ bz, seqlen, nhead, head_dim] -> [bs, nhead, seq_len, head_dim]
+        query_states = paddle.transpose(query_states, [0, 2, 1, 3])
+        # merge with the next transpose
+        key_states = paddle.transpose(key_states, [0, 2, 1, 3])
+        value_states = paddle.transpose(value_states, [0, 2, 1, 3])
+
+        # matmul and divide by sqrt(head_dim)
+        attn_weights = paddle.matmul(query_states * softmax_scale, key_states.transpose([0, 1, 3, 2]))
+
+        if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]:
+            raise ValueError(
+                f"Attention weights should be of shape {(bsz, num_heads, q_len, kv_seq_len)}, but is"
+                f" {attn_weights.shape}"
+            )
+
+        if attention_mask is None:
+            attention_mask = get_triangle_upper_mask(attn_weights)
+        attention_mask = attention_mask.reshape([bsz, 1, q_len, kv_seq_len])
+        if attention_mask.shape != [bsz, 1, q_len, kv_seq_len]:
+            raise ValueError(
+                f"Attention mask should be of shape {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}"
+            )
+
+        attn_weights = attn_weights + attention_mask
+        if not paddle.in_dynamic_mode():
+            attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype)
+        else:
+            with paddle.amp.auto_cast(False):
+                attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype)
+
+        attn_weights = F.dropout(attn_weights, p=config.attention_dropout, training=training)
+
+        attn_output = paddle.matmul(attn_weights.astype("float32"), value_states.astype("float32"))
+        attn_output = attn_output.transpose([0, 2, 1, 3])
+
+        if sequence_parallel:
+            attn_output = attn_output.reshape([bsz * q_len, v_head_dim * num_heads])
+        else:
+            attn_output = attn_output.reshape([bsz, q_len, v_head_dim * num_heads])
+        return (attn_output, attn_weights) if output_attentions else attn_output
+
+
+class DeepseekV2MLPAuto(nn.Layer):
+    def __init__(self, config: DeepseekV2Config, hidden_size=None, intermediate_size=None):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
+        self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size
+
+        self.gate_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False)
+        self.up_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False)
+        self.down_proj = Linear(self.intermediate_size, self.hidden_size, bias_attr=False)
+
+        self.act_fn = ACT2FN[config.hidden_act]
+
+    def forward(self, x):
+        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+
+
+class DeepseekV2MoEAuto(MoELayer):
+    """
+    A mixed expert module containing shared experts.
+    """
+
+    def __init__(self, config: DeepseekV2Config):
+        gate = MoEGate(
+            config=config,
+            num_experts=config.n_routed_experts,
+            expert_hidden_size=config.hidden_size,
+            top_k=config.num_experts_per_tok,
+            topk_method=config.topk_method,
+            n_group=config.n_group,
+            topk_group=config.topk_group,
+            norm_topk_prob=config.norm_topk_prob,
+            routed_scaling_factor=config.routed_scaling_factor,
+            drop_tokens=False,
+        )
+
+        super().__init__(
+            config=config,
+            moe_num_experts=config.n_routed_experts,
+            expert_class=DeepseekV2MLPAuto,
+            expert_kwargs={"config": config, "intermediate_size": config.moe_intermediate_size},
+            gate=gate,
+            capacity=2.0,
+        )
+        self.alpha = config.aux_loss_alpha
+        if config.n_shared_experts is not None:
+            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
+            self.shared_experts = DeepseekV2MLPAuto(config=config, intermediate_size=intermediate_size)
+
+    def forward(self, hidden_states):
+        final_hidden_states, l_aux, l_zloss = super().forward(hidden_states)
+        if self.training and self.alpha > 0.0:
+            final_hidden_states = AddAuxiliaryLoss.apply(final_hidden_states, l_aux)
+
+        if self.config.n_shared_experts is not None:
+            shared_expert_output = self.shared_experts(hidden_states)
+            final_hidden_states = final_hidden_states + shared_expert_output
+        return final_hidden_states
+
+
+# Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->DeepseekV2
+class DeepseekV2AttentionAuto(nn.Layer):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config: DeepseekV2Config, layerwise_recompute: bool = False):
+        super().__init__()
+        self.config = config
+        self.attention_dropout = config.attention_dropout
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+
+        self.max_position_embeddings = config.max_position_embeddings
+        self.rope_theta = config.rope_theta
+        self.q_lora_rank = config.q_lora_rank
+        self.qk_rope_head_dim = config.qk_rope_head_dim
+        self.kv_lora_rank = config.kv_lora_rank
+        self.v_head_dim = config.v_head_dim
+        self.qk_nope_head_dim = config.qk_nope_head_dim
+        self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim
+
+        self.is_causal = True
+
+        self.seq_length = config.seq_length
+
+        # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True
+        # Enable_recompute defaults to False and is controlled by Trainer
+        self.enable_recompute = False
+        self.layerwise_recompute = layerwise_recompute
+        self.recompute_granularity = config.recompute_granularity
+
+        # Note (@DrownFish19): For tensor parallel we consider that q_a_proj and kv_a_proj_with_mqa
+        # are the small weight and cannot achieve performance gain. So we use the original
+        # linear layers. We use the tensor parallel linear layers for q_proj，q_b_proj and kv_b_proj
+        # for which are the large weight and can achieve performance gain.
+
+        # fmt: off
+        # for without tensor parallel
+        if self.q_lora_rank is None:
+            self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.q_head_dim, bias_attr=False)
+        else:
+            self.q_a_proj = nn.Linear(self.hidden_size, config.q_lora_rank, bias_attr=config.attention_bias)
+            self.q_a_layernorm = DeepseekV2RMSNorm(config=config, hidden_size=config.q_lora_rank)
+            self.q_b_proj = nn.Linear(config.q_lora_rank, self.num_heads * self.q_head_dim, bias_attr=False)
+
+        self.kv_a_proj_with_mqa = nn.Linear(self.hidden_size, config.kv_lora_rank + config.qk_rope_head_dim, bias_attr=config.attention_bias)
+        self.kv_a_layernorm = DeepseekV2RMSNorm(config=config, hidden_size=config.kv_lora_rank)
+        self.kv_b_proj = nn.Linear(config.kv_lora_rank, self.num_heads * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim), bias_attr=False)
+
+        self.o_proj = nn.Linear(self.num_heads * self.v_head_dim, self.hidden_size, bias_attr=config.attention_bias)
+        # fmt: on
+
+        self._init_rope()
+
+        self.softmax_scale = self.q_head_dim ** (-0.5)
+        if self.config.rope_scaling is not None:
+            mscale_all_dim = self.config.rope_scaling.get("mscale_all_dim", 0)
+            scaling_factor = self.config.rope_scaling["factor"]
+            if mscale_all_dim:
+                mscale = yarn_get_mscale(scaling_factor, mscale_all_dim)
+                self.softmax_scale = self.softmax_scale * mscale * mscale
+
+        self.attn_func = scaled_dot_product_attention
+
+    def _init_rope(self):
+        if self.config.rope_scaling is None:
+            self.rotary_emb = DeepseekV2RotaryEmbedding(
+                self.qk_rope_head_dim,
+                max_position_embeddings=self.max_position_embeddings,
+                base=self.rope_theta,
+            )
+        else:
+            scaling_type = self.config.rope_scaling["type"]
+            scaling_factor = self.config.rope_scaling["factor"]
+            if scaling_type == "linear":
+                self.rotary_emb = DeepseekV2LinearScalingRotaryEmbedding(
+                    self.qk_rope_head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            elif scaling_type == "dynamic":
+                self.rotary_emb = DeepseekV2DynamicNTKScalingRotaryEmbedding(
+                    self.qk_rope_head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            elif scaling_type == "yarn":
+                kwargs = {
+                    key: self.config.rope_scaling[key]
+                    for key in [
+                        "original_max_position_embeddings",
+                        "beta_fast",
+                        "beta_slow",
+                        "mscale",
+                        "mscale_all_dim",
+                    ]
+                    if key in self.config.rope_scaling
+                }
+                self.rotary_emb = DeepseekV2YarnRotaryEmbedding(
+                    self.qk_rope_head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                    **kwargs,
+                )
+            else:
+                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+
+    def _shape(self, tensor: paddle.Tensor, seq_len: int, bsz: int):
+        return tensor.reshape([bsz, seq_len, self.num_heads, self.v_head_dim]).transpose([1, 0, 2, 3])
+
+    def forward(
+        self,
+        hidden_states: paddle.Tensor,
+        position_ids: Optional[Tuple[paddle.Tensor]] = None,
+        past_key_value: Optional[Tuple[paddle.Tensor]] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        attn_mask_startend_row_indices: Optional[paddle.Tensor] = None,
+        **kwargs,
+    ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]:
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+            )
+        bsz, q_len, _ = hidden_states.shape
+
+        # DeepSeekV2 q_lora_rank=1536
+        # DeepSeekV2-lite q_lora_rank=None
+        if self.q_lora_rank is None:
+            q = self.q_proj(hidden_states)
+        else:
+            q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
+        q = q.reshape([bsz, q_len, self.num_heads, self.q_head_dim])
+        q_nope, q_pe = paddle.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], axis=-1)
+
+        # DeepSeekV2 kv_lora_rank+qk_rope_head_dim=512+64
+        compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
+        compressed_kv, k_pe = paddle.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], axis=-1)
+        k_pe = k_pe.reshape([bsz, q_len, 1, self.qk_rope_head_dim])
+
+        # self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim = 128+64
+        # self.num_heads * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim) = config.qk_nope_head_dim + self.v_head_dim = 128+128
+        kv = self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).reshape(
+            [bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim]
+        )
+
+        k_nope, value_states = paddle.split(kv, [self.qk_nope_head_dim, self.v_head_dim], axis=-1)
+        kv_seq_len = value_states.shape[1]
+        if past_key_value is not None:
+            kv_seq_len += past_key_value[0].shape[-3]
+        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+        cos = cos[None, :, None, :]
+        sin = sin[None, :, None, :]
+        q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
+
+        query_states = paddle.empty([bsz, q_len, self.num_heads, self.q_head_dim], dtype=self.config.dtype)
+        query_states = paddle.concat([q_nope, q_pe], axis=-1)
+        # query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
+        # query_states[:, :, :, self.qk_nope_head_dim :] = q_pe
+
+        key_states = paddle.empty([bsz, q_len, self.num_heads, self.q_head_dim], dtype=self.config.dtype)
+        # input[0]'s shape = [1, 2048, 16, 128], input[1]'s shape = [1, 2048, 1, 64].
+        key_states = paddle.concat([k_nope, k_pe.expand([bsz, q_len, self.num_heads, k_pe.shape[-1]])], axis=-1)
+
+        # key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
+        # key_states[:, :, :, self.qk_nope_head_dim :] = k_pe
+
+        # [bs, seq_len, num_head, head_dim]
+        if past_key_value is not None:
+            # reuse k, v, self_attention
+            key_states = paddle.concat([past_key_value[0], key_states], axis=1)
+            value_states = paddle.concat([past_key_value[1], value_states], axis=1)
+        past_key_value = (key_states, value_states) if use_cache else None
+
+        has_gradient = not (query_states.stop_gradient and key_states.stop_gradient and value_states.stop_gradient)
+        if (
+            self.enable_recompute
+            and self.layerwise_recompute
+            and has_gradient
+            and self.recompute_granularity == "core_attn"
+        ):
+            outputs = recompute(
+                self.attn_func,
+                query_states,
+                self.config,
+                key_states,
+                value_states,
+                attention_mask,
+                output_attentions,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                softmax_scale=self.softmax_scale,
+                training=self.training,
+                use_reentrant=self.config.recompute_use_reentrant,
+            )
+        else:
+            outputs = self.attn_func(
+                query_states,
+                self.config,
+                key_states,
+                value_states,
+                attention_mask,
+                output_attentions,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                softmax_scale=self.softmax_scale,
+                training=self.training,
+            )
+        if output_attentions:
+            attn_output, attn_weights = outputs
+        else:
+            attn_output = outputs
+
+        # if sequence_parallel is true, out shape are [q_len / n, bs, num_head * head_dim]
+        # else their shape are [bs, q_len, num_head * head_dim], n is mp parallelism.
+        attn_output = self.o_proj(attn_output)
+
+        if not output_attentions:
+            attn_weights = None
+
+        return attn_output, attn_weights, past_key_value
+
+
+class DeepseekV2DecoderLayerAuto(nn.Layer):
+    def __init__(self, config: DeepseekV2Config, layer_idx: int, layerwise_recompute: bool = False):
+        super().__init__()
+        self.config = config
+
+        self.enable_recompute = False
+        self.layerwise_recompute = layerwise_recompute
+        self.recompute_granularity = config.recompute_granularity
+
+        self.hidden_size = config.hidden_size
+
+        self.self_attn = DeepseekV2AttentionAuto(config=config, layerwise_recompute=layerwise_recompute)
+
+        self.mlp = (
+            DeepseekV2MoEAuto(config)
+            if (
+                config.n_routed_experts is not None
+                and layer_idx >= config.first_k_dense_replace
+                and layer_idx % config.moe_layer_freq == 0
+            )
+            else DeepseekV2MLPAuto(config)
+        )
+        self.input_layernorm = DeepseekV2RMSNorm(config)
+        self.post_attention_layernorm = DeepseekV2RMSNorm(config)
+
+    def forward(
+        self,
+        hidden_states: paddle.Tensor,
+        position_ids: Optional[paddle.Tensor] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        output_attentions: Optional[bool] = False,
+        past_key_value: Optional[Tuple[paddle.Tensor]] = None,
+        use_cache: Optional[bool] = False,
+        attn_mask_startend_row_indices: Optional[paddle.Tensor] = None,
+        **kwargs,
+    ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]:
+        """
+        Args:
+            hidden_states (`paddle.Tensor`): input to the layer of shape `(batch, seq_len, embed_axis)`
+            attention_mask (`paddle.Tensor`, *optional*):
+                attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
+                query_sequence_length, key_sequence_length)` if default attention is used.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            past_key_value (`Tuple(paddle.Tensor)`, *optional*): cached past key and value projection states
+        """
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+            )
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Self Attention
+        has_gradient = not hidden_states.stop_gradient
+        if (
+            self.enable_recompute
+            and self.layerwise_recompute
+            and has_gradient
+            and self.recompute_granularity == "full_attn"
+        ):
+            hidden_states, self_attn_weights, present_key_value = recompute(
+                self.self_attn,
+                hidden_states=hidden_states,
+                position_ids=position_ids,
+                attention_mask=attention_mask,
+                output_attentions=output_attentions,
+                past_key_value=past_key_value,
+                use_cache=use_cache,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                **kwargs,
+            )
+        else:
+            hidden_states, self_attn_weights, present_key_value = self.self_attn(
+                hidden_states=hidden_states,
+                position_ids=position_ids,
+                attention_mask=attention_mask,
+                output_attentions=output_attentions,
+                past_key_value=past_key_value,
+                use_cache=use_cache,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                **kwargs,
+            )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)
+
+        if use_cache:
+            outputs += (present_key_value,)
+
+        if type(outputs) is tuple and len(outputs) == 1:
+            outputs = outputs[0]
+
+        return outputs
+
+
+class DeepseekV2PretrainedModelAuto(PretrainedModel):
+    config_class = DeepseekV2Config
+    base_model_prefix = "deepseek_v2"
+    _no_split_modules = ["DeepseekV2DecoderLayerAuto"]
+
+
+@register_base_model
+class DeepseekV2ModelAuto(DeepseekV2PretrainedModelAuto):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`DeepseekV2DecoderLayerAuto`]
+
+    Args:
+        config: DeepseekV2Config
+    """
+
+    def __init__(self, config: DeepseekV2Config):
+        super().__init__(config)
+
+        self.config = config
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        # Recompute defaults to False and is controlled by Trainer
+        self.enable_recompute = False
+        self.recompute_granularity = config.recompute_granularity
+        self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else []
+
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+
+        self.layers = nn.LayerList(
+            [
+                DeepseekV2DecoderLayerAuto(config, layer_idx, layer_idx not in self.no_recompute_layers)
+                for layer_idx in range(config.num_hidden_layers)
+            ]
+        )
+        self.norm = DeepseekV2RMSNorm(config)
+
+        self.enable_recompute = False
+
+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    @staticmethod
+    def _prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values_length, dtype):
+        if attention_mask is not None:
+            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+            if len(attention_mask.shape) == 2:
+                expanded_attn_mask = _expand_2d_mask(attention_mask, dtype, tgt_length=input_shape[-1])
+                # For decoding phase in generation, seq_length = 1, we don't need to add causal mask
+                if input_shape[-1] > 1:
+                    combined_attention_mask = _make_causal_mask(
+                        input_shape,
+                        past_key_values_length=past_key_values_length,
+                    )
+                    expanded_attn_mask = expanded_attn_mask & combined_attention_mask
+            # [bsz, seq_len, seq_len] -> [bsz, 1, seq_len, seq_len]
+            elif len(attention_mask.shape) == 3:
+                expanded_attn_mask = attention_mask.unsqueeze(1).astype("bool")
+            # if attention_mask is already 4-D, do nothing
+            else:
+                expanded_attn_mask = attention_mask
+        else:
+            expanded_attn_mask = _make_causal_mask(
+                input_shape,
+                past_key_values_length=past_key_values_length,
+            )
+        # Convert bool attention_mask to float attention mask, which will be added to attention_scores later
+        if get_env_device() == "xpu":
+            x = paddle.to_tensor(0.0, dtype="float32")
+            y = paddle.to_tensor(-1.7005809656952787e38, dtype="float32")
+            expanded_attn_mask = paddle.where(expanded_attn_mask, x, y)
+        else:
+            expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), 0.0, paddle.finfo(dtype).min).astype(
+                dtype
+            )
+        return expanded_attn_mask
+
+    def forward(
+        self,
+        input_ids: paddle.Tensor = None,
+        position_ids: Optional[paddle.Tensor] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        inputs_embeds: Optional[paddle.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        attn_mask_startend_row_indices: Optional[Tensor] = None,
+        **kwargs,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # retrieve input_ids and inputs_embeds
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+        elif input_ids is not None:
+            batch_size, seq_length = input_ids.shape[:2]
+        elif inputs_embeds is not None:
+            batch_size, seq_length = inputs_embeds.shape[:2]
+        else:
+            raise ValueError("You have to specify either input_ids or inputs_embeds")
+
+        if self.enable_recompute and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`transformers."
+                )
+                use_cache = False
+
+        if past_key_values is None:
+            past_key_values = tuple([None] * len(self.layers))
+        # NOTE: to make cache can be clear in-time
+        past_key_values = list(past_key_values)
+
+        seq_length_with_past = seq_length
+        past_key_values_length = 0
+        if past_key_values[0] is not None:
+            past_key_values_length = past_key_values[0][0].shape[1]
+            seq_length_with_past += past_key_values_length
+
+        if position_ids is None:
+            position_ids = paddle.arange(
+                past_key_values_length, seq_length + past_key_values_length, dtype=paddle.int64
+            )
+            position_ids = position_ids.unsqueeze(0)
+
+        if inputs_embeds is None:
+            # [bs, seq_len, dim]
+            inputs_embeds = self.embed_tokens(input_ids)
+
+        # embed positions
+        if attn_mask_startend_row_indices is not None or get_use_casual_mask():
+            attention_mask = None
+        else:
+            # [bs, seq_len]
+            attention_mask = (
+                paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool)
+                if attention_mask is None
+                else attention_mask
+            )
+            attention_mask = self._prepare_decoder_attention_mask(
+                attention_mask, (batch_size, seq_length), past_key_values_length, inputs_embeds.dtype
+            )  # [bs, 1, seq_len, seq_len]
+            if self.config.use_flash_attention:
+                attention_mask = None if is_casual_mask(attention_mask) else attention_mask
+
+        # embed positions
+        hidden_states = inputs_embeds
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        next_decoder_cache = () if use_cache else None
+
+        for idx, (decoder_layer) in enumerate(self.layers):
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+            past_key_value = past_key_values[idx] if past_key_values is not None else None
+
+            layer_outputs = decoder_layer(
+                hidden_states=hidden_states,
+                position_ids=position_ids,
+                attention_mask=attention_mask,
+                output_attentions=output_attentions,
+                past_key_value=past_key_value,
+                use_cache=use_cache,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+            )
+
+            # NOTE: clear outdate cache after it has been used for memory saving
+            past_key_value = past_key_values[idx] = None
+            if type(layer_outputs) is tuple:
+                hidden_states = layer_outputs[0]
+            else:
+                hidden_states = layer_outputs
+
+            if use_cache:
+                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
+
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+
+        hidden_states = self.norm(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        next_cache = next_decoder_cache if use_cache else None
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=next_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+
+
+class DeepseekV2LMHeadAuto(nn.Layer):
+    def __init__(self, config: DeepseekV2Config):
+        super(DeepseekV2LMHeadAuto, self).__init__()
+
+        self.config = config
+
+        self.weight = self.create_parameter(
+            shape=[config.hidden_size, config.vocab_size],
+            dtype=paddle.get_default_dtype(),
+            default_initializer=nn.initializer.XavierNormal(1.0),
+        )
+
+    def forward(self, hidden_states, tensor_parallel_output=None):
+        if tensor_parallel_output is None:
+            tensor_parallel_output = self.config.tensor_parallel_output
+        logits = paddle.matmul(hidden_states, self.weight)
+        return logits
+
+
+class DeepseekV2ForCausalLMAuto(DeepseekV2PretrainedModelAuto):
+    _tied_weights_keys = ["lm_head.weight"]
+
+    def __init__(self, config: DeepseekV2Config):
+        super().__init__(config)
+        self.config = config
+        self.deepseek_v2 = DeepseekV2ModelAuto(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = DeepseekV2LMHeadAuto(config)
+        self.criterion = DeepseekV2PretrainingCriterion(config)
+
+    def get_input_embeddings(self):
+        return self.deepseek_v2.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.deepseek_v2.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.deepseek_v2 = decoder
+
+    def get_decoder(self):
+        return self.deepseek_v2
+
+    def forward(
+        self,
+        input_ids: paddle.Tensor = None,
+        position_ids: Optional[paddle.Tensor] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        inputs_embeds: Optional[paddle.Tensor] = None,
+        labels: Optional[paddle.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        attn_mask_startend_row_indices=None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the masked language modeling loss. Indices should either be in `[0, transformers.,
+                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, transformers., config.vocab_size]`.
+
+        Returns:
+
+        Example:
+
+        ```python
+        >>> from transformers import AutoTokenizer, DeepseekV2ForCausalLMAuto
+
+        >>> model = DeepseekV2ForCausalLMAuto.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
+
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        input_ids.stop_gradient = True
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if attn_mask_startend_row_indices is not None and attention_mask is not None:
+            logger.warning(
+                "You have provided both attn_mask_startend_row_indices and attention_mask. "
+                "The attn_mask_startend_row_indices will be used."
+            )
+            attention_mask = None
+
+        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.deepseek_v2(
+            input_ids=input_ids,
+            position_ids=position_ids,
+            attention_mask=attention_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            past_key_values=past_key_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+        )
+
+        hidden_states = outputs[0]
+
+        # if labels is None，means we need full output, instead of tensor_parallel_output
+        # tensor_parallel_output is together with ParallelCrossEntropy
+        tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1
+
+        logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+
+        return logits
+
+    def prepare_inputs_for_generation(
+        self, input_ids, use_cache=False, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+    ):
+        batch_size, seq_length = input_ids.shape
+        position_ids = kwargs.get("position_ids", paddle.arange(seq_length).expand((batch_size, seq_length)))
+        if past_key_values:
+            input_ids = input_ids[:, -1].unsqueeze(axis=-1)
+            position_ids = position_ids[:, -1].unsqueeze(-1)
+
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids}
+
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": use_cache,
+                "attention_mask": attention_mask,
+            }
+        )
+        return model_inputs
+
+    def _get_model_inputs_spec(self, dtype: str):
+        return {
+            "input_ids": paddle.static.InputSpec(shape=[None, None], dtype="int64"),
+            "attention_mask": paddle.static.InputSpec(shape=[None, None], dtype="int64"),
+            "position_ids": paddle.static.InputSpec(shape=[None, None], dtype="int64"),
+        }
+
+    @staticmethod
+    def update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False):
+        # update cache
+        if isinstance(outputs, tuple) and len(outputs) > 1 and not isinstance(outputs[1], paddle.Tensor):
+            model_kwargs["past_key_values"] = outputs[1]
+
+        if isinstance(outputs, CausalLMOutputWithPast) and "past_key_values" in outputs:
+            model_kwargs["past_key_values"] = outputs.past_key_values
+
+        # update position_ids
+        if "position_ids" in model_kwargs and model_kwargs["position_ids"] is not None:
+            position_ids = model_kwargs["position_ids"]
+            model_kwargs["position_ids"] = paddle.concat([position_ids, position_ids[..., -1:] + 1], axis=-1)
+
+        if not is_encoder_decoder and "attention_mask" in model_kwargs:
+            # TODO: support attention mask for other models
+            attention_mask = model_kwargs["attention_mask"]
+            if len(attention_mask.shape) == 2:
+                model_kwargs["attention_mask"] = paddle.concat(
+                    [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype=attention_mask.dtype)],
+                    axis=-1,
+                )
+            elif len(attention_mask.shape) == 4:
+                model_kwargs["attention_mask"] = paddle.concat(
+                    [attention_mask, paddle.ones([*attention_mask.shape[:3], 1], dtype=attention_mask.dtype)],
+                    axis=-1,
+                )[:, :, -1:, :]
+
+        return model_kwargs
+
+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
+        return reordered_past
+
+    def auto_dist_config(self, prefix=""):
+        if prefix != "":
+            assert prefix.endswith(".")
+        config = {
+            "dp_config": {"sharding_level": 1, "offload": False, "exclude_layer": None},
+            "mp_config": {
+                "parallelize_plan": {
+                    f"{prefix}deepseek_v2.embed_tokens": dist.ColWiseParallel(gather_output=True),
+                    f"{prefix}deepseek_v2.layers.*.self_attn.q_b_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.self_attn.q_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.self_attn.kv_b_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.self_attn.o_proj": dist.RowWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.mlp.gate_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.mlp.up_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.mlp.down_proj": dist.RowWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.mlp.shared_experts.gate_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.mlp.shared_experts.up_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v2.layers.*.mlp.shared_experts.down_proj": dist.RowWiseParallel(),
+                    f"{prefix}lm_head.weight": dist.ColWiseParallel(),
+                }
+            },
+        }
+        return config
diff --git a/paddlenlp/transformers/deepseek_v2/modeling_pp.py b/paddlenlp/transformers/deepseek_v2/modeling_pp.py
new file mode 100644
index 000000000000..d6eec969926e
--- /dev/null
+++ b/paddlenlp/transformers/deepseek_v2/modeling_pp.py
@@ -0,0 +1,358 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from typing import OrderedDict
+
+import paddle
+import paddle.distributed.fleet as fleet
+import paddle.nn as nn
+from paddle.distributed.fleet.meta_parallel import (
+    LayerDesc,
+    PipelineLayer,
+    SharedLayerDesc,
+)
+from paddle.distributed.fleet.recompute.recompute import recompute
+
+from ...utils.tools import get_env_device
+from ..model_utils import PipelinePretrainedModel
+from .modeling import (
+    DeepseekV2Config,
+    DeepseekV2DecoderLayer,
+    DeepseekV2LMHead,
+    DeepseekV2Model,
+    DeepseekV2PretrainedModel,
+    DeepseekV2PretrainingCriterion,
+    DeepseekV2RMSNorm,
+)
+
+__all__ = [
+    "DeepseekV2ForCausalLMPipe",
+]
+
+
+def parse_args(args):
+    if isinstance(args, tuple):
+        if len(args) == 4:
+            hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = args
+        elif len(args) == 3:
+            hidden_states, attention_mask, attn_mask_startend_row_indices = args
+            position_ids = None
+        elif len(args) == 2:
+            hidden_states, attention_mask = args
+            attn_mask_startend_row_indices, position_ids = None, None
+    else:
+        hidden_states = args
+        attention_mask, attn_mask_startend_row_indices, position_ids = None, None, None
+
+    if position_ids is not None:
+        position_ids.stop_gradient = True
+
+    if attention_mask is not None:
+        attention_mask.stop_gradient = True
+
+    if attn_mask_startend_row_indices is not None:
+        attn_mask_startend_row_indices.stop_gradient = True
+
+    return hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids
+
+
+def return_args(hidden_states, attention_mask=None, attn_mask_startend_row_indices=None, position_ids=None):
+    ret = (hidden_states,)
+
+    if attention_mask is not None:
+        ret += (attention_mask.clone(),)
+    if attn_mask_startend_row_indices is not None:
+        ret += (attn_mask_startend_row_indices.clone(),)
+    if position_ids is not None:
+        ret += (position_ids.clone(),)
+    if len(ret) == 1:
+        ret = ret[0]
+
+    return ret
+
+
+def get_attr(layer, name):
+    if getattr(layer, name, None) is not None:
+        return getattr(layer, name, None)
+    else:
+        return get_attr(layer._layer, name)
+
+
+class DeepseekV2EmbeddingPipe(nn.Layer):
+    def __init__(self, config: DeepseekV2Config):
+        super(DeepseekV2EmbeddingPipe, self).__init__()
+        self.config = config
+        self.sequence_parallel = config.sequence_parallel
+        self.hidden_size = config.hidden_size
+        if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0:
+            self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding(
+                config.vocab_size,
+                config.hidden_size,
+                weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()),
+            )
+        else:
+            self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
+
+    @property
+    def embedding_weight(self):
+        return get_attr(self.embed_tokens, "weight")
+
+    def forward(self, args):
+        """_summary_
+
+        Args:
+            input (_type_): _description_
+
+        Returns:
+            _type_: _description_
+        """
+        input_ids, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args)
+        input_embeds = self.embed_tokens(input_ids)
+        if self.config.sequence_parallel:
+            from paddlenlp.transformers import ScatterOp
+
+            # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim]
+            bs, seq_len, hidden_size = input_embeds.shape
+            input_embeds = paddle.reshape_(input_embeds, [bs * seq_len, hidden_size])
+            # [seq_len * bs / n, num_head * head_dim] (n is mp parallelism)
+            input_embeds = ScatterOp.apply(input_embeds)
+
+        batch_size, seq_length = input_ids.shape
+
+        if attention_mask is not None:
+            assert (
+                attn_mask_startend_row_indices is None
+            ), "attention_mask and attn_mask_startend_row_indices can not be set at same time"
+
+            attention_mask = DeepseekV2Model._prepare_decoder_attention_mask(
+                attention_mask, (batch_size, seq_length), 0, input_embeds.dtype
+            )
+            attention_mask.stop_gradient = True
+            if get_env_device() == "npu":
+                attention_mask = attention_mask.astype("bool")
+        elif get_env_device() == "npu":
+            attention_mask = paddle.tril(paddle.ones((seq_length, seq_length), dtype="bool"))
+            attention_mask.stop_gradient = True
+
+        return return_args(input_embeds, attention_mask, attn_mask_startend_row_indices, position_ids)
+
+
+class DeepseekV2DecoderLayerPipe(DeepseekV2DecoderLayer):
+    def forward(self, args):
+        hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args)
+
+        has_gradient = not hidden_states.stop_gradient
+
+        if attention_mask is not None and attention_mask.dtype == paddle.int32:
+            attention_mask, attn_mask_startend_row_indices, position_ids = (
+                None,
+                attention_mask,
+                attn_mask_startend_row_indices,
+            )
+        elif attention_mask is not None and attention_mask.dtype == paddle.int64:
+            attention_mask, attn_mask_startend_row_indices, position_ids = None, None, attention_mask
+        elif attn_mask_startend_row_indices is not None and attn_mask_startend_row_indices.dtype == paddle.int64:
+            attn_mask_startend_row_indices, position_ids = None, attn_mask_startend_row_indices
+
+        if self.enable_recompute and self.config.recompute_granularity == "full" and has_gradient:
+            if attention_mask is not None or attn_mask_startend_row_indices is not None:
+                hidden_states = recompute(
+                    super().forward,
+                    hidden_states,
+                    position_ids=position_ids,
+                    attention_mask=attention_mask,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                    use_reentrant=False,
+                )
+            else:
+                # for pretrain
+                hidden_states = recompute(
+                    super().forward,
+                    hidden_states,
+                    position_ids=position_ids,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                    use_reentrant=self.config.recompute_use_reentrant,
+                )
+        else:
+            hidden_states = super().forward(
+                hidden_states,
+                position_ids=position_ids,
+                attention_mask=attention_mask,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+            )
+
+        return return_args(hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids)
+
+
+class DeepseekV2RMSNormPipe(nn.Layer):
+    def __init__(self, config):
+        super().__init__()
+        self.norm = DeepseekV2RMSNorm(config)
+
+    def forward(self, args):
+        hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args)
+        return self.norm(hidden_states)
+
+
+class DeepseekV2LMHeadPipe(DeepseekV2LMHead):
+    def __init__(self, config):
+        super(DeepseekV2LMHeadPipe, self).__init__(config)
+
+    @property
+    def embedding_weight(self):
+        return get_attr(self, "weight")
+
+
+class DeepseekV2ForCausalLMPipe(PipelinePretrainedModel, PipelineLayer):
+    """DeepseekV2ForPretraining adapted for pipeline parallelism.
+
+    The largest change is flattening the DeepseekV2Model class so we can express it as a
+    sequence of layers including embedding, transformer layers, and output.
+    """
+
+    config_class = DeepseekV2Config
+    _base_model = DeepseekV2PretrainedModel
+    _get_tensor_parallel_mappings = DeepseekV2PretrainedModel._get_tensor_parallel_mappings
+    _init_weights = DeepseekV2PretrainedModel._init_weights
+    _keys_to_ignore_on_load_unexpected = DeepseekV2PretrainedModel._keys_to_ignore_on_load_unexpected
+    _get_model_flops = DeepseekV2PretrainedModel._get_model_flops
+    _get_hardware_flops = DeepseekV2PretrainedModel._get_hardware_flops
+
+    _tied_weights_keys = ["lm_head.weight"]
+
+    # DONOT Add base_model_prefix !!!!
+
+    @classmethod
+    def _prepare_pipeline_inputs_func(cls, inputs):
+        first_stage_keys = ["input_ids", "attention_mask", "attn_mask_startend_row_indices", "position_ids"]
+        last_stage_keys = ["labels"]
+
+        def get_expected_keys(inputs, keys):
+            ret = tuple([inputs.pop(k) if k in inputs else None for k in keys])
+            if len(ret) == 1:
+                ret = ret[0]
+            return ret
+
+        if type(inputs) is dict or type(inputs) is OrderedDict:
+            return [
+                get_expected_keys(inputs, first_stage_keys),
+                get_expected_keys(inputs, last_stage_keys),
+            ]
+
+        keys = list(inputs[0].keys())
+        inputs_batch = {key: [data.pop(key) for data in inputs] for key in keys}
+        return [
+            get_expected_keys(inputs_batch, first_stage_keys),
+            get_expected_keys(inputs_batch, last_stage_keys),
+        ]
+
+    def __init__(self, config: DeepseekV2Config):
+        self.config = config
+
+        # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True
+        # Enable_recompute defaults to False and is controlled by Trainer
+        self.enable_recompute = False
+        self.recompute_granularity = self.config.recompute_granularity
+        self.pp_recompute_interval = self.config.pp_recompute_interval
+        self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else []
+        if self.recompute_granularity == "full":
+            assert len(self.no_recompute_layers) == 0, "for pp with full recompute, no_recompute_layers is not support"
+
+        virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1)
+
+        def get_hcg():
+            return fleet.get_hybrid_communicate_group()
+
+        hcg = get_hcg()
+        tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1)
+        tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0)
+
+        # TODO: fix tensor_parallel_degree rewrite in here
+        config.tensor_parallel_degree = tensor_parallel_degree
+        config.tensor_parallel_rank = tensor_parallel_rank
+
+        if config.tie_word_embeddings:
+            self.add_sequential_layer(
+                SharedLayerDesc(
+                    "DeepseekV2_shared_weight",
+                    DeepseekV2EmbeddingPipe,
+                    shared_weight_attr="embedding_weight",
+                    config=config,
+                ),
+                self._base_model.base_model_prefix,
+            )
+        else:
+            self.add_sequential_layer(
+                LayerDesc(DeepseekV2EmbeddingPipe, config=config), self._base_model.base_model_prefix
+            )
+
+        for i in range(config.num_hidden_layers):
+            self.add_sequential_layer(
+                LayerDesc(
+                    DeepseekV2DecoderLayerPipe,
+                    config=config,
+                    layer_idx=i,
+                    layerwise_recompute=i not in self.no_recompute_layers,
+                ),
+                f"{self._base_model.base_model_prefix}.layers.{i}",
+            )
+        self.add_sequential_layer(LayerDesc(DeepseekV2RMSNormPipe, config=config), self._base_model.base_model_prefix)
+
+        if config.tie_word_embeddings:
+            self.add_sequential_layer(
+                SharedLayerDesc(
+                    "DeepseekV2_shared_weight",
+                    DeepseekV2LMHeadPipe,
+                    shared_weight_attr="embedding_weight",
+                    config=config,
+                    **{"transpose_y": True},
+                ),
+                "lm_head",
+            )
+        else:
+            self.add_sequential_layer(LayerDesc(DeepseekV2LMHeadPipe, config=config), "lm_head")
+
+        recompute_interval = 0
+        if self.enable_recompute and self.recompute_granularity == "full":
+            assert self.config.pp_recompute_interval <= config.num_hidden_layers // (
+                virtual_pp_degree * get_hcg().topology().get_dim_size("pipe")
+            ), "pp recompute interval should smaller than num layers of each pp chunk"
+            recompute_interval = self.config.pp_recompute_interval
+
+        seg_method = "layer:DeepseekV2DecoderLayer"
+        if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0:
+            seg_method = "uniform"
+
+        PipelineLayer.__init__(
+            self,
+            layers=self.get_sequential_layers(),
+            loss_fn=self.get_loss_fn(config),
+            topology=get_hcg().topology(),
+            seg_method=seg_method,
+            recompute_interval=recompute_interval,
+            recompute_ctx={
+                "mp_group": get_hcg().get_model_parallel_group(),
+                "offload": False,
+                "partition": False,
+            },
+            num_virtual_pipeline_stages=virtual_pp_degree,
+        )
+        # You should call init here, since there is a  diamond inheritance problem
+        self.apply(self._init_weights)
+        # DON'T init PipelinePretrainedModel
+        # PipelinePretrainedModel.__init__(self.super(), config=config)
+
+    def get_loss_fn(self, config):
+        return DeepseekV2PretrainingCriterion(config)
diff --git a/paddlenlp/transformers/deepseek_v3/__init__.py b/paddlenlp/transformers/deepseek_v3/__init__.py
index 6709d0167aa8..a9e40981dc0d 100644
--- a/paddlenlp/transformers/deepseek_v3/__init__.py
+++ b/paddlenlp/transformers/deepseek_v3/__init__.py
@@ -14,3 +14,5 @@
 
 from .configuration import *
 from .modeling import *
+from .modeling_auto import *
+from .modeling_pp import *
diff --git a/paddlenlp/transformers/deepseek_v3/modeling.py b/paddlenlp/transformers/deepseek_v3/modeling.py
index 782568767bd2..8008aa2ce68d 100644
--- a/paddlenlp/transformers/deepseek_v3/modeling.py
+++ b/paddlenlp/transformers/deepseek_v3/modeling.py
@@ -27,14 +27,14 @@
 
 from ..deepseek_v2.modeling import (
     DeepseekV2ForSequenceClassification,
-    DeepSeekV2LMHead,
+    DeepseekV2LMHead,
     DeepseekV2Model,
     DeepseekV2PretrainedModel,
-    DeepSeekV2PretrainingCriterion,
+    DeepseekV2PretrainingCriterion,
 )
 from ..model_outputs import CausalLMOutputWithPast
 from ..model_utils import register_base_model
-from .configuration import DeepseekV2Config
+from .configuration import DeepseekV3Config
 
 __all__ = [
     "DeepseekV3ForCausalLM",
@@ -45,26 +45,26 @@
 
 
 class DeepseekV3PretrainedModel(DeepseekV2PretrainedModel):
-    config_class = DeepseekV2Config
+    config_class = DeepseekV3Config
     base_model_prefix = "deepseek_v3"
     _no_split_modules = ["DeepseekV2DecoderLayer"]
 
 
 @register_base_model
 class DeepseekV3Model(DeepseekV2Model):
-    def __init__(self, config: DeepseekV2Config):
+    def __init__(self, config: DeepseekV3Config):
         super().__init__(config)
 
 
 class DeepseekV3ForCausalLM(DeepseekV3PretrainedModel):
     _tied_weights_keys = ["lm_head.weight"]
 
-    def __init__(self, config: DeepseekV2Config):
+    def __init__(self, config: DeepseekV3Config):
         super().__init__(config)
         self.deepseek_v3 = DeepseekV3Model(config)
         self.vocab_size = config.vocab_size
-        self.lm_head = DeepSeekV2LMHead(config)
-        self.criterion = DeepSeekV2PretrainingCriterion(config)
+        self.lm_head = DeepseekV2LMHead(config)
+        self.criterion = DeepseekV2PretrainingCriterion(config)
 
     def get_input_embeddings(self):
         return self.deepseek_v3.embed_tokens
diff --git a/paddlenlp/transformers/deepseek_v3/modeling_auto.py b/paddlenlp/transformers/deepseek_v3/modeling_auto.py
new file mode 100644
index 000000000000..1dff442f0f88
--- /dev/null
+++ b/paddlenlp/transformers/deepseek_v3/modeling_auto.py
@@ -0,0 +1,204 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+# Copyright 2023 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Paddle DeepSeek_V3 model."""
+
+from __future__ import annotations
+
+from typing import List, Optional, Tuple, Union
+
+import paddle
+
+try:
+    from paddle.incubate.nn.functional import fused_rotary_position_embedding
+except ImportError:
+    fused_rotary_position_embedding = None
+
+try:
+    from paddle.nn.functional.flash_attention import flash_attention
+except:
+    flash_attention = None
+
+import paddle.distributed as dist
+
+from ...utils.log import logger
+from ..deepseek_v2.modeling_auto import (
+    DeepseekV2LMHeadAuto,
+    DeepseekV2ModelAuto,
+    DeepseekV2PretrainedModelAuto,
+    DeepseekV2PretrainingCriterion,
+)
+from ..model_outputs import CausalLMOutputWithPast
+from ..model_utils import register_base_model
+from .configuration import DeepseekV2Config
+
+__all__ = [
+    "DeepseekV3LMHeadAuto",
+    "DeepseekV3ForCausalLMAuto",
+    "DeepseekV3ModelAuto",
+    "DeepseekV3PretrainedModelAuto",
+]
+
+
+class DeepseekV3PretrainedModelAuto(DeepseekV2PretrainedModelAuto):
+    config_class = DeepseekV2Config
+    base_model_prefix = "deepseek_v3"
+    _no_split_modules = ["DeepseekV2DecoderLayerAuto"]
+
+
+@register_base_model
+class DeepseekV3ModelAuto(DeepseekV2ModelAuto):
+    def __init__(self, config: DeepseekV2Config):
+        super().__init__(config)
+
+
+class DeepseekV3LMHeadAuto(DeepseekV2LMHeadAuto):
+    def __init__(self, config: DeepseekV2Config):
+        super().__init__(config)
+
+
+class DeepseekV3ForCausalLMAuto(DeepseekV3PretrainedModelAuto):
+    _tied_weights_keys = ["lm_head.weight"]
+
+    def __init__(self, config: DeepseekV2Config):
+        super().__init__(config)
+        self.config = config
+        self.deepseek_v3 = DeepseekV3ModelAuto(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = DeepseekV3LMHeadAuto(config)
+        self.criterion = DeepseekV2PretrainingCriterion(config)
+
+    def get_input_embeddings(self):
+        return self.deepseek_v3.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.deepseek_v3.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.deepseek_v3 = decoder
+
+    def get_decoder(self):
+        return self.deepseek_v3
+
+    def forward(
+        self,
+        input_ids: paddle.Tensor = None,
+        position_ids: Optional[paddle.Tensor] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        inputs_embeds: Optional[paddle.Tensor] = None,
+        labels: Optional[paddle.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        attn_mask_startend_row_indices=None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the masked language modeling loss. Indices should either be in `[0, transformers.,
+                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, transformers., config.vocab_size]`.
+
+        Returns:
+
+        Example:
+
+        ```python
+        >>> from transformers import AutoTokenizer, DeepseekV3ForCausalLMAuto
+
+        >>> model = DeepseekV3ForCausalLMAuto.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
+
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        input_ids.stop_gradient = True
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if attn_mask_startend_row_indices is not None and attention_mask is not None:
+            logger.warning(
+                "You have provided both attn_mask_startend_row_indices and attention_mask. "
+                "The attn_mask_startend_row_indices will be used."
+            )
+            attention_mask = None
+
+        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.deepseek_v3(
+            input_ids=input_ids,
+            position_ids=position_ids,
+            attention_mask=attention_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            past_key_values=past_key_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+        )
+
+        hidden_states = outputs[0]
+
+        # if labels is None，means we need full output, instead of tensor_parallel_output
+        # tensor_parallel_output is together with ParallelCrossEntropy
+        tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1
+
+        logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+
+        return logits
+
+    def auto_dist_config(self, prefix=""):
+        if prefix != "":
+            assert prefix.endswith(".")
+        config = {
+            "dp_config": {"sharding_level": 1, "offload": False, "exclude_layer": None},
+            "mp_config": {
+                "parallelize_plan": {
+                    f"{prefix}deepseek_v3.embed_tokens": dist.ColWiseParallel(gather_output=True),
+                    f"{prefix}deepseek_v3.layers.*.self_attn.q_b_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.self_attn.q_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.self_attn.kv_b_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.self_attn.o_proj": dist.RowWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.mlp.gate_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.mlp.up_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.mlp.down_proj": dist.RowWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.mlp.shared_experts.gate_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.mlp.shared_experts.up_proj": dist.ColWiseParallel(),
+                    f"{prefix}deepseek_v3.layers.*.mlp.shared_experts.down_proj": dist.RowWiseParallel(),
+                    f"{prefix}lm_head.weight": dist.ColWiseParallel(),
+                }
+            },
+        }
+        return config
diff --git a/paddlenlp/transformers/deepseek_v3/modeling_pp.py b/paddlenlp/transformers/deepseek_v3/modeling_pp.py
new file mode 100644
index 000000000000..e48a7dabc2d6
--- /dev/null
+++ b/paddlenlp/transformers/deepseek_v3/modeling_pp.py
@@ -0,0 +1,41 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from ..deepseek_v2.modeling_pp import DeepseekV2ForCausalLMPipe
+from .configuration import DeepseekV3Config
+from .modeling import DeepseekV3PretrainedModel
+
+__all__ = [
+    "DeepseekV3ForCausalLMPipe",
+]
+
+
+class DeepseekV3ForCausalLMPipe(DeepseekV2ForCausalLMPipe):
+    """DeepseekV2ForPretraining adapted for pipeline parallelism.
+
+    The largest change is flattening the DeepseekV2Model class so we can express it as a
+    sequence of layers including embedding, transformer layers, and output.
+    """
+
+    config_class = DeepseekV3Config
+    _base_model = DeepseekV3PretrainedModel
+    _get_tensor_parallel_mappings = DeepseekV3PretrainedModel._get_tensor_parallel_mappings
+    _init_weights = DeepseekV3PretrainedModel._init_weights
+    _keys_to_ignore_on_load_unexpected = DeepseekV3PretrainedModel._keys_to_ignore_on_load_unexpected
+    _get_model_flops = DeepseekV3PretrainedModel._get_model_flops
+    _get_hardware_flops = DeepseekV3PretrainedModel._get_hardware_flops
+    _tied_weights_keys = ["lm_head.weight"]
+
+    # DONOT Add base_model_prefix !!!!
diff --git a/paddlenlp/transformers/gemma/modeling.py b/paddlenlp/transformers/gemma/modeling.py
index 1aa75ece7a21..e7cfb6fa6856 100644
--- a/paddlenlp/transformers/gemma/modeling.py
+++ b/paddlenlp/transformers/gemma/modeling.py
@@ -55,7 +55,7 @@
 from .. import linear_utils
 from ..linear_utils import Linear
 from ..segment_parallel_utils import ReshardLayer
-from ..utils import caculate_llm_flops
+from ..utils import caculate_llm_per_token_flops
 from .configuration import (
     GEMMA_PRETRAINED_INIT_CONFIGURATION,
     GEMMA_PRETRAINED_RESOURCE_FILES_MAP,
@@ -898,6 +898,37 @@ class GemmaPretrainedModel(PretrainedModel):
     _keys_to_ignore_on_load_unexpected = []
     _keep_in_fp32_modules = ["inv_freq", "rotary_emb", "cos_cached", "sin_cached"]
 
+    def _get_model_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=False,
+        )
+
+    def _get_hardware_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=self.config.recompute,
+            recompute_granularity=self.config.recompute_granularity,
+        )
+
     @classmethod
     def _get_name_mappings(cls, config: GemmaConfig) -> List[StateDictNameMapping]:
         mappings: list[StateDictNameMapping] = []
@@ -1075,39 +1106,6 @@ def __init__(self, config: GemmaConfig):
 
         self.gradient_checkpointing = False
 
-    def get_model_flops(self, batch_size=1, seq_length=None, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=False,
-        )
-
-    def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=recompute,
-            recompute_granularity=self.config.recompute_granularity,
-        )
-
     def get_input_embeddings(self):
         return self.embed_tokens
 
@@ -1560,11 +1558,30 @@ def forward(
         # tensor_parallel_output is togather with ParallelCrossEntropy
         tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1
 
-        logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+        if labels is not None and self.config.use_fused_linear_cross_entropy:
+            from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy
+
+            assert (
+                self.config.tensor_parallel_degree <= 1
+            ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel "
+
+            masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels)
+
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
+            logits = None
+        else:
+            logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
 
-        loss = None
-        if labels is not None:
-            loss = self.criterion(logits, labels)
+            loss = None
+            if labels is not None:
+                loss = self.criterion(logits, labels)
 
         if not return_dict:
             output = (logits,) + outputs[1:]
diff --git a/paddlenlp/transformers/gemma/modeling_pp.py b/paddlenlp/transformers/gemma/modeling_pp.py
index 8839248a28c4..66f4a2c200ec 100644
--- a/paddlenlp/transformers/gemma/modeling_pp.py
+++ b/paddlenlp/transformers/gemma/modeling_pp.py
@@ -237,6 +237,8 @@ class GemmaForCausalLMPipe(PipelinePretrainedModel, PipelineLayer):
     _get_tensor_parallel_mappings = GemmaPretrainedModel._get_tensor_parallel_mappings
     _init_weights = GemmaPretrainedModel._init_weights
     _keys_to_ignore_on_load_unexpected = GemmaPretrainedModel._keys_to_ignore_on_load_unexpected
+    _get_model_flops = GemmaPretrainedModel._get_model_flops
+    _get_hardware_flops = GemmaPretrainedModel._get_hardware_flops
 
     # DONOT Add base_model_prefix !!!!
 
diff --git a/paddlenlp/transformers/gpt/modeling.py b/paddlenlp/transformers/gpt/modeling.py
index 1a5bbd3ed698..16cf34aaa24d 100644
--- a/paddlenlp/transformers/gpt/modeling.py
+++ b/paddlenlp/transformers/gpt/modeling.py
@@ -53,7 +53,7 @@
     TokenClassifierOutput,
 )
 from ..model_utils import dy2st_nocheck_guard_context
-from ..utils import caculate_llm_flops
+from ..utils import caculate_llm_per_token_flops
 from .configuration import (
     GPT_PRETRAINED_INIT_CONFIGURATION,
     GPT_PRETRAINED_RESOURCE_FILES_MAP,
@@ -805,6 +805,37 @@ class GPTPretrainedModel(PretrainedModel):
     pretrained_init_configuration = GPT_PRETRAINED_INIT_CONFIGURATION
     pretrained_resource_files_map = GPT_PRETRAINED_RESOURCE_FILES_MAP
 
+    def _get_model_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=False,
+        )
+
+    def _get_hardware_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=self.config.recompute,
+            recompute_granularity=self.config.recompute_granularity,
+        )
+
     @classmethod
     def _get_tensor_parallel_mappings(cls, config, is_split=True):
 
@@ -1106,39 +1137,6 @@ def __init__(self, config: GPTConfig):
             decoder_layers,
         )
 
-    def get_model_flops(self, batch_size=1, seq_length=None, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=False,
-        )
-
-    def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=recompute,
-            recompute_granularity=self.config.recompute_granularity,
-        )
-
     def get_input_embeddings(self):
         return self.embeddings.word_embeddings
 
diff --git a/paddlenlp/transformers/gpt/modeling_pp.py b/paddlenlp/transformers/gpt/modeling_pp.py
index 7734e8a990ed..02ee09151b85 100644
--- a/paddlenlp/transformers/gpt/modeling_pp.py
+++ b/paddlenlp/transformers/gpt/modeling_pp.py
@@ -167,6 +167,9 @@ class GPTForCausalLMPipe(PipelinePretrainedModel, PipelineLayer):
     pretrained_init_configuration = GPTPretrainedModel.pretrained_init_configuration
     pretrained_resource_files_map = GPTPretrainedModel.pretrained_resource_files_map
 
+    _get_model_flops = GPTPretrainedModel._get_model_flops
+    _get_hardware_flops = GPTPretrainedModel._get_hardware_flops
+
     # NO base_model_prefix !!!!
 
     def __init__(
diff --git a/paddlenlp/transformers/llama/fusion_ops.py b/paddlenlp/transformers/llama/fusion_ops.py
index c2a95541fbbd..62f3660a5bfe 100644
--- a/paddlenlp/transformers/llama/fusion_ops.py
+++ b/paddlenlp/transformers/llama/fusion_ops.py
@@ -177,8 +177,10 @@ def fusion_flash_attention(
     npu_is_casual=False,
     skip_recompute=False,
 ):
-    bsz, q_len, num_heads, head_dim = query_states.shape
-    _, kv_seq_len, _, _ = value_states.shape
+    # Note:
+    # 1. The head_dim of query_states and key_states should be the same. And the head_dim of value_states should be used for reshape.
+    bsz, q_len, num_heads, _ = query_states.shape
+    _, kv_seq_len, _, head_dim = value_states.shape
     version = paddle.version.full_version
     if version != "0.0.0" and version <= "2.5.2":
         if alibi is not None:
diff --git a/paddlenlp/transformers/llama/modeling.py b/paddlenlp/transformers/llama/modeling.py
index dc3318b621a2..c0308a7b1297 100755
--- a/paddlenlp/transformers/llama/modeling.py
+++ b/paddlenlp/transformers/llama/modeling.py
@@ -80,7 +80,7 @@ def swiglu(x, y=None):
 from .. import linear_utils
 from ..linear_utils import Linear
 from ..segment_parallel_utils import ReshardLayer
-from ..utils import caculate_llm_flops
+from ..utils import caculate_llm_per_token_flops
 from .configuration import (
     LLAMA_PRETRAINED_INIT_CONFIGURATION,
     LLAMA_PRETRAINED_RESOURCE_FILES_MAP,
@@ -259,17 +259,20 @@ def scaled_dot_product_attention(
         key_states = paddle.transpose(key_states, [0, 2, 1, 3])
         value_states = paddle.transpose(value_states, [0, 2, 1, 3])
 
-        # matmul and devide by sqrt(head_dim)
-        if get_env_device() == "intel_hpu":
-            # optimize div(const) to mul(const) for better performance
-            attn_weights = paddle.matmul(query_states * (1 / math.sqrt(head_dim)), key_states.transpose([0, 1, 3, 2]))
+        # Add pre divided factor to fix nan under float16.
+        if paddle.in_dynamic_mode() and query_states.dtype == paddle.float16:
+            pre_divided_factor = 32
         else:
-            attn_weights = paddle.matmul(query_states / math.sqrt(head_dim), key_states.transpose([0, 1, 3, 2]))
+            pre_divided_factor = 1
+
+        attn_weights = paddle.matmul(
+            query_states * (1 / (math.sqrt(head_dim) * pre_divided_factor)), key_states.transpose([0, 1, 3, 2])
+        )
 
         # then add alibi bias
         if alibi is not None:
             alibi = alibi.reshape([bsz, num_heads, 1, -1])
-            attn_weights = attn_weights + alibi
+            attn_weights = attn_weights + alibi / pre_divided_factor
 
         if paddle.in_dynamic_mode() and attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]:
             raise ValueError(
@@ -297,7 +300,9 @@ def scaled_dot_product_attention(
             attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype)
         else:
             with paddle.amp.auto_cast(False):
-                attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype)
+                attn_weights = F.softmax(attn_weights.astype("float32") * pre_divided_factor, axis=-1).astype(
+                    query_states.dtype
+                )
 
         attn_output = paddle.matmul(attn_weights, value_states)
         attn_output = attn_output.transpose([0, 2, 1, 3])
@@ -1294,6 +1299,37 @@ class LlamaPretrainedModel(PretrainedModel):
     pretrained_resource_files_map = LLAMA_PRETRAINED_RESOURCE_FILES_MAP
     _keys_to_ignore_on_load_unexpected = [r"self_attn.rotary_emb.inv_freq"]
 
+    def _get_model_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=False,
+        )
+
+    def _get_hardware_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=self.config.recompute,
+            recompute_granularity=self.config.recompute_granularity,
+        )
+
     @classmethod
     def _get_name_mappings(cls, config: LlamaConfig) -> list[StateDictNameMapping]:
         mappings: list[StateDictNameMapping] = []
@@ -1536,39 +1572,6 @@ def __init__(self, config: LlamaConfig):
 
         self.gradient_checkpointing = False
 
-    def get_model_flops(self, batch_size=1, seq_length=None, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=False,
-        )
-
-    def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=recompute,
-            recompute_granularity=self.config.recompute_granularity,
-        )
-
     def get_input_embeddings(self):
         return self.embed_tokens
 
@@ -2123,11 +2126,30 @@ def forward(
 
         hidden_states = outputs[0]  # [bs, seq_len, dim]
 
-        logits = self.lm_head(hidden_states)
+        if labels is not None and self.config.use_fused_linear_cross_entropy:
+            from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy
+
+            assert (
+                self.config.tensor_parallel_degree <= 1
+            ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel "
+
+            masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels)
+
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
+            logits = None
+        else:
+            logits = self.lm_head(hidden_states)
 
-        loss = None
-        if labels is not None:
-            loss = self.criterion(logits, labels)
+            loss = None
+            if labels is not None:
+                loss = self.criterion(logits, labels)
 
         if not return_dict:
             output = (logits,) + outputs[1:]
diff --git a/paddlenlp/transformers/llama/modeling_auto.py b/paddlenlp/transformers/llama/modeling_auto.py
index a629cf1ec955..3edc22d601fd 100644
--- a/paddlenlp/transformers/llama/modeling_auto.py
+++ b/paddlenlp/transformers/llama/modeling_auto.py
@@ -52,7 +52,9 @@ def swiglu(x, y=None):
     CausalLMOutputWithCrossAttentions,
 )
 from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model
+from paddlenlp.utils.tools import get_env_device
 
+from . import fusion_ops
 from .configuration import (
     LLAMA_PRETRAINED_INIT_CONFIGURATION,
     LLAMA_PRETRAINED_RESOURCE_FILES_MAP,
@@ -69,7 +71,6 @@ def swiglu(x, y=None):
     build_alibi_tensor,
     get_triangle_upper_mask,
     repeat_kv,
-    rms_norm_fused,
 )
 
 try:
@@ -218,7 +219,9 @@ def __init__(self, config, ipp):
 
     def forward(self, hidden_states):
         if self.config.use_fused_rms_norm:
-            return rms_norm_fused(hidden_states, self.weight, self.variance_epsilon)
+            return fusion_ops.fusion_rms_norm(
+                hidden_states, self.weight, self.variance_epsilon, self.config.use_fast_layer_norm
+            )
 
         with paddle.amp.auto_cast(False):
             variance = hidden_states.astype("float32").pow(2).mean(-1, keepdim=True)
@@ -308,7 +311,7 @@ def __init__(self, config: LlamaConfig, layerwise_recompute: bool = False, ipp:
         self.ipp = ipp
 
         self.use_fused_rope = config.use_fused_rope
-        if self.use_fused_rope:
+        if self.use_fused_rope and get_env_device() not in ["npu", "mlu", "xpu", "gcu", "intel_hpu"]:
             if "gpu" not in paddle.device.get_device() or fused_rotary_position_embedding is None:
                 warnings.warn(
                     "Enable fuse rope in the config, but fuse rope is not available. "
@@ -935,7 +938,22 @@ def _prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values
         else:
             expanded_attn_mask = _make_causal_mask(input_shape, past_key_values_length=past_key_values_length)
         # Convert bool attention_mask to float attention mask, which will be added to attention_scores later
-        expanded_attn_mask = paddle.where(expanded_attn_mask, 0.0, paddle.finfo(dtype).min).astype(dtype)
+        if get_env_device() in ["npu", "mlu", "intel_hpu"]:
+            x = paddle.to_tensor(0.0, dtype="float32")
+            y = paddle.to_tensor(paddle.finfo(dtype).min, dtype="float32")
+            expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), x, y).astype(dtype)
+        elif get_env_device() == "xpu":
+            x = paddle.to_tensor(0.0, dtype="float32")
+            y = paddle.to_tensor(-1.7005809656952787e38, dtype="float32")
+            expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), x, y)
+        elif get_env_device() == "gcu":
+            min_val = paddle.finfo(dtype).min
+            x = paddle.to_tensor(0.0, dtype=dtype)
+            y = paddle.to_tensor(min_val, dtype=dtype)
+            expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), x, y).astype(dtype)
+        else:
+            expanded_attn_mask = paddle.where(expanded_attn_mask, 0.0, paddle.finfo(dtype).min)
+            expanded_attn_mask = expanded_attn_mask.astype(dtype)
         return expanded_attn_mask
 
     def forward(
@@ -1166,8 +1184,27 @@ def forward(self, prediction_scores, masked_lm_labels):
                     masked_lm_labels.unsqueeze(2),
                 )
 
-            masked_lm_loss = paddle.masked_select(masked_lm_loss, masked_lm_loss > 0).astype("float32")
-            loss = paddle.mean(masked_lm_loss)
+            # XPU dose not support allgather mask with bool dtype, so we use LocalLayer here.
+            if get_env_device() == "xpu":
+
+                class LocalLossLayer(paddle.distributed.LocalLayer):
+                    def __init__(self, out_dist_attrs):
+                        super().__init__(out_dist_attrs)
+
+                    def forward(self, x, mask):
+                        masked_lm_loss = paddle.masked_select(x, mask).astype("float32")
+                        loss = paddle.mean(masked_lm_loss)
+                        return loss
+
+                out_dist_attrs = [
+                    (masked_lm_loss.process_mesh, [dist.Partial(dist.ReduceType.kRedSum), dist.Replicate()]),
+                ]
+                loss_func = LocalLossLayer(out_dist_attrs)
+                loss = loss_func(masked_lm_loss, masked_lm_loss > 0)
+            else:
+                masked_lm_loss = paddle.masked_select(masked_lm_loss, masked_lm_loss > 0).astype("float32")
+                loss = paddle.mean(masked_lm_loss)
+
         return loss
 
 
@@ -1175,6 +1212,7 @@ class LlamaLMHeadAuto(nn.Layer):
     def __init__(self, config: LlamaConfig):
         super(LlamaLMHeadAuto, self).__init__()
         self.config = config
+
         vocab_size = config.vocab_size
         self.weight = self.create_parameter(
             shape=[config.hidden_size, vocab_size],
diff --git a/paddlenlp/transformers/llama/modeling_pp.py b/paddlenlp/transformers/llama/modeling_pp.py
index bcf927bd1e96..a3f21326626c 100644
--- a/paddlenlp/transformers/llama/modeling_pp.py
+++ b/paddlenlp/transformers/llama/modeling_pp.py
@@ -350,6 +350,9 @@ class LlamaForCausalLMPipe(PipelinePretrainedModel, PipelineLayer):
     _get_fuse_or_split_param_mappings = LlamaPretrainedModel._get_fuse_or_split_param_mappings
     _init_weights = LlamaPretrainedModel._init_weights
     _keys_to_ignore_on_load_unexpected = LlamaPretrainedModel._keys_to_ignore_on_load_unexpected
+    _get_model_flops = LlamaPretrainedModel._get_model_flops
+    _get_hardware_flops = LlamaPretrainedModel._get_hardware_flops
+
     _tied_weights_keys = ["lm_head.weight"]
 
     # DONOT Add base_model_prefix !!!!
diff --git a/paddlenlp/transformers/llm_embed/__init__.py b/paddlenlp/transformers/llm_embed/__init__.py
new file mode 100644
index 000000000000..0f0d00141b52
--- /dev/null
+++ b/paddlenlp/transformers/llm_embed/__init__.py
@@ -0,0 +1,15 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .modeling import *
diff --git a/paddlenlp/transformers/llm_embed/modeling.py b/paddlenlp/transformers/llm_embed/modeling.py
new file mode 100644
index 000000000000..b50128e5c8f2
--- /dev/null
+++ b/paddlenlp/transformers/llm_embed/modeling.py
@@ -0,0 +1,298 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+import paddle
+import paddle.distributed as dist
+import paddle.nn as nn
+from tqdm import tqdm
+
+from ...utils.log import logger
+from .. import AutoConfig, AutoModel, PretrainedModel
+from ..model_outputs import ModelOutput
+
+
+@dataclass
+class EncoderOutput(ModelOutput):
+    q_reps: Optional[paddle.Tensor] = None
+    p_reps: Optional[paddle.Tensor] = None
+    loss: Optional[paddle.Tensor] = None
+    scores: Optional[paddle.Tensor] = None
+
+
+__all__ = ["BiEncoderModel"]
+
+
+class BiEncoderModel(PretrainedModel):
+    def __init__(
+        self,
+        model_name_or_path: str = None,
+        dtype: str = "float16",
+        normalized: bool = False,
+        sentence_pooling_method: str = "cls",
+        negatives_cross_device: bool = False,
+        temperature: float = 1.0,
+        use_inbatch_neg: bool = True,
+        margin: float = 0.3,
+        matryoshka_dims: Optional[List[int]] = None,
+        matryoshka_loss_weights: Optional[List[float]] = None,
+        query_instruction: Optional[str] = None,
+        document_instruction: Optional[str] = None,
+        eval_batch_size: int = 8,
+        tokenizer=None,
+        max_seq_length: int = 4096,
+    ):
+        super().__init__()
+        self.model = AutoModel.from_pretrained(model_name_or_path, dtype=dtype, convert_from_torch=True)
+        self.model_config = AutoConfig.from_pretrained(model_name_or_path)
+        self.cross_entropy = nn.CrossEntropyLoss(reduction="mean")
+
+        self.normalized = normalized
+        self.sentence_pooling_method = sentence_pooling_method
+        self.temperature = temperature
+        self.use_inbatch_neg = use_inbatch_neg
+        self.config = self.model_config
+        self.margin = margin
+        self.matryoshka_dims = matryoshka_dims
+
+        self.query_instruction = query_instruction
+        self.document_instruction = document_instruction
+        self.eval_batch_size = eval_batch_size
+        self.tokenizer = tokenizer
+        self.max_seq_length = max_seq_length
+
+        if self.matryoshka_dims:
+            self.matryoshka_loss_weights = (
+                matryoshka_loss_weights if matryoshka_loss_weights else [1] * len(self.matryoshka_dims)
+            )
+        else:
+            self.matryoshka_loss_weights = None
+
+        if not normalized:
+            self.temperature = 1.0
+            logger.info("reset temperature = 1.0 due to using inner product to compute similarity")
+
+        self.negatives_cross_device = negatives_cross_device
+        if self.negatives_cross_device:
+            if not dist.is_initialized():
+                raise ValueError("Distributed training has not been initialized for representation all gather.")
+            self.process_rank = dist.get_rank()
+            self.world_size = dist.get_world_size()
+
+    def sentence_embedding(self, hidden_state, mask):
+        if self.sentence_pooling_method == "mean":
+            s = paddle.sum(hidden_state * mask.unsqueeze(-1).float(), axis=1)
+            d = mask.sum(axis=1, keepdim=True).float()
+            return s / d
+        elif self.sentence_pooling_method == "cls":
+            return hidden_state[:, 0]
+        elif self.sentence_pooling_method == "last":
+            # return hidden_state[:, -1] # this is for padding side is left
+            sequence_lengths = mask.sum(axis=1)
+            last_token_indices = sequence_lengths - 1
+            embeddings = hidden_state[paddle.arange(hidden_state.shape[0]), last_token_indices]
+            return embeddings
+        else:
+            raise ValueError(f"Invalid sentence pooling method: {self.sentence_pooling_method}")
+
+    def get_model_config(
+        self,
+    ):
+        return self.model_config.to_dict()
+
+    def encode(self, features):
+        psg_out = self.model(**features, return_dict=True, output_hidden_states=True)
+        p_reps = self.sentence_embedding(psg_out.hidden_states[-1], features["attention_mask"])
+        return p_reps
+
+    def compute_similarity(self, q_reps, p_reps):
+        # q_reps [batch_size, embedding_dim]
+        # p_reps [batch_size, embedding_dim]
+        return paddle.matmul(q_reps, p_reps.transpose([1, 0]))
+
+    def hard_negative_loss(self, q_reps, p_reps):
+        scores = self.compute_similarity(q_reps, p_reps)
+        scores = scores / self.temperature
+        scores = scores.reshape([q_reps.shape[0], -1])
+
+        target = paddle.arange(scores.shape[0], dtype="int64")
+        target = target * (p_reps.shape[0] // q_reps.shape[0])
+        loss = self.compute_loss(scores, target)
+        return scores, loss
+
+    def in_batch_negative_loss(self, q_reps, p_reps):
+        # In batch negatives
+        scores = self.compute_similarity(q_reps, p_reps)
+        # Substract margin from all positive samples cosine_sim()
+        margin_diag = paddle.full(shape=[q_reps.shape[0]], fill_value=self.margin, dtype=q_reps.dtype)
+        scores = scores - paddle.diag(margin_diag)
+        # Scale cosine to ease training converge
+        scores = scores / self.temperature
+        target = paddle.arange(0, q_reps.shape[0], dtype="int64")
+        loss = self.compute_loss(scores, target)
+        return scores, loss
+
+    def forward(
+        self,
+        query: Dict[str, paddle.Tensor] = None,
+        passage: Dict[str, paddle.Tensor] = None,
+        teacher_score: paddle.Tensor = None,
+    ):
+        q_reps = self.encode(query)
+        p_reps = self.encode(passage)
+
+        # For non-matryoshka loss, we normalize the representations
+        if not self.matryoshka_dims:
+            if self.normalized:
+                q_reps = paddle.nn.functional.normalize(q_reps, axis=-1)
+                p_reps = paddle.nn.functional.normalize(p_reps, axis=-1)
+
+        if self.training:
+            # Cross device negatives
+            if self.negatives_cross_device:
+                q_reps = self._dist_gather_tensor(q_reps)
+                p_reps = self._dist_gather_tensor(p_reps)
+
+            if self.matryoshka_dims:
+                loss = 0.0
+                scores = 0.0
+                for loss_weight, dim in zip(self.matryoshka_loss_weights, self.matryoshka_dims):
+                    reduced_q = q_reps[:, :dim]
+                    reduced_d = p_reps[:, :dim]
+                    if self.normalized:
+                        reduced_q = paddle.nn.functional.normalize(reduced_q, axis=-1)
+                        reduced_d = paddle.nn.functional.normalize(reduced_d, axis=-1)
+
+                    if self.use_inbatch_neg:
+                        dim_score, dim_loss = self.in_batch_negative_loss(reduced_q, reduced_d)
+                    else:
+                        dim_score, dim_loss = self.hard_negative_loss(reduced_q, reduced_d)
+                    scores += dim_score
+                    loss += loss_weight * dim_loss
+
+            elif self.use_inbatch_neg:
+                scores, loss = self.in_batch_negative_loss(q_reps, p_reps)
+            else:
+                scores, loss = self.hard_negative_loss(q_reps, p_reps)
+
+        else:
+            scores = self.compute_similarity(q_reps, p_reps)
+            loss = None
+        return EncoderOutput(
+            loss=loss,
+            scores=scores,
+            q_reps=q_reps,
+            p_reps=p_reps,
+        )
+
+    def compute_loss(self, scores, target):
+        return self.cross_entropy(scores, target)
+
+    def _dist_gather_tensor(self, t: Optional[paddle.Tensor]):
+        if t is None:
+            return None
+
+        all_tensors = [paddle.empty_like(t) for _ in range(self.world_size)]
+        dist.all_gather(all_tensors, t)
+
+        all_tensors[self.process_rank] = t
+        all_tensors = paddle.concat(all_tensors, axis=0)
+
+        return all_tensors
+
+    def save_pretrained(self, output_dir: str, **kwargs):
+        state_dict = self.model.state_dict()
+        state_dict = type(state_dict)({k: v.clone().cpu() for k, v in state_dict.items()})
+        self.model.save_pretrained(output_dir, state_dict=state_dict)
+
+    @paddle.no_grad()
+    def encode_sentences(self, sentences: List[str], **kwargs) -> np.ndarray:
+        self.model.eval()
+        all_embeddings = []
+        for start_index in tqdm(range(0, len(sentences), self.eval_batch_size), desc="Batches"):
+            sentences_batch = sentences[start_index : start_index + self.eval_batch_size]
+
+            inputs = self.tokenizer(
+                sentences_batch,
+                padding=True,
+                truncation=True,
+                return_tensors="pd",
+                max_length=self.max_seq_length,
+                return_attention_mask=True,
+            )
+            outputs = self.model(
+                input_ids=inputs.input_ids,
+                attention_mask=inputs.attention_mask,
+                return_dict=True,
+                output_hidden_states=True,
+            )
+            last_hidden_state = outputs.hidden_states[-1]
+
+            if self.sentence_pooling_method == "last":
+                if self.tokenizer.padding_side == "right":
+                    sequence_lengths = inputs.attention_mask.sum(axis=1)
+                    last_token_indices = sequence_lengths - 1
+                    embeddings = last_hidden_state[paddle.arange(last_hidden_state.shape[0]), last_token_indices]
+                elif self.tokenizer.padding_side == "left":
+                    embeddings = last_hidden_state[:, -1]
+                else:
+                    raise NotImplementedError(f"Padding side {self.tokenizer.padding_side} not supported.")
+            elif self.sentence_pooling_method == "cls":
+                embeddings = last_hidden_state[:, 1]
+            elif self.sentence_pooling_method == "mean":
+                s = paddle.sum(last_hidden_state * inputs.attention_mask.unsqueeze(-1), axis=1)
+                d = inputs.attention_mask.sum(axis=1, keepdim=True)
+                embeddings = s / d
+            else:
+                raise NotImplementedError(f"Pooling method {self.pooling_method} not supported.")
+
+            embeddings = paddle.nn.functional.normalize(embeddings, p=2, axis=-1)
+
+            all_embeddings.append(embeddings.cpu().numpy().astype("float32"))
+
+        return np.concatenate(all_embeddings, axis=0)
+
+    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
+        """
+        This function will be used to encode queries for retrieval task
+        if there is a instruction for queries, we will add it to the query text
+        """
+        if self.query_instruction is not None:
+            input_texts = [f"{self.query_instruction}{query}" for query in queries]
+        else:
+            input_texts = queries
+        return self.encode_sentences(input_texts)
+
+    def encode_corpus(self, corpus: List[Union[Dict[str, str], str]], **kwargs) -> np.ndarray:
+        """
+        This function will be used to encode corpus for retrieval task
+        if there is a instruction for docs, we will add it to the doc text
+        """
+        if isinstance(corpus[0], dict):
+            if self.document_instruction is not None:
+                input_texts = [
+                    "{}{} {}".format(self.document_instruction, doc.get("title", ""), doc["text"]).strip()
+                    for doc in corpus
+                ]
+            else:
+                input_texts = ["{} {}".format(doc.get("title", ""), doc["text"]).strip() for doc in corpus]
+        else:
+            if self.document_instruction is not None:
+                input_texts = [f"{self.document_instruction}{doc}" for doc in corpus]
+            else:
+                input_texts = corpus
+        return self.encode_sentences(input_texts)
diff --git a/paddlenlp/transformers/model_utils.py b/paddlenlp/transformers/model_utils.py
index 811f13361486..857471f8209c 100644
--- a/paddlenlp/transformers/model_utils.py
+++ b/paddlenlp/transformers/model_utils.py
@@ -424,7 +424,7 @@ def load_state_dict(
         with safe_open(checkpoint_file, framework="np") as f:
             metadata = f.metadata()
         if metadata is None:
-            metadata = {"format", "np"}
+            metadata = {"format": "np"}
 
         if metadata.get("format", "np") not in ["pd", "np"]:
             raise OSError(
@@ -1161,7 +1161,7 @@ def set_inference_config(cls, config, predictor_args, **kwargs):
         tensor_parallel_degree = kwargs.pop("tensor_parallel_degree", 1)
         tensor_parallel_rank = kwargs.pop("tensor_parallel_rank", 0)
 
-        if predictor_args.mode == "dynamic" or predictor_args.speculate_method in ["eagle"]:
+        if predictor_args.mode == "dynamic" or predictor_args.speculate_method in ["eagle", "mtp"]:
             config.tensor_parallel_degree = tensor_parallel_degree
             config.tensor_parallel_rank = tensor_parallel_rank
             config.model_name_or_path = predictor_args.model_name_or_path
@@ -1203,11 +1203,12 @@ def set_inference_config(cls, config, predictor_args, **kwargs):
             config.speculate_max_ngram_size = predictor_args.speculate_max_ngram_size
             config.speculate_verify_window = predictor_args.speculate_verify_window
             config.speculate_max_candidate_len = predictor_args.speculate_max_candidate_len
-            if predictor_args.speculate_method == "eagle":
-                config.decode_strategy = "draft_model_sample"
-            else:
-                config.decode_strategy = "speculate_decoding"
-            config.return_full_hidden_states = predictor_args.return_full_hidden_states
+            if predictor_args.speculate_method is not None:
+                if config.get("speculate_model_type", "None") in ["eagle", "mtp"]:
+                    config.decode_strategy = "draft_model_sample"
+                else:
+                    config.decode_strategy = "speculate_decoding"
+        config.return_full_hidden_states = predictor_args.return_full_hidden_states
 
     @classmethod
     def confirm_inference_model(cls, predictor_args, **kwargs):
@@ -1291,18 +1292,16 @@ def get_memory_footprint(self, return_buffers=True):
         return mem
 
     def get_model_flops(self, *args, **kwargs):
-        base_model = getattr(self, self.base_model_prefix, self)
-        if base_model is not self:
-            return base_model.get_model_flops()
+        if hasattr(self, "_get_model_flops"):
+            return self._get_model_flops()
 
-        raise NotImplementedError(f"model of {type(base_model)} has not implemented the `get_model_flops`")
+        raise NotImplementedError(f"model of {type(self)} has not implemented the `_get_model_flops`")
 
     def get_hardware_flops(self, *args, **kwargs):
-        base_model = getattr(self, self.base_model_prefix, self)
-        if base_model is not self:
-            return base_model.get_hardware_flops()
+        if hasattr(self, "_get_hardware_flops"):
+            return self._get_hardware_flops()
 
-        raise NotImplementedError(f"model of {type(base_model)} has not implemented the `get_hardware_flops`")
+        raise NotImplementedError(f"model of {type(self)} has not implemented the `_get_hardware_flops`")
 
     def get_input_embeddings(self) -> nn.Embedding:
         """get input embedding of model
diff --git a/paddlenlp/transformers/moe_gate.py b/paddlenlp/transformers/moe_gate.py
index 8118ba60f7ac..995226de893b 100644
--- a/paddlenlp/transformers/moe_gate.py
+++ b/paddlenlp/transformers/moe_gate.py
@@ -69,7 +69,11 @@ def _one_hot_to_int64(self, x, num_classes):
 
     @paddle.no_grad()
     def _capacity(
-        self, gates: paddle.Tensor, capacity_factor: float, max_capacity: int, min_capacity: int
+        self,
+        gates: paddle.Tensor,
+        capacity_factor: float,
+        max_capacity: int,
+        min_capacity: int,
     ) -> paddle.Tensor:
         """Calculate the capacity for each expert based on the gates and capacity factor.
 
@@ -107,6 +111,7 @@ def _cal_aux_loss(self, gates, mask):
             paddle.Tensor: The value of auxiliary loss.
 
         """
+        # TODO: @DrownFish19 update aux_loss for Qwen2MoE and DeepSeekV2&V3
         me = paddle.mean(gates, axis=0)
         ce = paddle.mean(mask.cast("float32"), axis=0)
         if self.global_aux_loss:
@@ -131,7 +136,7 @@ def _cal_z_loss(self, logits) -> paddle.Tensor:
         Returns:
             paddle.Tensor: The z loss value.
         """
-        l_zloss = logits.exp().sum(1).log().square().mean()
+        l_zloss = paddle.logsumexp(logits, axis=1).square().mean()
         return l_zloss
 
     def _cal_orthogonal_loss(self) -> paddle.Tensor:
@@ -175,8 +180,14 @@ def __init__(self, config, num_experts, expert_hidden_size, **kwargs):
         self.top2_2nd_expert_sampling = kwargs.pop("top2_2nd_expert_sampling", True)
 
         self.drop_policy = kwargs.pop("drop_policy", "probs")
+        # Qwen2MoE: greedy
+        # DeepSeekV2&V3: group_limited_greedy for training, and noaux_tc for inference
+        self.topk_method = kwargs.pop("topk_method", "greedy")
         self.top_k = kwargs.pop("top_k", 2)
+        self.n_group = kwargs.pop("n_group", 1)  # for group_limited_greedy
+        self.topk_group = kwargs.pop("topk_group", 1)  # for group_limited_greedy
         self.norm_topk_prob = kwargs.pop("norm_topk_prob", False)
+        self.routed_scaling_factor = kwargs.pop("routed_scaling_factor", 1.0)
 
     def _priority(self, topk_idx: paddle.Tensor, capacity: int) -> paddle.Tensor:
         """_summary_
@@ -228,7 +239,7 @@ def _priority(self, topk_idx: paddle.Tensor, capacity: int) -> paddle.Tensor:
 
         return dispatch_mask
 
-    def topk_naive(self, scores: paddle.Tensor, k: int) -> Tuple[paddle.Tensor, paddle.Tensor]:
+    def _topk_greedy(self, scores: paddle.Tensor, k: int) -> Tuple[paddle.Tensor, paddle.Tensor]:
         """_summary_
 
         Args:
@@ -240,10 +251,10 @@ def topk_naive(self, scores: paddle.Tensor, k: int) -> Tuple[paddle.Tensor, padd
             topk_weight: [bsz*seq_len, k]
             topk_idx: [bsz*seq_len, k]
         """
-        topk_weight, topk_idx = paddle.topk(scores, k=k, axis=-1, sorted=False)
+        topk_weight, topk_idx = paddle.topk(scores, k=k, axis=-1, sorted=True)
         return topk_weight, topk_idx
 
-    def topk_group(
+    def _topk_group_limited_greedy(
         self, scores: paddle.Tensor, k: int, n_group: int, topk_group: int
     ) -> Tuple[paddle.Tensor, paddle.Tensor]:
         """_summary_
@@ -275,6 +286,43 @@ def topk_group(
 
         return topk_weight, topk_idx
 
+    def _topk_noaux_tc(
+        self, scores: paddle.Tensor, k: int, n_group: int, topk_group: int
+    ) -> Tuple[paddle.Tensor, paddle.Tensor]:
+        """_summary_
+
+        Args:
+            scores (paddle.Tensor): [bsz*seq_len, n_experts]
+            k (int): select the top k experts in each group
+            n_groups (int): the number of groups for all experts
+            topk_group (int): the number of groups selected
+
+        Returns:
+            Tuple[paddle.Tensor, paddle.Tensor]: topk_weight, topk_idx
+            topk_weight: [bsz*seq_len, k]
+            topk_idx: [bsz*seq_len, k]
+
+        Note: the group size is normal greater than the number of k
+        """
+        bsz_seq_len, n_experts = scores.shape
+        assert n_experts % n_group == 0, "n_experts must be divisible by n_groups"
+
+        assert self.e_score_correction_bias is not None, "e_score_correction_bias is None"
+        scores_for_choice = scores.reshape([bsz_seq_len, -1]) + self.e_score_correction_bias.unsqueeze(0)
+        group_scores = (
+            scores_for_choice.reshape([bsz_seq_len, self.n_group, -1]).topk(2, axis=-1)[0].sum(axis=-1)
+        )  # fmt:skip [n, n_group]
+        group_idx = paddle.topk(group_scores, k=topk_group, axis=-1, sorted=False)[1]  # [n, top_k_group]
+        group_mask = paddle.zeros_like(group_scores).put_along_axis(group_idx, paddle.to_tensor(1.0), axis=-1)  # fmt:skip
+        score_mask = (
+            group_mask.unsqueeze(-1).expand([bsz_seq_len, n_group, n_experts // n_group]).reshape([bsz_seq_len, -1])
+        )  # [n, e]
+        tmp_scores = scores_for_choice * score_mask  # [n, e]
+        topk_weight, topk_idx = paddle.topk(tmp_scores, k=k, axis=-1, sorted=False)
+        topk_weight = scores.take_along_axis(topk_idx, axis=1) if not self.training else topk_weight
+
+        return topk_weight, topk_idx
+
     def top1gating(
         self,
         logits: paddle.Tensor,
@@ -432,7 +480,22 @@ def topkgating(
         l_zloss = self._cal_z_loss(gates)
 
         # get topk gates
-        top_gate, top_idx = paddle.topk(gates, k=self.top_k, axis=1)
+        if self.topk_method == "greedy":
+            top_gate, top_idx = self._topk_greedy(gates, k=self.top_k)
+        elif self.topk_method == "group_limited_greedy":
+            top_gate, top_idx = self._topk_group_limited_greedy(
+                gates, k=self.top_k, n_group=self.n_group, topk_group=self.topk_group
+            )
+        elif self.topk_method == "noaux_tc":
+            top_gate, top_idx = self._topk_noaux_tc(
+                gates, k=self.top_k, n_group=self.n_group, topk_group=self.topk_group
+            )
+            # norm gate to sum 1
+        if self.top_k > 1 and self.norm_topk_prob:
+            denominator = top_gate.sum(axis=-1, keepdim=True) + 1e-20
+            top_gate = top_gate / denominator
+        top_gate = top_gate * self.routed_scaling_factor
+
         # get topk mask
         mask = paddle.zeros_like(gates).put_along_axis(top_idx, paddle.to_tensor(1.0), axis=1)
         l_aux = self._cal_aux_loss(gates, mask)
@@ -441,7 +504,12 @@ def topkgating(
 
         if self.drop_tokens:
             # Calculate configured capacity and remove locations outside capacity from mask
-            capacity = self._capacity(gates, self.capacity_factor * self.top_k, self.max_capacity, self.min_capacity)
+            capacity = self._capacity(
+                gates,
+                self.capacity_factor * self.top_k,
+                self.max_capacity,
+                self.min_capacity,
+            )
 
             # update mask and locations by capacity
             if self.drop_policy == "probs":
@@ -462,13 +530,21 @@ def topkgating(
             token_priority = self._priority(top_idx, capacity)
 
         # normalize gates
-        gates_masked = gates * mask
-        gates_s = paddle.sum(gates_masked, axis=-1, keepdim=True)
-        denom_s = paddle.clip(gates_s, min=paddle.finfo(gates_masked.dtype).eps)
-        if self.norm_topk_prob:
-            gates_masked = gates_masked / denom_s
+        if self.training:
+            gates_masked = gates * mask
+            gates_s = paddle.sum(gates_masked, axis=-1, keepdim=True)
+            denom_s = paddle.clip(gates_s, min=paddle.finfo(gates_masked.dtype).eps)
+            if self.norm_topk_prob:
+                gates_masked = gates_masked / denom_s
+            combine_weights = paddle.einsum(
+                "se,sec->sec", gates_masked, token_priority.cast(paddle.get_default_dtype())
+            )
+        else:
+            topk_masked_gates = paddle.zeros_like(gates).put_along_axis(top_idx, top_gate, axis=1)
+            combine_weights = paddle.einsum(
+                "se,sec->sec", topk_masked_gates, token_priority.cast(paddle.get_default_dtype())
+            )
 
-        combine_weights = paddle.einsum("se,sec->sec", gates_masked, token_priority.cast(paddle.get_default_dtype()))
         dispatch_mask = combine_weights.cast(paddle.bool)
 
         return capacity, combine_weights, dispatch_mask, exp_counts, l_aux, l_zloss
diff --git a/paddlenlp/transformers/moe_layer.py b/paddlenlp/transformers/moe_layer.py
index fa198da7b0b6..90d4feae6c72 100644
--- a/paddlenlp/transformers/moe_layer.py
+++ b/paddlenlp/transformers/moe_layer.py
@@ -177,7 +177,7 @@ def __init__(
         self.experts = nn.LayerList([])
         for i in range(self.moe_num_experts):
             if i // self.moe_num_experts_per_device == self.moe_rank:
-                self.experts.append(expert_class(expert_kwargs))
+                self.experts.append(expert_class(**expert_kwargs))
             else:
                 self.experts.append(None)
 
diff --git a/paddlenlp/transformers/nv_embed/__init__.py b/paddlenlp/transformers/nv_embed/__init__.py
new file mode 100644
index 000000000000..0f0d00141b52
--- /dev/null
+++ b/paddlenlp/transformers/nv_embed/__init__.py
@@ -0,0 +1,15 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .modeling import *
diff --git a/paddlenlp/transformers/nv_embed/modeling.py b/paddlenlp/transformers/nv_embed/modeling.py
new file mode 100644
index 000000000000..98004ac9428c
--- /dev/null
+++ b/paddlenlp/transformers/nv_embed/modeling.py
@@ -0,0 +1,530 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass
+from typing import Dict, List, Optional, Tuple, Union
+
+import numpy as np
+import paddle
+import paddle.distributed as dist
+import paddle.nn as nn
+import tqdm
+from paddle.distributed.fleet.utils import recompute
+
+from ...utils.log import logger
+from .. import AutoTokenizer, MistralModel, PretrainedConfig, PretrainedModel
+from ..model_outputs import BaseModelOutputWithPast, ModelOutput
+
+__all__ = ["NVEncodeModel"]
+
+
+@dataclass
+class EncoderOutput(ModelOutput):
+    q_reps: Optional[paddle.Tensor] = None
+    p_reps: Optional[paddle.Tensor] = None
+    loss: Optional[paddle.Tensor] = None
+    scores: Optional[paddle.Tensor] = None
+
+
+def scaled_dot_product_attention(q, k, v):  # [bs, len, num_heads, dim]
+    matmul_qk = paddle.matmul(q.transpose([0, 2, 1, 3]), k.transpose([0, 2, 3, 1]))
+    dk = paddle.to_tensor(k.shape[-1], dtype=paddle.float32)
+    scaled_attention_logits = matmul_qk / paddle.sqrt(dk)
+    attention_weights = paddle.nn.functional.softmax(scaled_attention_logits, axis=-1)  # [bs, num_heads, q_len, k_len]
+    output = paddle.matmul(attention_weights, v.transpose([0, 2, 1, 3]))  # [bs, num_heads, q_len, dim]
+    output = output.transpose([0, 2, 1, 3])  # [bs, q_len, num_heads, dim]
+    return output
+
+
+def _make_bidirection_mask(
+    input_ids_shape: paddle.shape,
+    dtype: paddle.dtype,
+    past_key_values_length: int = 0,
+):
+    """
+    Make bidirection mask used for sliding window attention
+    """
+    bsz, tgt_len = input_ids_shape
+
+    tensor = paddle.full(
+        (tgt_len, tgt_len),
+        fill_value=1,
+    )
+    mask = paddle.tril(tensor, diagonal=0)
+    mask = paddle.ones_like(mask)  # here is for bidirection attention
+    mask = paddle.log(mask).astype(dtype)
+
+    if past_key_values_length > 0:
+        mask = paddle.concat([paddle.zeros([tgt_len, past_key_values_length], dtype=dtype), mask], axis=-1)
+    return mask[None, None, :, :].expand([bsz, 1, tgt_len, tgt_len + past_key_values_length])
+
+
+def _expand_mask(mask: paddle.Tensor, dtype: paddle.dtype, tgt_len):
+    expanded_mask = mask
+    if len(mask.shape) == 2:
+        """
+        Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+        """
+        bsz, src_len = mask.shape
+        tgt_len = tgt_len if tgt_len is not None else src_len
+
+        expanded_mask = mask[:, None, None, :].expand([bsz, 1, tgt_len, src_len]).astype(dtype)
+    elif len(mask.shape) == 3:
+        """
+        Expands attention_mask from `[bsz, tgt_seq_len, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+        """
+        expanded_mask = mask.unsqueeze(1).astype(dtype)
+
+    inverted_mask = 1.0 - expanded_mask
+
+    return paddle.where(inverted_mask > 0.5, paddle.full_like(inverted_mask, paddle.finfo(dtype).min), inverted_mask)
+
+
+class LatentModel(PretrainedModel):
+    config_class = PretrainedConfig
+
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.cross_attend_blocks_0_fn_to_kv = paddle.nn.Linear(
+            in_features=config.hidden_size, out_features=2 * config.max_position_embeddings, bias_attr=False
+        )
+        self.cross_attend_blocks_0_fn_to_out = paddle.nn.Linear(
+            in_features=config.max_position_embeddings, out_features=config.hidden_size, bias_attr=False
+        )
+        self.cross_attend_blocks_0_fn_to_q = paddle.nn.Linear(
+            in_features=config.hidden_size, out_features=config.max_position_embeddings, bias_attr=False
+        )
+        self.cross_attend_blocks_0_norm = paddle.nn.LayerNorm(config.hidden_size)
+        self.cross_attend_blocks_0_norm_context = paddle.nn.LayerNorm(config.hidden_size)
+
+        self.cross_attend_blocks_1_fn_net_0 = paddle.nn.Linear(
+            in_features=config.hidden_size, out_features=config.max_position_embeddings
+        )
+        self.cross_attend_blocks_1_fn_net_2 = paddle.nn.Linear(
+            in_features=config.max_position_embeddings // 2, out_features=config.hidden_size
+        )
+        self.cross_attend_blocks_1_norm = paddle.nn.LayerNorm(config.hidden_size)
+
+        self.latents = paddle.nn.Linear(in_features=config.hidden_size, out_features=512, bias_attr=False)
+
+    def forward(self, last_hidden_states, pool_mask):
+        one = paddle.eye(
+            num_rows=self.config.hidden_size,
+            num_columns=self.config.hidden_size,
+            dtype=self.latents.weight.dtype,
+        )
+        self_latents_weight_T = self.latents(one).T
+        # latents = repeat(self_latents_weight_T, "d h -> b d h", b=last_hidden_states.shape[0]) # from einops import repeat
+        latents = paddle.tile(self_latents_weight_T, repeat_times=last_hidden_states.shape[0]).reshape(
+            [self_latents_weight_T.shape[0], last_hidden_states.shape[0], self_latents_weight_T.shape[1]]
+        )
+        latents = latents.transpose([1, 0, 2])
+
+        normed_x = self.cross_attend_blocks_0_norm(last_hidden_states)
+        normed_context = self.cross_attend_blocks_0_norm_context(latents)
+
+        q = self.cross_attend_blocks_0_fn_to_q(normed_x)
+        kv = self.cross_attend_blocks_0_fn_to_kv(normed_context)
+        k = kv[:, :, : self.config.max_position_embeddings]
+        v = kv[:, :, self.config.max_position_embeddings :]
+
+        # q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b n h d", h=self.config.num_key_value_heads), (q, k, v)) # from einops import rearrange
+        q = q.reshape(
+            [q.shape[0], q.shape[1], self.config.num_key_value_heads, q.shape[2] // self.config.num_key_value_heads]
+        )
+        k = k.reshape(
+            [k.shape[0], k.shape[1], self.config.num_key_value_heads, k.shape[2] // self.config.num_key_value_heads]
+        )
+        v = v.reshape(
+            [v.shape[0], v.shape[1], self.config.num_key_value_heads, v.shape[2] // self.config.num_key_value_heads]
+        )
+
+        # k.stop_gradient = False
+        # v.stop_gradient = False
+        # out = paddle.nn.functional.scaled_dot_product_attention(q, k, v) # if use this, must set k and v stop_gradient to False
+        out = scaled_dot_product_attention(q, k, v)  # if use this, no need to manually set k and v
+        # out = rearrange(out, "b n h d -> b n (h d)", h=self.config.num_key_value_heads) # from einops import rearrange
+        out = out.reshape([out.shape[0], out.shape[1], out.shape[2] * out.shape[3]])
+
+        out_of_layer1 = self.cross_attend_blocks_0_fn_to_out(out) + last_hidden_states
+
+        normed_x = self.cross_attend_blocks_1_norm(out_of_layer1)
+
+        before_geglu = self.cross_attend_blocks_1_fn_net_0(normed_x)
+
+        x_in_gegle = before_geglu[:, :, : self.config.max_position_embeddings // 2]
+        gate_in_geglu = before_geglu[:, :, self.config.max_position_embeddings // 2 :]
+        x_after_geglu = x_in_gegle * paddle.nn.functional.gelu(gate_in_geglu)
+
+        after_geglu = self.cross_attend_blocks_1_fn_net_2(x_after_geglu)
+
+        out_of_layer2 = after_geglu + out_of_layer1
+
+        pool_mask = pool_mask.astype(out_of_layer2.dtype)
+        s = paddle.sum(
+            out_of_layer2 * pool_mask.unsqueeze(-1),
+            axis=1,
+            dtype=str(self.cross_attend_blocks_1_fn_net_2.weight.dtype).split(".")[-1],
+        )
+        d = paddle.sum(
+            pool_mask, axis=1, keepdim=True, dtype=str(self.cross_attend_blocks_1_fn_net_2.weight.dtype).split(".")[-1]
+        )
+        hiddens = s / d
+        hiddens = paddle.nn.functional.normalize(hiddens, p=2, axis=-1)
+
+        return hiddens
+
+
+class NVEncodeModel(MistralModel):
+    def __init__(
+        self,
+        config,
+        tokenizer_path,
+        query_instruction,
+        document_instruction,
+        eval_batch_size=999,
+        normalized=True,
+        negatives_cross_device=False,
+        temperature_=1,
+        margin=0.01,
+        use_inbatch_neg=True,
+        matryoshka_dims=None,
+        matryoshka_loss_weights=None,
+    ):
+        super().__init__(config)  # get mistral model structure
+
+        self.latent_model = LatentModel(config=config)  # get latent model structure
+
+        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, padding_side="right")
+        if self.tokenizer.pad_token is None:
+            self.tokenizer.pad_token = self.tokenizer.eos_token
+
+        self.query_instruction = query_instruction
+        self.document_instruction = document_instruction
+
+        self.eval_batch_size = eval_batch_size
+
+        self.normalized = normalized
+        self.negatives_cross_device = negatives_cross_device
+        if self.negatives_cross_device:
+            if not dist.is_initialized():
+                raise ValueError("Distributed training has not been initialized for representation all gather.")
+            self.process_rank = dist.get_rank()
+            self.world_size = dist.get_world_size()
+        self.temperature = temperature_
+        self.margin = margin
+        self.use_inbatch_neg = use_inbatch_neg
+        self.matryoshka_dims = matryoshka_dims
+        self.matryoshka_loss_weights = matryoshka_loss_weights
+
+        self.cross_entropy = nn.CrossEntropyLoss(reduction="mean")
+
+    def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
+
+        combined_attention_mask = _make_bidirection_mask(
+            input_shape,
+            inputs_embeds.dtype,
+            past_key_values_length=past_key_values_length,
+        )
+
+        if attention_mask is not None:
+            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+            expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1])
+            combined_attention_mask = (
+                expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
+            )
+
+        return combined_attention_mask
+
+    def get_model_config(
+        self,
+    ):
+        return self.model_config.to_dict()
+
+    def encode(self, features, instruction_len):
+        last_hidden_states = self.m_forward(**features)[0]  # get bs*len*4096
+        pool_mask = features["attention_mask"]
+        pool_mask[:, :instruction_len] = 0
+        embeddings = self.latent_model.forward(last_hidden_states, pool_mask)
+        embeddings = paddle.nn.functional.normalize(embeddings, p=2, axis=1)
+        return embeddings
+
+    def compute_similarity(self, q_reps, p_reps):
+        # q_reps [batch_size, embedding_dim]
+        # p_reps [batch_size, embedding_dim]
+        return paddle.matmul(q_reps, p_reps.transpose([1, 0]))
+
+    def hard_negative_loss(self, q_reps, p_reps):
+        scores = self.compute_similarity(q_reps, p_reps)
+        scores = scores / self.temperature
+        scores = scores.reshape([q_reps.shape[0], -1])
+
+        target = paddle.arange(scores.shape[0], dtype="int64")
+        target = target * (p_reps.shape[0] // q_reps.shape[0])
+        loss = self.compute_loss(scores, target)
+        return scores, loss
+
+    def in_batch_negative_loss(self, q_reps, p_reps):
+        # In batch negatives
+        scores = self.compute_similarity(q_reps, p_reps)
+        # Substract margin from all positive samples cosine_sim()
+        margin_diag = paddle.full(shape=[q_reps.shape[0]], fill_value=self.margin, dtype=q_reps.dtype)
+        scores = scores - paddle.diag(margin_diag)
+        # Scale cosine to ease training converge
+        scores = scores / self.temperature
+        target = paddle.arange(0, q_reps.shape[0], dtype="int64")
+        loss = self.compute_loss(scores, target)
+        return scores, loss
+
+    def forward(
+        self,
+        query: Dict[str, paddle.Tensor] = None,
+        passage: Dict[str, paddle.Tensor] = None,
+        teacher_score: paddle.Tensor = None,
+    ):
+        instruction_len = len(self.tokenizer.encode(self.query_instruction, add_special_tokens=False)["input_ids"])
+        q_reps = self.encode(query, instruction_len)
+        instruction_len = len(self.tokenizer.encode(self.document_instruction, add_special_tokens=False)["input_ids"])
+        p_reps = self.encode(passage, instruction_len)
+
+        # For non-matryoshka loss, we normalize the representations
+        if not self.matryoshka_dims:
+            if self.normalized:
+                q_reps = paddle.nn.functional.normalize(q_reps, axis=-1)
+                p_reps = paddle.nn.functional.normalize(p_reps, axis=-1)
+
+        if self.training:
+            # Cross device negatives
+            if self.negatives_cross_device:
+                q_reps = self._dist_gather_tensor(q_reps)
+                p_reps = self._dist_gather_tensor(p_reps)
+
+            if self.matryoshka_dims:
+                loss = 0.0
+                scores = 0.0
+                for loss_weight, dim in zip(self.matryoshka_loss_weights, self.matryoshka_dims):
+                    reduced_q = q_reps[:, :dim]
+                    reduced_d = p_reps[:, :dim]
+                    if self.normalized:
+                        reduced_q = paddle.nn.functional.normalize(reduced_q, axis=-1)
+                        reduced_d = paddle.nn.functional.normalize(reduced_d, axis=-1)
+
+                    if self.use_inbatch_neg:
+                        dim_score, dim_loss = self.in_batch_negative_loss(reduced_q, reduced_d)
+                    else:
+                        dim_score, dim_loss = self.hard_negative_loss(reduced_q, reduced_d)
+                    scores += dim_score
+                    loss += loss_weight * dim_loss
+
+            elif self.use_inbatch_neg:
+                scores, loss = self.in_batch_negative_loss(q_reps, p_reps)
+            else:
+                scores, loss = self.hard_negative_loss(q_reps, p_reps)
+
+        else:
+            scores = self.compute_similarity(q_reps, p_reps)
+            loss = None
+        return EncoderOutput(
+            loss=loss,
+            scores=scores,
+            q_reps=q_reps,
+            p_reps=p_reps,
+        )
+
+    def compute_loss(self, scores, target):
+        return self.cross_entropy(scores, target)
+
+    def _dist_gather_tensor(self, t: Optional[paddle.Tensor]):
+        if t is None:
+            return None
+
+        all_tensors = [paddle.empty_like(t) for _ in range(self.world_size)]
+        dist.all_gather(all_tensors, t)
+
+        all_tensors[self.process_rank] = t
+        all_tensors = paddle.concat(all_tensors, axis=0)
+
+        return all_tensors
+
+    def save_pretrained(self, output_dir: str, **kwargs):
+        state_dict = self.model.state_dict()
+        state_dict = type(state_dict)({k: v.clone().cpu() for k, v in state_dict.items()})
+        self.model.save_pretrained(output_dir, state_dict=state_dict)
+
+    def m_forward(
+        self,
+        input_ids: paddle.Tensor = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        position_ids: Optional[paddle.Tensor] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
+        inputs_embeds: Optional[paddle.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # retrieve input_ids and inputs_embeds
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
+        elif input_ids is not None:
+            batch_size, seq_length = input_ids.shape
+        elif inputs_embeds is not None:
+            batch_size, seq_length, _ = inputs_embeds.shape
+        else:
+            raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
+
+        seq_length_with_past = seq_length
+        past_key_values_length = 0
+
+        if past_key_values is not None:
+            past_key_values_length = past_key_values[0][0].shape[2]
+            seq_length_with_past = seq_length_with_past + past_key_values_length
+
+        if position_ids is None:
+            position_ids = paddle.arange(
+                past_key_values_length, seq_length + past_key_values_length, dtype=paddle.int64
+            )
+            position_ids = position_ids.unsqueeze(0).expand((batch_size, seq_length))
+        else:
+            position_ids = position_ids.reshape([-1, seq_length]).astype("int64")
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+
+        attention_mask = self._prepare_decoder_attention_mask(
+            attention_mask,
+            (batch_size, seq_length),
+            inputs_embeds,
+            past_key_values_length,
+        )
+
+        hidden_states = inputs_embeds
+
+        if self.enable_recompute and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        next_decoder_cache = () if use_cache else None
+
+        for idx, decoder_layer in enumerate(self.layers):
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+            past_key_value = past_key_values[idx] if past_key_values is not None else None
+
+            has_gradient = not hidden_states.stop_gradient
+            if self.enable_recompute and has_gradient:
+
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        # None for past_key_value
+                        return module(*inputs, past_key_value, output_attentions)
+
+                    return custom_forward
+
+                layer_outputs = recompute(
+                    create_custom_forward(decoder_layer),
+                    hidden_states,
+                    attention_mask,
+                    position_ids,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=attention_mask,
+                    position_ids=position_ids,
+                    past_key_value=past_key_value,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                )
+
+            hidden_states = layer_outputs[0]
+
+            if use_cache:
+                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
+
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+
+        hidden_states = self.norm(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        next_cache = next_decoder_cache if use_cache else None
+        if not return_dict:
+            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=next_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+
+    @paddle.no_grad()
+    def encode_sentences(self, sentences: List[str], instruction_len, **kwargs) -> np.ndarray:
+        all_embeddings = []
+        for start_index in tqdm.tqdm(list(range(0, len(sentences), self.eval_batch_size)), desc="Batches"):
+
+            sentences_batch = sentences[start_index : start_index + self.eval_batch_size]
+            inputs = self.tokenizer(
+                sentences_batch,
+                max_length=4096,
+                padding=True,
+                return_attention_mask=True,
+                return_token_type_ids=False,
+                return_tensors="pd",
+                truncation=True,
+            )
+            last_hidden_states = self.m_forward(**inputs)[0]  # get bs*len*4096
+            pool_mask = inputs["attention_mask"]
+            pool_mask[:, :instruction_len] = 0
+
+            embeddings = self.latent_model.forward(last_hidden_states, pool_mask)
+            embeddings = paddle.nn.functional.normalize(embeddings, p=2, axis=1)
+
+            all_embeddings.append(embeddings.cpu().numpy().astype("float32"))
+
+        return np.concatenate(all_embeddings, axis=0)
+
+    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
+        input_texts = [self.query_instruction + q + self.tokenizer.eos_token for q in queries]
+        instruction_len = len(self.tokenizer.encode(self.query_instruction, add_special_tokens=False)["input_ids"])
+        return self.encode_sentences(input_texts, instruction_len)
+
+    def encode_corpus(self, corpus: List[Union[Dict[str, str], str]], **kwargs) -> np.ndarray:
+        if isinstance(corpus[0], dict):
+            input_texts = ["{} {}".format(doc.get("title", ""), doc["text"]).strip() for doc in corpus]
+        else:
+            input_texts = corpus
+
+        input_texts = [self.document_instruction + doc + self.tokenizer.eos_token for doc in input_texts]
+        instruction_len = len(self.tokenizer.encode(self.document_instruction, add_special_tokens=False)["input_ids"])
+        return self.encode_sentences(input_texts, instruction_len)
diff --git a/paddlenlp/transformers/qwen/modeling.py b/paddlenlp/transformers/qwen/modeling.py
index 2f465e9c3d8c..6f44737bc45a 100755
--- a/paddlenlp/transformers/qwen/modeling.py
+++ b/paddlenlp/transformers/qwen/modeling.py
@@ -59,7 +59,7 @@ def swiglu(x, y=None):
 from .. import linear_utils
 from ..linear_utils import Linear
 from ..model_outputs import ModelOutput
-from ..utils import caculate_llm_flops
+from ..utils import caculate_llm_per_token_flops
 from .configuration import QWenConfig
 
 try:
@@ -281,7 +281,15 @@ def _attn(self, query, key, value, attention_mask=None):
             # [bz, sql, nh, hid] ==> [bz, nh, sql hdim]
             value = value.transpose([0, 2, 1, 3])
 
-            attn_weights = paddle.matmul(query / math.sqrt(head_dim), key.transpose([0, 1, 3, 2]))
+            # Add pre divided factor to fix nan under float16.
+            if paddle.in_dynamic_mode() and query.dtype == paddle.float16:
+                pre_divided_factor = 32
+            else:
+                pre_divided_factor = 1
+
+            attn_weights = paddle.matmul(
+                query / (math.sqrt(head_dim) * pre_divided_factor), key.transpose([0, 1, 3, 2])
+            )
 
             if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]:
                 raise ValueError(
@@ -292,7 +300,7 @@ def _attn(self, query, key, value, attention_mask=None):
             if attention_mask is None:
                 attention_mask = get_triangle_upper_mask(attn_weights)
             attn_weights = attn_weights + attention_mask
-            attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(value.dtype)
+            attn_weights = F.softmax(attn_weights.astype("float32") * pre_divided_factor, axis=-1).astype(value.dtype)
 
             attn_weights = self.attn_dropout(attn_weights)
             attn_output = paddle.matmul(attn_weights, value)
@@ -555,6 +563,37 @@ class QWenPretrainedModel(PretrainedModel):
     def __init__(self, *inputs, **kwargs):
         super().__init__(*inputs, **kwargs)
 
+    def _get_model_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=False,
+        )
+
+    def _get_hardware_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=self.config.recompute,
+            recompute_granularity=self.config.recompute_granularity,
+        )
+
     @classmethod
     def _get_tensor_parallel_mappings(cls, config, is_split=True):
 
@@ -744,39 +783,6 @@ def __init__(self, config):
         )
         self.ln_f = QWenRMSNorm(config)
 
-    def get_model_flops(self, batch_size=1, seq_length=None, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=False,
-        )
-
-    def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=recompute,
-            recompute_granularity=self.config.recompute_granularity,
-        )
-
     def get_input_embeddings(self):
         return self.wte
 
@@ -1167,26 +1173,38 @@ def forward(
         )
         hidden_states = transformer_outputs[0]
 
-        lm_logits = self.lm_head(hidden_states)
+        if labels is not None and self.config.use_fused_linear_cross_entropy:
+            from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy
+
+            assert (
+                self.config.tensor_parallel_degree <= 1
+            ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel "
 
-        loss = None
-        if labels is not None:
-            loss = self.criterion(lm_logits, labels)
+            masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels)
 
-        # lm_logits = self.lm_head(hidden_states)
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
+            logits = None
+        else:
+            logits = self.lm_head(hidden_states)
 
-        # loss = None
-        # if labels is not None:
-        #     loss_fct = nn.CrossEntropyLoss()
-        #     loss = loss_fct(lm_logits, labels)
+            loss = None
+            if labels is not None:
+                loss = self.criterion(logits, labels)
 
         if not return_dict:
-            output = (lm_logits,) + transformer_outputs[1:]
+            output = (logits,) + transformer_outputs[1:]
             return ((loss,) + output) if loss is not None else output
 
         return CausalLMOutputWithPast(
             loss=loss,
-            logits=lm_logits,
+            logits=logits,
             past_key_values=transformer_outputs.past_key_values,
             hidden_states=transformer_outputs.hidden_states,
             attentions=transformer_outputs.attentions,
diff --git a/paddlenlp/transformers/qwen/modeling_pp.py b/paddlenlp/transformers/qwen/modeling_pp.py
index 0f3d285ce465..613f197f825b 100644
--- a/paddlenlp/transformers/qwen/modeling_pp.py
+++ b/paddlenlp/transformers/qwen/modeling_pp.py
@@ -143,6 +143,8 @@ class QWenForCausalLMPipe(PipelinePretrainedModel, PipelineLayer):
     _get_tensor_parallel_mappings = QWenPretrainedModel._get_tensor_parallel_mappings
     _init_weights = QWenPretrainedModel._init_weights
     _keys_to_ignore_on_load_unexpected = QWenPretrainedModel._keys_to_ignore_on_load_unexpected
+    _get_model_flops = QWenPretrainedModel._get_model_flops
+    _get_hardware_flops = QWenPretrainedModel._get_hardware_flops
 
     # DONOT Add base_model_prefix !!!!
 
diff --git a/paddlenlp/transformers/qwen2/modeling.py b/paddlenlp/transformers/qwen2/modeling.py
index 9dcc11c9786a..71a1d2abf321 100644
--- a/paddlenlp/transformers/qwen2/modeling.py
+++ b/paddlenlp/transformers/qwen2/modeling.py
@@ -59,7 +59,7 @@
     TokenClassifierOutput,
 )
 from ..model_utils import PretrainedModel, register_base_model
-from ..utils import caculate_llm_flops, logger
+from ..utils import caculate_llm_per_token_flops, logger
 from .configuration import Qwen2Config
 
 try:
@@ -202,8 +202,15 @@ def scaled_dot_product_attention(
         key_states = paddle.transpose(key_states, [0, 2, 1, 3])
         value_states = paddle.transpose(value_states, [0, 2, 1, 3])
 
-        # matmul and divide by sqrt(head_dim)
-        attn_weights = paddle.matmul(query_states / math.sqrt(head_dim), key_states.transpose([0, 1, 3, 2]))
+        # Add pre divided factor to fix nan under float16.
+        if paddle.in_dynamic_mode() and query_states.dtype == paddle.float16:
+            pre_divided_factor = 32
+        else:
+            pre_divided_factor = 1
+
+        attn_weights = paddle.matmul(
+            query_states / (math.sqrt(head_dim) * pre_divided_factor), key_states.transpose([0, 1, 3, 2])
+        )
 
         if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]:
             raise ValueError(
@@ -213,6 +220,7 @@ def scaled_dot_product_attention(
 
         if attention_mask is None:
             attention_mask = get_triangle_upper_mask(attn_weights)
+
         attention_mask = attention_mask.reshape([bsz, 1, q_len, kv_seq_len])
         if attention_mask.shape != [bsz, 1, q_len, kv_seq_len]:
             raise ValueError(
@@ -220,11 +228,16 @@ def scaled_dot_product_attention(
             )
 
         attn_weights = attn_weights + attention_mask
+
         if not paddle.in_dynamic_mode():
-            attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype)
+            attn_weights = F.softmax(attn_weights * pre_divided_factor, axis=-1, dtype="float32").astype(
+                query_states.dtype
+            )
         else:
             with paddle.amp.auto_cast(False):
-                attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype)
+                attn_weights = F.softmax(
+                    attn_weights.astype("float32") * pre_divided_factor, axis=-1, dtype="float32"
+                ).astype(query_states.dtype)
 
         attn_weights = F.dropout(attn_weights, p=config.attention_dropout, training=training)
 
@@ -1013,6 +1026,37 @@ def _get_fuse_or_split_param_mappings(cls, config: Qwen2Config, is_fuse=False):
                     final_actions[keys] = partial(fn, split_nums=2)
         return final_actions
 
+    def _get_model_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=False,
+        )
+
+    def _get_hardware_flops(self):
+        if hasattr(self.config, "seq_length"):
+            seq_length = self.config.seq_length
+        else:
+            seq_length = 2048
+
+        return caculate_llm_per_token_flops(
+            hidden_size=self.config.hidden_size,
+            intermediate_size=self.config.intermediate_size,
+            layer_num=self.config.num_hidden_layers,
+            vocab_size=self.config.vocab_size,
+            seq_length=seq_length,
+            recompute=self.config.recompute,
+            recompute_granularity=self.config.recompute_granularity,
+        )
+
     def _init_weights(self, layer):
         """Initialization hook"""
         if self.config.tensor_parallel_degree > 1:
@@ -1113,39 +1157,6 @@ def __init__(self, config: Qwen2Config):
         )
         self.norm = Qwen2RMSNorm(config)
 
-    def get_model_flops(self, batch_size=1, seq_length=None, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=False,
-        )
-
-    def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs):
-        if seq_length is None:
-            if hasattr(self.config, "seq_length"):
-                seq_length = self.config.seq_length
-            else:
-                seq_length = 2048
-
-        return caculate_llm_flops(
-            hidden_size=self.config.hidden_size,
-            intermediate_size=self.config.intermediate_size,
-            layer_num=self.config.num_hidden_layers,
-            vocab_size=self.config.vocab_size,
-            seq_length=seq_length,
-            recompute=recompute,
-            recompute_granularity=self.config.recompute_granularity,
-        )
-
     def get_input_embeddings(self):
         return self.embed_tokens
 
@@ -1623,11 +1634,30 @@ def forward(
         # tensor_parallel_output is together with ParallelCrossEntropy
         tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1
 
-        logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+        if labels is not None and self.config.use_fused_linear_cross_entropy:
+            from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy
 
-        loss = None
-        if labels is not None:
-            loss = self.criterion(logits, labels)
+            assert (
+                self.config.tensor_parallel_degree <= 1
+            ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel "
+
+            masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels)
+
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
+            logits = None
+        else:
+            logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+
+            loss = None
+            if labels is not None:
+                loss = self.criterion(logits, labels)
 
         if not return_dict:
             output = (logits,) + outputs[1:]
diff --git a/paddlenlp/transformers/qwen2/modeling_pp.py b/paddlenlp/transformers/qwen2/modeling_pp.py
index bab8c25e7965..a60a4db257ad 100644
--- a/paddlenlp/transformers/qwen2/modeling_pp.py
+++ b/paddlenlp/transformers/qwen2/modeling_pp.py
@@ -234,6 +234,9 @@ class Qwen2ForCausalLMPipe(PipelinePretrainedModel, PipelineLayer):
     _get_tensor_parallel_mappings = Qwen2PretrainedModel._get_tensor_parallel_mappings
     _init_weights = Qwen2PretrainedModel._init_weights
     _keys_to_ignore_on_load_unexpected = Qwen2PretrainedModel._keys_to_ignore_on_load_unexpected
+    _get_model_flops = Qwen2PretrainedModel._get_model_flops
+    _get_hardware_flops = Qwen2PretrainedModel._get_hardware_flops
+
     _tied_weights_keys = ["lm_head.weight"]
 
     # DONOT Add base_model_prefix !!!!
diff --git a/paddlenlp/transformers/qwen2_moe/__init__.py b/paddlenlp/transformers/qwen2_moe/__init__.py
index 2f2acfa9b339..d68171b98ec8 100644
--- a/paddlenlp/transformers/qwen2_moe/__init__.py
+++ b/paddlenlp/transformers/qwen2_moe/__init__.py
@@ -15,3 +15,4 @@
 from ..qwen2.tokenizer import *
 from .configuration import *
 from .modeling import *
+from .modeling_pp import *
diff --git a/paddlenlp/transformers/qwen2_moe/modeling.py b/paddlenlp/transformers/qwen2_moe/modeling.py
index 732635966770..76d84d26f0cb 100644
--- a/paddlenlp/transformers/qwen2_moe/modeling.py
+++ b/paddlenlp/transformers/qwen2_moe/modeling.py
@@ -12,13 +12,14 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Paddle Qwen2Moe model."""
+"""Paddle Qwen2Moe model."""
+
 from __future__ import annotations
 
 import math
 import warnings
 from functools import partial
-from typing import Optional, Tuple
+from typing import List, Optional, Tuple, Union
 
 import paddle
 import paddle.distributed.fleet.meta_parallel as mpu
@@ -28,14 +29,19 @@
 from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker
 from paddle.distributed.fleet.utils import recompute
 
-from ...utils.log import logger
+from paddlenlp.utils.tools import get_env_device
+
 from .. import linear_utils
 from ..activations import ACT2FN
 from ..conversion_utils import StateDictNameMapping, init_name_mappings
+from ..linear_utils import Linear
+from ..llama import fusion_ops
+from ..llama.modeling import get_use_casual_mask
 from ..model_outputs import MoECausalLMOutputWithPast, MoEModelOutputWithPast
 from ..model_utils import PretrainedModel, register_base_model
 from ..moe_gate import PretrainedMoEGate
 from ..moe_layer import MoELayer
+from ..utils import logger
 from .configuration import Qwen2MoeConfig
 
 try:
@@ -219,8 +225,10 @@ def scaled_dot_product_attention(
     value_states,
     attention_mask,
     output_attentions,
+    attn_mask_startend_row_indices=None,
     training=True,
     sequence_parallel=False,
+    skip_recompute=False,
 ):
     bsz, q_len, num_heads, head_dim = query_states.shape
     _, kv_seq_len, _, _ = value_states.shape
@@ -229,40 +237,25 @@ def scaled_dot_product_attention(
         # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim]
         # Torch Flash Attention input [ bz, nhead, seqlen, head_dim]
 
-        version = paddle.version.full_version
-        if version != "0.0.0" and version <= "2.5.2":
-            attn_output, attn_weights = flash_attention(
-                query_states,
-                key_states,
-                value_states,
-                causal=True,
-                return_softmax=output_attentions,
-            )
-        else:
-            attn_output = F.scaled_dot_product_attention(
-                query_states,
-                key_states,
-                value_states,
-                attn_mask=attention_mask,
-                is_causal=attention_mask is None,
-                dropout_p=config.attention_dropout if training else 0.0,
-                training=training,
-            )
-            attn_weights = None
-
-        if sequence_parallel:
-            attn_output = attn_output.reshape([bsz * q_len, head_dim * num_heads])
-        else:
-            attn_output = attn_output.reshape([bsz, q_len, head_dim * num_heads])
-        return (attn_output, attn_weights) if output_attentions else attn_output
+        return fusion_ops.fusion_flash_attention(
+            query_states,
+            config,
+            key_states,
+            value_states,
+            attention_mask,
+            output_attentions,
+            attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+            sequence_parallel=sequence_parallel,
+            skip_recompute=skip_recompute,
+        )
     else:
         #  [ bz, seqlen, nhead, head_dim] -> [bs, nhead, seq_len, head_dim]
         query_states = paddle.transpose(query_states, [0, 2, 1, 3])
-        # merge with the next tranpose
+        # merge with the next transpose
         key_states = paddle.transpose(key_states, [0, 2, 1, 3])
         value_states = paddle.transpose(value_states, [0, 2, 1, 3])
 
-        # matmul and devide by sqrt(head_dim)
+        # matmul and divide by sqrt(head_dim)
         attn_weights = paddle.matmul(query_states / math.sqrt(head_dim), key_states.transpose([0, 1, 3, 2]))
 
         if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]:
@@ -356,14 +349,15 @@ def __init__(self, config: Qwen2MoeConfig):
             mark_as_sequence_parallel_parameter(self.weight)
 
     def forward(self, hidden_states):
+        if self.config.use_fused_rms_norm:
+            return fusion_ops.fusion_rms_norm(hidden_states, self.weight, self.variance_epsilon, False)
+
         if paddle.in_dynamic_mode():
             with paddle.amp.auto_cast(False):
-                hidden_states = hidden_states.astype("float32")
-                variance = hidden_states.pow(2).mean(-1, keepdim=True)
+                variance = hidden_states.astype("float32").pow(2).mean(-1, keepdim=True)
                 hidden_states = paddle.rsqrt(variance + self.variance_epsilon) * hidden_states
         else:
-            hidden_states = hidden_states.astype("float32")
-            variance = hidden_states.pow(2).mean(-1, keepdim=True)
+            variance = hidden_states.astype("float32").pow(2).mean(-1, keepdim=True)
             hidden_states = paddle.rsqrt(variance + self.variance_epsilon) * hidden_states
 
         if self.weight.dtype in [paddle.float16, paddle.bfloat16]:
@@ -436,6 +430,8 @@ def __init__(self, config: Qwen2MoeConfig, is_shared=False):
         self.intermediate_size = (
             config.moe_intermediate_size if not is_shared else config.shared_expert_intermediate_size
         )
+        self.fuse_attention_ffn = config.fuse_attention_ffn
+
         self.tensor_parallel_degree = config.tensor_parallel_degree
 
         if config.sequence_parallel:
@@ -446,18 +442,26 @@ def __init__(self, config: Qwen2MoeConfig, is_shared=False):
             RowParallelLinear = linear_utils.RowParallelLinear
 
         if config.tensor_parallel_degree > 1:
-            self.gate_proj = ColumnParallelLinear(
-                self.hidden_size,
-                self.intermediate_size,
-                gather_output=False,
-                has_bias=False,
-            )
-            self.up_proj = ColumnParallelLinear(
-                self.hidden_size,
-                self.intermediate_size,
-                gather_output=False,
-                has_bias=False,
-            )
+            if self.fuse_attention_ffn:
+                self.gate_up_fused_proj = ColumnParallelLinear(
+                    self.hidden_size,
+                    self.intermediate_size * 2,
+                    gather_output=False,
+                    has_bias=False,
+                )
+            else:
+                self.gate_proj = ColumnParallelLinear(
+                    self.hidden_size,
+                    self.intermediate_size,
+                    gather_output=False,
+                    has_bias=False,
+                )
+                self.up_proj = ColumnParallelLinear(
+                    self.hidden_size,
+                    self.intermediate_size,
+                    gather_output=False,
+                    has_bias=False,
+                )
             self.down_proj = RowParallelLinear(
                 self.intermediate_size,
                 self.hidden_size,
@@ -465,14 +469,36 @@ def __init__(self, config: Qwen2MoeConfig, is_shared=False):
                 has_bias=False,
             )
         else:
-            self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=False)  # w1
-            self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=False)  # w3
-            self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias_attr=False)  # w2
+            if self.fuse_attention_ffn:
+                self.gate_up_fused_proj = Linear(self.hidden_size, self.intermediate_size * 2, bias_attr=False)
+            else:
+                self.gate_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False)  # w1
+                self.up_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False)  # w3
+            self.down_proj = Linear(self.intermediate_size, self.hidden_size, bias_attr=False)  # w2
 
-        self.act_fn = ACT2FN[config.hidden_act]
+        if config.hidden_act == "silu":
+            self.act_fn = fusion_ops.swiglu
+            self.fuse_swiglu = True
+        else:
+            self.act_fn = ACT2FN[config.hidden_act]
+            self.fuse_swiglu = False
 
     def forward(self, x):
-        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        if self.fuse_attention_ffn:
+            x = self.gate_up_fused_proj(x)
+            if self.fuse_swiglu:
+                y = None
+            else:
+                x, y = x.chunk(2, axis=-1)
+        else:
+            x, y = self.gate_proj(x), self.up_proj(x)
+
+        if self.fuse_swiglu:
+            x = self.act_fn(x, y)
+        else:
+            x = self.act_fn(x) * y
+
+        return self.down_proj(x)
 
 
 def repeat_kv(hidden_states: paddle.Tensor, n_rep: int) -> paddle.Tensor:
@@ -515,6 +541,8 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True):
         self.seq_length = config.seq_length
         self.sequence_parallel = config.sequence_parallel
 
+        self.fuse_attention_qkv = config.fuse_attention_qkv
+
         # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True
         # Enable_recompute defaults to False and is controlled by Trainer
         self.enable_recompute = False
@@ -533,7 +561,7 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True):
 
         self.use_fused_rope = config.use_fused_rope
         if self.use_fused_rope:
-            if "gpu" not in paddle.device.get_device() or fused_rotary_position_embedding is None:
+            if get_env_device() not in ["gpu", "xpu"] or fused_rotary_position_embedding is None:
                 warnings.warn(
                     "Enable fuse rope in the config, but fuse rope is not available. "
                     "Will disable fuse rope. Try using latest gpu version of Paddle."
@@ -548,19 +576,30 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True):
             RowParallelLinear = linear_utils.RowParallelLinear
 
         if config.tensor_parallel_degree > 1:
-            self.q_proj = ColumnParallelLinear(self.hidden_size, self.hidden_size, has_bias=True, gather_output=False)
-            self.k_proj = ColumnParallelLinear(
-                self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False
-            )
-            self.v_proj = ColumnParallelLinear(
-                self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False
-            )
+            if self.fuse_attention_qkv:
+                self.qkv_proj = ColumnParallelLinear(
+                    self.hidden_size,
+                    self.hidden_size + 2 * self.config.num_key_value_heads * self.head_dim,
+                    has_bias=True,
+                    gather_output=False,
+                )
+            else:
+                self.q_proj = ColumnParallelLinear(
+                    self.hidden_size, self.hidden_size, has_bias=True, gather_output=False
+                )
+                self.k_proj = ColumnParallelLinear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False)  # fmt:skip
+                self.v_proj = ColumnParallelLinear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False)  # fmt:skip
             self.o_proj = RowParallelLinear(self.hidden_size, self.hidden_size, has_bias=False, input_is_parallel=True)
         else:
-            self.q_proj = nn.Linear(self.hidden_size, self.hidden_size, bias_attr=True)
-            self.k_proj = nn.Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True)
-            self.v_proj = nn.Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True)
-            self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias_attr=False)
+            if self.fuse_attention_qkv:
+                self.qkv_proj = Linear(
+                    self.hidden_size, self.hidden_size + 2 * self.config.num_key_value_heads * self.head_dim
+                )
+            else:
+                self.q_proj = Linear(self.hidden_size, self.hidden_size, bias_attr=True)
+                self.k_proj = Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True)
+                self.v_proj = Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True)
+            self.o_proj = Linear(self.hidden_size, self.hidden_size, bias_attr=False)
 
         self.rotary_emb = Qwen2MoeRotaryEmbedding(
             self.head_dim,
@@ -568,6 +607,8 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True):
             base=self.rope_theta,
         )
 
+        self.attn_func = scaled_dot_product_attention
+
     def forward(
         self,
         hidden_states,
@@ -576,26 +617,45 @@ def forward(
         attention_mask: Optional[paddle.Tensor] = None,
         output_attentions: bool = False,
         use_cache: bool = False,
+        attn_mask_startend_row_indices: Optional[paddle.Tensor] = None,
         **kwargs,
     ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]:
         """Input shape: Batch x Time x Channel"""
         # [bs, seq_len, num_head * head_dim] -> [seq_len / n, bs, num_head * head_dim] (n is model parallelism)
 
-        batch_size, seq_len, _ = hidden_states.shape
-
-        query_states = self.q_proj(hidden_states)
-        key_states = self.k_proj(hidden_states)
-        value_states = self.v_proj(hidden_states)
-
-        if self.sequence_parallel:
-            target_query_shape = [-1, self.seq_length, self.num_heads, self.head_dim]
-            target_key_value_shape = [-1, self.seq_length, self.num_key_value_heads, self.head_dim]
+        if self.fuse_attention_qkv:
+            mix_layer = self.qkv_proj(hidden_states)
+            if self.sequence_parallel:
+                target_shape = [
+                    -1,
+                    self.seq_length,
+                    self.num_key_value_heads,
+                    (self.num_key_value_groups + 2) * self.head_dim,
+                ]
+            else:
+                target_shape = [0, 0, self.num_key_value_heads, (self.num_key_value_groups + 2) * self.head_dim]
+            mix_layer = paddle.reshape_(mix_layer, target_shape)
+            query_states, key_states, value_states = paddle.split(
+                mix_layer,
+                num_or_sections=[self.num_key_value_groups * self.head_dim, self.head_dim, self.head_dim],
+                axis=-1,
+            )
+            if self.gqa_or_mqa:
+                query_states = paddle.reshape_(query_states, [0, 0, self.num_heads, self.head_dim])
         else:
-            target_query_shape = [0, 0, self.num_heads, self.head_dim]
-            target_key_value_shape = [0, 0, self.num_key_value_heads, self.head_dim]
-        query_states = query_states.reshape(shape=target_query_shape)
-        key_states = key_states.reshape(shape=target_key_value_shape)
-        value_states = value_states.reshape(shape=target_key_value_shape)
+            query_states = self.q_proj(hidden_states)
+            key_states = self.k_proj(hidden_states)
+            value_states = self.v_proj(hidden_states)
+
+            if self.sequence_parallel:
+                target_query_shape = [-1, self.seq_length, self.num_heads, self.head_dim]
+                target_key_value_shape = [-1, self.seq_length, self.num_key_value_heads, self.head_dim]
+            else:
+                target_query_shape = [0, 0, self.num_heads, self.head_dim]
+                target_key_value_shape = [0, 0, self.num_key_value_heads, self.head_dim]
+            query_states = query_states.reshape(shape=target_query_shape)
+            key_states = key_states.reshape(shape=target_key_value_shape)
+            value_states = value_states.reshape(shape=target_key_value_shape)
 
         kv_seq_len = key_states.shape[-3]
 
@@ -626,8 +686,10 @@ def forward(
 
         # TODO(wj-Mcat): use broadcast strategy when n_kv_heads = 1
         # repeat k/v heads if n_kv_heads < n_heads
-        key_states = repeat_kv(key_states, self.num_key_value_groups)
-        value_states = repeat_kv(value_states, self.num_key_value_groups)
+        paddle_version = float(paddle.__version__[:3])
+        if not self.config.use_flash_attention or ((paddle_version != 0.0) and (paddle_version <= 2.6)):
+            key_states = repeat_kv(key_states, self.num_key_value_groups)
+            value_states = repeat_kv(value_states, self.num_key_value_groups)
 
         has_gradient = not (query_states.stop_gradient and key_states.stop_gradient and value_states.stop_gradient)
         if (
@@ -637,27 +699,29 @@ def forward(
             and self.recompute_granularity == "core_attn"
         ):
             outputs = recompute(
-                scaled_dot_product_attention,
+                self.attn_func,
                 query_states,
                 self.config,
                 key_states,
                 value_states,
                 attention_mask,
                 output_attentions,
-                self.training,
-                self.sequence_parallel,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                training=self.training,
+                sequence_parallel=self.sequence_parallel,
                 use_reentrant=self.config.recompute_use_reentrant,
             )
         else:
-            outputs = scaled_dot_product_attention(
+            outputs = self.attn_func(
                 query_states,
                 self.config,
                 key_states,
                 value_states,
                 attention_mask,
                 output_attentions,
-                self.training,
-                self.sequence_parallel,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                training=self.training,
+                sequence_parallel=self.sequence_parallel,
             )
         if output_attentions:
             attn_output, attn_weights = outputs
@@ -729,7 +793,7 @@ def __init__(self, config: Qwen2MoeConfig):
             config,
             moe_num_experts=config.num_experts,
             expert_class=Qwen2MoeMLP,
-            expert_kwargs=config,
+            expert_kwargs={"config": config},
             gate=gate,
             capacity=2.0,
         )
@@ -776,12 +840,13 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = False):
     def forward(
         self,
         hidden_states: paddle.Tensor,
-        position_ids: Optional[Tuple[paddle.Tensor]] = None,
+        position_ids: Optional[paddle.Tensor] = None,
         attention_mask: Optional[paddle.Tensor] = None,
         output_attentions: Optional[bool] = False,
         output_router_logits: Optional[bool] = False,
         past_key_value: Optional[Tuple[paddle.Tensor]] = None,
         use_cache: Optional[bool] = False,
+        attn_mask_startend_row_indices: Optional[paddle.Tensor] = None,
         **kwargs,
     ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]:
         """
@@ -822,6 +887,7 @@ def forward(
                 attention_mask,
                 output_attentions,
                 use_cache,
+                attn_mask_startend_row_indices,
                 use_reentrant=self.config.recompute_use_reentrant,
             )
         else:
@@ -832,6 +898,7 @@ def forward(
                 attention_mask,
                 output_attentions,
                 use_cache,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
             )
 
         if type(outputs) is tuple:
@@ -999,6 +1066,66 @@ def get_tensor_parallel_split_mappings(num_layers, num_experts):
 
         return mappings
 
+    @classmethod
+    def _get_fuse_or_split_param_mappings(cls, config: Qwen2MoeConfig, is_fuse=False):
+        # return parameter fuse utils
+        from paddlenlp.transformers.conversion_utils import split_or_fuse_func
+
+        fn = split_or_fuse_func(is_fuse=is_fuse)
+
+        # last key is fused key, other keys are to be fused.
+        fuse_qkv_keys = [
+            (
+                "layers.0.self_attn.q_proj.weight",
+                "layers.0.self_attn.k_proj.weight",
+                "layers.0.self_attn.v_proj.weight",
+                "layers.0.self_attn.qkv_proj.weight",
+            ),
+            (
+                "layers.0.self_attn.q_proj.bias",
+                "layers.0.self_attn.k_proj.bias",
+                "layers.0.self_attn.v_proj.bias",
+                "layers.0.self_attn.qkv_proj.bias",
+            ),
+        ]
+
+        fuse_gate_up_keys = (
+            "layers.0.mlp.gate_proj.weight",
+            "layers.0.mlp.up_proj.weight",
+            "layers.0.mlp.gate_up_fused_proj.weight",
+        )
+        num_heads = config.num_attention_heads
+        num_key_value_heads = getattr(config, "num_key_value_heads", num_heads)
+        fuse_attention_qkv = getattr(config, "fuse_attention_qkv", False)
+        fuse_attention_ffn = getattr(config, "fuse_attention_ffn", False)
+
+        final_actions = {}
+        if is_fuse:
+            if fuse_attention_qkv:
+                for i in range(config.num_hidden_layers):
+                    for fuse_keys in fuse_qkv_keys:
+                        keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_keys])
+                        final_actions[keys] = partial(
+                            fn, is_qkv=True, num_heads=num_heads, num_key_value_heads=num_key_value_heads
+                        )
+            if fuse_attention_ffn:
+                for i in range(config.num_hidden_layers):
+                    keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_gate_up_keys])
+                    final_actions[keys] = fn
+        else:
+            if not fuse_attention_qkv:
+                for i in range(config.num_hidden_layers):
+                    for fuse_keys in fuse_qkv_keys:
+                        keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_keys])
+                        final_actions[keys] = partial(
+                            fn, split_nums=3, is_qkv=True, num_heads=num_heads, num_key_value_heads=num_key_value_heads
+                        )
+            if not fuse_attention_ffn:
+                for i in range(config.num_hidden_layers):
+                    keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_gate_up_keys])
+                    final_actions[keys] = partial(fn, split_nums=2)
+        return final_actions
+
     def _init_weights(self, layer):
         """Initialization hook"""
         if self.config.tensor_parallel_degree > 1:
@@ -1009,11 +1136,11 @@ def _init_weights(self, layer):
                 nn.Linear,
                 nn.Embedding,
                 mpu.VocabParallelEmbedding,
-                mpu.ColumnParallelLinear,
                 mpu.RowParallelLinear,
-                Qwen2MoeLMHead,
-                linear_utils.ColumnSequenceParallelLinear,
+                mpu.ColumnParallelLinear,
                 linear_utils.RowSequenceParallelLinear,
+                linear_utils.ColumnSequenceParallelLinear,
+                Qwen2MoeLMHead,
             ),
         ):
             # In the dygraph mode, use the `set_value` to reset the parameter directly,
@@ -1087,7 +1214,10 @@ def __init__(self, config: Qwen2MoeConfig):
 
         self.layers = nn.LayerList(
             [
-                Qwen2MoeDecoderLayer(config, layerwise_recompute=layer_idx not in self.no_recompute_layers)
+                Qwen2MoeDecoderLayer(
+                    config=config,
+                    layerwise_recompute=layer_idx not in self.no_recompute_layers,
+                )
                 for layer_idx in range(config.num_hidden_layers)
             ]
         )
@@ -1138,6 +1268,7 @@ def recompute_training_full(
         output_router_logits: bool,
         past_key_value: Tensor,
         use_cache: bool,
+        attn_mask_startend_row_indices=None,
     ):
         def create_custom_forward(module):
             def custom_forward(*inputs):
@@ -1154,6 +1285,7 @@ def custom_forward(*inputs):
             output_router_logits,
             past_key_value,
             use_cache,
+            attn_mask_startend_row_indices,
             use_reentrant=self.config.recompute_use_reentrant,
         )
 
@@ -1161,21 +1293,19 @@ def custom_forward(*inputs):
 
     def forward(
         self,
-        input_ids=None,
-        position_ids=None,
-        attention_mask=None,
-        inputs_embeds=None,
-        use_cache=None,
-        past_key_values=None,
-        output_attentions=False,
-        output_hidden_states=None,
+        input_ids: paddle.Tensor = None,
+        position_ids: Optional[paddle.Tensor] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        inputs_embeds: Optional[paddle.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
         output_router_logits: Optional[bool] = None,
-        return_dict=False,
+        return_dict: Optional[bool] = None,
+        attn_mask_startend_row_indices=None,
         **kwargs,
-    ):
-        if self.sequence_parallel and use_cache:
-            raise ValueError("We currently only support sequence parallel without cache.")
-
+    ) -> Union[Tuple, MoEModelOutputWithPast]:
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
 
         output_router_logits = (
@@ -1185,7 +1315,6 @@ def forward(
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
         )
         use_cache = use_cache if use_cache is not None else self.config.use_cache
-
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
         # retrieve input_ids and inputs_embeds
@@ -1209,6 +1338,7 @@ def forward(
             cache_length = past_key_values[0][0].shape[1]
             seq_length_with_past += cache_length
         if inputs_embeds is None:
+            # [bs, seq_len, dim]
             inputs_embeds = self.embed_tokens(input_ids)
 
         if self.sequence_parallel:
@@ -1219,20 +1349,24 @@ def forward(
             inputs_embeds = ScatterOp.apply(inputs_embeds)
 
         # embed positions
-        if attention_mask is None:
+        if attn_mask_startend_row_indices is not None or get_use_casual_mask():
+            attention_mask = None
+        else:
             # [bs, seq_len]
-            attention_mask = paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool)
+            attention_mask = (
+                paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool)
+                if attention_mask is None
+                else attention_mask
+            )
+            attention_mask = self._prepare_decoder_attention_mask(
+                attention_mask, (batch_size, seq_length), cache_length, inputs_embeds.dtype
+            )  # [bs, 1, seq_len, seq_len]
+            if self.config.use_flash_attention:
+                attention_mask = None if is_casual_mask(attention_mask) else attention_mask
 
         if position_ids is None:
             position_ids = paddle.arange(seq_length, dtype="int64").expand((batch_size, seq_length))
 
-        attention_mask = self._prepare_decoder_attention_mask(
-            attention_mask, (batch_size, seq_length), cache_length, inputs_embeds.dtype
-        )  # [bs, 1, seq_len, seq_len]
-        if self.config.use_flash_attention:
-            is_casual = is_casual_mask(attention_mask)
-            if is_casual:
-                attention_mask = None
         hidden_states = inputs_embeds
 
         # decoder layers
@@ -1262,6 +1396,7 @@ def forward(
                     output_router_logits,
                     past_key_value,
                     use_cache,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 )
             else:
                 layer_outputs = decoder_layer(
@@ -1272,6 +1407,7 @@ def forward(
                     output_router_logits,
                     past_key_value,
                     use_cache,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                 )
 
             # NOTE: clear outdate cache after it has been used for memory saving
@@ -1334,7 +1470,7 @@ def forward(self, prediction_scores, masked_lm_labels):
         if self.enable_parallel_cross_entropy:
             if prediction_scores.shape[-1] == self.config.vocab_size:
                 warnings.warn(
-                    f"enable_parallel_cross_entropy, the vocab_size should be splited: {prediction_scores.shape[-1]}, {self.config.vocab_size}"
+                    f"enable_parallel_cross_entropy, the vocab_size should be splitted: {prediction_scores.shape[-1]}, {self.config.vocab_size}"
                 )
                 self.loss_func = paddle.nn.CrossEntropyLoss(reduction="none", ignore_index=self.ignore_index)
 
@@ -1342,8 +1478,16 @@ def forward(self, prediction_scores, masked_lm_labels):
             masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2))
 
             # skip ignore_index which loss == 0
-            masked_lm_loss = masked_lm_loss[masked_lm_loss > 0]
-            loss = paddle.mean(masked_lm_loss)
+            # masked_lm_loss = masked_lm_loss[masked_lm_loss > 0]
+            # loss = paddle.mean(masked_lm_loss)
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
 
         return loss
 
@@ -1472,26 +1616,35 @@ def update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder
             model_kwargs["position_ids"] = paddle.concat([position_ids, position_ids[..., -1:] + 1], axis=-1)
 
         if not is_encoder_decoder and "attention_mask" in model_kwargs:
+            # TODO: support attention mask for other models
             attention_mask = model_kwargs["attention_mask"]
-            model_kwargs["attention_mask"] = paddle.concat(
-                [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype=attention_mask.dtype)], axis=-1
-            )
+            if len(attention_mask.shape) == 2:
+                model_kwargs["attention_mask"] = paddle.concat(
+                    [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype=attention_mask.dtype)],
+                    axis=-1,
+                )
+            elif len(attention_mask.shape) == 4:
+                model_kwargs["attention_mask"] = paddle.concat(
+                    [attention_mask, paddle.ones([*attention_mask.shape[:3], 1], dtype=attention_mask.dtype)],
+                    axis=-1,
+                )[:, :, -1:, :]
 
         return model_kwargs
 
     def forward(
         self,
-        input_ids=None,
-        position_ids=None,
-        attention_mask=None,
-        inputs_embeds=None,
-        labels=None,
-        use_cache=False,
-        past_key_values=None,
-        output_attentions=None,
-        output_hidden_states=None,
+        input_ids: paddle.Tensor = None,
+        position_ids: Optional[paddle.Tensor] = None,
+        attention_mask: Optional[paddle.Tensor] = None,
+        inputs_embeds: Optional[paddle.Tensor] = None,
+        labels: Optional[paddle.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        past_key_values: Optional[List[paddle.Tensor]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
         output_router_logits: Optional[bool] = None,
-        return_dict=None,
+        return_dict: Optional[bool] = None,
+        attn_mask_startend_row_indices=None,
     ):
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
@@ -1502,6 +1655,13 @@ def forward(
         )
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
+        if attn_mask_startend_row_indices is not None and attention_mask is not None:
+            logger.warning(
+                "You have provided both attn_mask_startend_row_indices and attention_mask. "
+                "The attn_mask_startend_row_indices will be used."
+            )
+            attention_mask = None
+
         # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
         outputs = self.qwen2_moe(
             input_ids=input_ids,  # [bs, seq_len]
@@ -1514,19 +1674,39 @@ def forward(
             output_hidden_states=output_hidden_states,
             output_router_logits=output_router_logits,
             return_dict=return_dict,
+            attn_mask_startend_row_indices=attn_mask_startend_row_indices,
         )
 
         hidden_states = outputs[0]  # [bs, seq_len, dim]
 
         # if labels is None，means we need full output, instead of tensor_parallel_output
-        # tensor_parallel_output is togather with ParallelCrossEntropy
+        # tensor_parallel_output is together with ParallelCrossEntropy
         tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1
 
-        logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+        if labels is not None and self.config.use_fused_linear_cross_entropy:
+            from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy
+
+            assert (
+                self.config.tensor_parallel_degree <= 1
+            ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel "
+
+            masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels)
 
-        loss = None
-        if labels is not None:
-            loss = self.criterion(logits, labels)
+            binary_sequence = paddle.where(
+                masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss)
+            )
+            count = paddle.sum(binary_sequence)
+            if count == 0:
+                loss = paddle.sum(masked_lm_loss * binary_sequence)
+            else:
+                loss = paddle.sum(masked_lm_loss * binary_sequence) / count
+            logits = None
+        else:
+            logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output)
+
+            loss = None
+            if labels is not None:
+                loss = self.criterion(logits, labels)
 
         aux_loss = None
         if output_router_logits:
diff --git a/paddlenlp/transformers/qwen2_moe/modeling_pp.py b/paddlenlp/transformers/qwen2_moe/modeling_pp.py
new file mode 100644
index 000000000000..a4194a9d0c69
--- /dev/null
+++ b/paddlenlp/transformers/qwen2_moe/modeling_pp.py
@@ -0,0 +1,354 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from typing import OrderedDict
+
+import paddle
+import paddle.distributed.fleet as fleet
+import paddle.nn as nn
+from paddle.distributed.fleet.meta_parallel import (
+    LayerDesc,
+    PipelineLayer,
+    SharedLayerDesc,
+)
+from paddle.distributed.fleet.recompute.recompute import recompute
+
+from ...utils.tools import get_env_device
+from ..model_utils import PipelinePretrainedModel
+from .modeling import (
+    Qwen2MoeConfig,
+    Qwen2MoeDecoderLayer,
+    Qwen2MoeLMHead,
+    Qwen2MoeModel,
+    Qwen2MoePretrainedModel,
+    Qwen2MoePretrainingCriterion,
+    Qwen2MoeRMSNorm,
+)
+
+__all__ = [
+    "Qwen2MoeForCausalLMPipe",
+]
+
+
+def parse_args(args):
+    if isinstance(args, tuple):
+        if len(args) == 4:
+            hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = args
+        elif len(args) == 3:
+            hidden_states, attention_mask, attn_mask_startend_row_indices = args
+            position_ids = None
+        elif len(args) == 2:
+            hidden_states, attention_mask = args
+            attn_mask_startend_row_indices, position_ids = None, None
+    else:
+        hidden_states = args
+        attention_mask, attn_mask_startend_row_indices, position_ids = None, None, None
+
+    if position_ids is not None:
+        position_ids.stop_gradient = True
+
+    if attention_mask is not None:
+        attention_mask.stop_gradient = True
+
+    if attn_mask_startend_row_indices is not None:
+        attn_mask_startend_row_indices.stop_gradient = True
+
+    return hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids
+
+
+def return_args(hidden_states, attention_mask=None, attn_mask_startend_row_indices=None, position_ids=None):
+    ret = (hidden_states,)
+
+    if attention_mask is not None:
+        ret += (attention_mask.clone(),)
+    if attn_mask_startend_row_indices is not None:
+        ret += (attn_mask_startend_row_indices.clone(),)
+    if position_ids is not None:
+        ret += (position_ids.clone(),)
+    if len(ret) == 1:
+        ret = ret[0]
+
+    return ret
+
+
+def get_attr(layer, name):
+    if getattr(layer, name, None) is not None:
+        return getattr(layer, name, None)
+    else:
+        return get_attr(layer._layer, name)
+
+
+class Qwen2MoeEmbeddingPipe(nn.Layer):
+    """Extends QWenEmbeddings to forward attention_mask through the pipeline."""
+
+    def __init__(self, config: Qwen2MoeConfig):
+        super(Qwen2MoeEmbeddingPipe, self).__init__()
+        self.config = config
+        self.sequence_parallel = config.sequence_parallel
+        self.hidden_size = config.hidden_size
+        if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0:
+            self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding(
+                config.vocab_size,
+                config.hidden_size,
+                weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()),
+            )
+        else:
+            self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
+
+    @property
+    def embedding_weight(self):
+        return get_attr(self.embed_tokens, "weight")
+
+    def forward(self, args):
+        """_summary_
+
+        Args:
+            input (_type_): _description_
+
+        Returns:
+            _type_: _description_
+        """
+        input_ids, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args)
+        input_embeds = self.embed_tokens(input_ids)
+        if self.config.sequence_parallel:
+            from paddlenlp.transformers import ScatterOp
+
+            # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim]
+            bs, seq_len, hidden_size = input_embeds.shape
+            input_embeds = paddle.reshape_(input_embeds, [bs * seq_len, hidden_size])
+            # [seq_len * bs / n, num_head * head_dim] (n is mp parallelism)
+            input_embeds = ScatterOp.apply(input_embeds)
+
+        batch_size, seq_length = input_ids.shape
+
+        if attention_mask is not None:
+            assert (
+                attn_mask_startend_row_indices is None
+            ), "attention_mask and attn_mask_startend_row_indices can not be set at same time"
+
+            attention_mask = Qwen2MoeModel._prepare_decoder_attention_mask(
+                attention_mask, (batch_size, seq_length), 0, input_embeds.dtype
+            )
+            attention_mask.stop_gradient = True
+            if get_env_device() == "npu":
+                attention_mask = attention_mask.astype("bool")
+        elif get_env_device() == "npu":
+            attention_mask = paddle.tril(paddle.ones((seq_length, seq_length), dtype="bool"))
+            attention_mask.stop_gradient = True
+
+        return return_args(input_embeds, attention_mask, attn_mask_startend_row_indices, position_ids)
+
+
+class Qwen2MoeDecoderLayerPipe(Qwen2MoeDecoderLayer):
+    def forward(self, args):
+        hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args)
+
+        has_gradient = not hidden_states.stop_gradient
+
+        if attention_mask is not None and attention_mask.dtype == paddle.int32:
+            attention_mask, attn_mask_startend_row_indices, position_ids = (
+                None,
+                attention_mask,
+                attn_mask_startend_row_indices,
+            )
+        elif attention_mask is not None and attention_mask.dtype == paddle.int64:
+            attention_mask, attn_mask_startend_row_indices, position_ids = None, None, attention_mask
+        elif attn_mask_startend_row_indices is not None and attn_mask_startend_row_indices.dtype == paddle.int64:
+            attn_mask_startend_row_indices, position_ids = None, attn_mask_startend_row_indices
+
+        if self.enable_recompute and self.config.recompute_granularity == "full" and has_gradient:
+            if attention_mask is not None or attn_mask_startend_row_indices is not None:
+                hidden_states = recompute(
+                    super().forward,
+                    hidden_states,
+                    position_ids=position_ids,
+                    attention_mask=attention_mask,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                    use_reentrant=False,
+                )
+            else:
+                # for pretrain
+                hidden_states = recompute(
+                    super().forward,
+                    hidden_states,
+                    position_ids=position_ids,
+                    attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+                    use_reentrant=self.config.recompute_use_reentrant,
+                )
+        else:
+            hidden_states = super().forward(
+                hidden_states,
+                position_ids=position_ids,
+                attention_mask=attention_mask,
+                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
+            )
+
+        return return_args(hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids)
+
+
+class Qwen2MoeRMSNormPipe(nn.Layer):
+    def __init__(self, config):
+        super().__init__()
+        self.norm = Qwen2MoeRMSNorm(config)
+
+    def forward(self, args):
+        hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args)
+        return self.norm(hidden_states)
+
+
+class Qwen2MoeLMHeadPipe(Qwen2MoeLMHead):
+    def __init__(self, config, transpose_y=False):
+        super(Qwen2MoeLMHeadPipe, self).__init__(config)
+
+    @property
+    def embedding_weight(self):
+        return get_attr(self, "weight")
+
+
+class Qwen2MoeForCausalLMPipe(PipelinePretrainedModel, PipelineLayer):
+    """QWenForPretraining adapted for pipeline parallelism.
+
+    The largest change is flattening the QWenModel class so we can express it as a
+    sequence of layers including embedding, transformer layers, and output.
+    """
+
+    config_class = Qwen2MoeConfig
+
+    _get_tensor_parallel_mappings = Qwen2MoePretrainedModel._get_tensor_parallel_mappings
+    _init_weights = Qwen2MoePretrainedModel._init_weights
+    _keys_to_ignore_on_load_unexpected = Qwen2MoePretrainedModel._keys_to_ignore_on_load_unexpected
+    _tied_weights_keys = ["lm_head.weight"]
+
+    # DONOT Add base_model_prefix !!!!
+
+    @classmethod
+    def _prepare_pipeline_inputs_func(cls, inputs):
+        first_stage_keys = ["input_ids", "attention_mask", "attn_mask_startend_row_indices", "position_ids"]
+        last_stage_keys = ["labels"]
+
+        def get_expected_keys(inputs, keys):
+            ret = tuple([inputs.pop(k) if k in inputs else None for k in keys])
+            if len(ret) == 1:
+                ret = ret[0]
+            return ret
+
+        if type(inputs) is dict or type(inputs) is OrderedDict:
+            return [
+                get_expected_keys(inputs, first_stage_keys),
+                get_expected_keys(inputs, last_stage_keys),
+            ]
+
+        keys = list(inputs[0].keys())
+        inputs_batch = {key: [data.pop(key) for data in inputs] for key in keys}
+        return [
+            get_expected_keys(inputs_batch, first_stage_keys),
+            get_expected_keys(inputs_batch, last_stage_keys),
+        ]
+
+    def __init__(self, config: Qwen2MoeConfig):
+        self.config = config
+
+        # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True
+        # Enable_recompute defaults to False and is controlled by Trainer
+        self.enable_recompute = False
+        self.recompute_granularity = self.config.recompute_granularity
+        self.pp_recompute_interval = self.config.pp_recompute_interval
+        self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else []
+        if self.recompute_granularity == "full":
+            assert len(self.no_recompute_layers) == 0, "for pp with full recompute, no_recompute_layers is not support"
+
+        virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1)
+
+        def get_hcg():
+            return fleet.get_hybrid_communicate_group()
+
+        hcg = get_hcg()
+        tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1)
+        tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0)
+
+        # TODO: fix tensor_parallel_degree rewrite in here
+        config.tensor_parallel_degree = tensor_parallel_degree
+        config.tensor_parallel_rank = tensor_parallel_rank
+
+        if config.tie_word_embeddings:
+            self.add_sequential_layer(
+                SharedLayerDesc(
+                    "qwen2moe_shared_weight",
+                    Qwen2MoeEmbeddingPipe,
+                    shared_weight_attr="embedding_weight",
+                    config=config,
+                ),
+                "qwen2_moe",
+            )
+        else:
+            self.add_sequential_layer(LayerDesc(Qwen2MoeEmbeddingPipe, config=config), "qwen2_moe")
+
+        for i in range(config.num_hidden_layers):
+            self.add_sequential_layer(
+                LayerDesc(
+                    Qwen2MoeDecoderLayerPipe,
+                    config=config,
+                    layerwise_recompute=i not in self.no_recompute_layers,
+                ),
+                f"qwen2_moe.layers.{i}",
+            )
+        self.add_sequential_layer(LayerDesc(Qwen2MoeRMSNormPipe, config=config), "qwen2_moe")
+
+        if config.tie_word_embeddings:
+            self.add_sequential_layer(
+                SharedLayerDesc(
+                    "qwen2moe_shared_weight",
+                    Qwen2MoeLMHeadPipe,
+                    shared_weight_attr="embedding_weight",
+                    config=config,
+                    **{"transpose_y": True},
+                ),
+                "lm_head",
+            )
+        else:
+            self.add_sequential_layer(LayerDesc(Qwen2MoeLMHeadPipe, config=config), "lm_head")
+
+        recompute_interval = 0
+        if self.enable_recompute and self.recompute_granularity == "full":
+            assert self.config.pp_recompute_interval <= config.num_hidden_layers // (
+                virtual_pp_degree * get_hcg().topology().get_dim_size("pipe")
+            ), "pp recompute interval should smaller than num layers of each pp chunk"
+            recompute_interval = self.config.pp_recompute_interval
+
+        seg_method = "layer:Qwen2MoeDecoderLayer"
+        if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0:
+            seg_method = "uniform"
+
+        PipelineLayer.__init__(
+            self,
+            layers=self.get_sequential_layers(),
+            loss_fn=self.get_loss_fn(config),
+            topology=get_hcg().topology(),
+            seg_method=seg_method,
+            recompute_interval=recompute_interval,
+            recompute_ctx={
+                "mp_group": get_hcg().get_model_parallel_group(),
+                "offload": False,
+                "partition": False,
+            },
+            num_virtual_pipeline_stages=virtual_pp_degree,
+        )
+        # You should call init here, since there is a  diamond inheritance problem
+        self.apply(self._init_weights)
+        # DON'T init PipelinePretrainedModel
+        # PipelinePretrainedModel.__init__(self.super(), config=config)
+
+    def get_loss_fn(self, config):
+        return Qwen2MoePretrainingCriterion(config)
diff --git a/paddlenlp/transformers/semantic_search/modeling.py b/paddlenlp/transformers/semantic_search/modeling.py
index c16808e21770..0ba34bd94641 100644
--- a/paddlenlp/transformers/semantic_search/modeling.py
+++ b/paddlenlp/transformers/semantic_search/modeling.py
@@ -282,8 +282,9 @@ def matching_v2(self, input_ids, token_type_ids=None, position_ids=None, attenti
             input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask
         )
         pooled_output = self.ernie.dropout(sequence_output[:, 0])
-        probs = self.ernie.classifier(pooled_output)
-        return probs
+        cls_embedding = self.ernie.classifier(pooled_output)
+        probs = F.softmax(cls_embedding, axis=1)
+        return probs[:, 1]
 
     def matching_v3(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
         """Use the pooled_output as the feature for listwise prediction, eg. ERNIE-Search"""
diff --git a/paddlenlp/transformers/tokenizer_utils.py b/paddlenlp/transformers/tokenizer_utils.py
index 2517e65c3db9..db885c865cbc 100644
--- a/paddlenlp/transformers/tokenizer_utils.py
+++ b/paddlenlp/transformers/tokenizer_utils.py
@@ -788,7 +788,9 @@ def _encode_chat_inputs(
             ans.append(ans_roundi)
 
         non_learnable_parts = self._extract_non_learnable_parts(origin_msg, ans)
-        assert len(non_learnable_parts) == len(ans)
+        assert len(non_learnable_parts) == len(
+            ans
+        ), f"Get non_learnable_parts len: {len(non_learnable_parts)}, but ans len: {len(ans)}."
 
         conversation_ids = []
         for i in range(len(non_learnable_parts)):
@@ -1879,33 +1881,6 @@ def _decode(
         else:
             return text
 
-    def decode_token(
-        self,
-        all_input_ids: List[int],
-        prefix_offset: int = 0,
-        read_offset: int = 0,
-    ) -> Tuple[str, int, int]:
-        """tokenizer decoding for the streaming generation use case. This method can be overrided for tokenizer that doesn't follow this API"""
-        # The prefix text is necessary only to defeat cleanup algorithms in the decode
-        # which decide to add a space or not depending on the surrounding ids.
-        prefix_text = self.decode(
-            all_input_ids[prefix_offset:read_offset], skip_special_tokens=False, clean_up_tokenization_spaces=False
-        )
-        new_text = self.decode(
-            all_input_ids[prefix_offset:], skip_special_tokens=False, clean_up_tokenization_spaces=False
-        )
-
-        if len(new_text) > len(prefix_text) and not prefix_text.endswith("�") and not new_text.endswith("�"):
-            # utf-8 char at the end means it's a potential unfinished byte sequence
-            # from byte fallback tokenization.
-            # If it's in the middle, it's probably a real invalid id generated
-            # by the model
-            prefix_index = new_text.index(prefix_text)
-            new_text = new_text[prefix_index + len(prefix_text) :]
-            return new_text, read_offset, len(all_input_ids)
-        else:
-            return "", prefix_offset, read_offset
-
 
 class BPETokenizer(PretrainedTokenizer):
     """
diff --git a/paddlenlp/transformers/tokenizer_utils_base.py b/paddlenlp/transformers/tokenizer_utils_base.py
index cf1d3391b5c1..79f1d490988c 100644
--- a/paddlenlp/transformers/tokenizer_utils_base.py
+++ b/paddlenlp/transformers/tokenizer_utils_base.py
@@ -967,6 +967,11 @@ def add_tokens(
 
         return self._add_tokens(new_tokens, special_tokens=special_tokens)
 
+    @classmethod
+    def _add_extra_special_tokens(cls, extra_sp_token: Union[str, AddedToken]):
+        if extra_sp_token not in cls.SPECIAL_TOKENS_ATTRIBUTES:
+            cls.SPECIAL_TOKENS_ATTRIBUTES.append(extra_sp_token)
+
     def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
         raise NotImplementedError
 
@@ -1213,7 +1218,13 @@ def special_tokens_map(self) -> Dict[str, Union[str, List[str]]]:
         """
         set_attr = {}
         for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
-            attr_value = getattr(self, "_" + attr)
+            try:
+                attr_value = getattr(self, "_" + attr)
+            except:
+                try:
+                    attr_value = getattr(self, attr)
+                except:
+                    continue
             if attr_value:
                 set_attr[attr] = (
                     type(attr_value)(str(attr_value_sub) for attr_value_sub in attr_value)
@@ -1233,7 +1244,13 @@ def special_tokens_map_extended(self) -> Dict[str, Union[str, AddedToken, List[U
         """
         set_attr = {}
         for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
-            attr_value = getattr(self, "_" + attr)
+            try:
+                attr_value = getattr(self, "_" + attr)
+            except:
+                try:
+                    attr_value = getattr(self, attr)
+                except:
+                    continue
             if attr_value:
                 set_attr[attr] = attr_value
         return set_attr
@@ -1744,6 +1761,7 @@ def convert_added_tokens(obj):
                 elif isinstance(value, list):
                     value = [AddedToken(**token) if isinstance(token, dict) else token for token in value]
                 setattr(tokenizer, key, value)
+                cls._add_extra_special_tokens(key)
 
         # Add supplementary tokens.
         special_tokens = tokenizer.all_special_tokens
@@ -1858,8 +1876,8 @@ def convert_added_tokens(obj: Union[AddedToken, Any], add_type_field=True):
         # Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained
         tokenizer_class = self.__class__.__name__
         # Remove the Fast at the end unless we have a special `PreTrainedTokenizerFast`
-        if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast":
-            tokenizer_class = tokenizer_class[:-4]
+        # if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast":
+        #     tokenizer_class = tokenizer_class[:-4]
         tokenizer_config["tokenizer_class"] = tokenizer_class
 
         with io.open(tokenizer_config_file, "w", encoding="utf-8") as f:
@@ -3426,6 +3444,33 @@ def convert_tokens_to_string(self, tokens: List[str]) -> str:
         """
         raise NotImplementedError
 
+    def decode_token(
+        self,
+        all_input_ids: List[int],
+        prefix_offset: int = 0,
+        read_offset: int = 0,
+    ) -> Tuple[str, int, int]:
+        """tokenizer decoding for the streaming generation use case. This method can be overrided for tokenizer that doesn't follow this API"""
+        # The prefix text is necessary only to defeat cleanup algorithms in the decode
+        # which decide to add a space or not depending on the surrounding ids.
+        prefix_text = self.decode(
+            all_input_ids[prefix_offset:read_offset], skip_special_tokens=False, clean_up_tokenization_spaces=False
+        )
+        new_text = self.decode(
+            all_input_ids[prefix_offset:], skip_special_tokens=False, clean_up_tokenization_spaces=False
+        )
+
+        if len(new_text) > len(prefix_text) and not prefix_text.endswith("�") and not new_text.endswith("�"):
+            # utf-8 char at the end means it's a potential unfinished byte sequence
+            # from byte fallback tokenization.
+            # If it's in the middle, it's probably a real invalid id generated
+            # by the model
+            prefix_index = new_text.index(prefix_text)
+            new_text = new_text[prefix_index + len(prefix_text) :]
+            return new_text, read_offset, len(all_input_ids)
+        else:
+            return "", prefix_offset, read_offset
+
     def batch_decode(
         self,
         sequences: Union[List[int], List[List[int]], "np.ndarray", "paddle.Tensor"],
diff --git a/paddlenlp/transformers/utils.py b/paddlenlp/transformers/utils.py
index cac920da240b..7970ba752d67 100644
--- a/paddlenlp/transformers/utils.py
+++ b/paddlenlp/transformers/utils.py
@@ -962,12 +962,11 @@ def __repr__(self):
         return msg
 
 
-def caculate_llm_flops(
+def caculate_llm_per_token_flops(
     hidden_size,
     intermediate_size,
     layer_num,
     vocab_size,
-    batch_size=1,
     seq_length=None,
     recompute=False,
     recompute_granularity=None,
@@ -1002,4 +1001,4 @@ def caculate_llm_flops(
 
     # 2 for mul + add in matmul
     # 1 for forward, 2 for backwards since we caluate gradients for input_x and input_y
-    return 2 * batch_size * (layer_num * (flops_per_transformer * 3 + flops_recompute_transformer) + 3 * flops_loggits)
+    return 2 * (layer_num * (flops_per_transformer * 3 + flops_recompute_transformer) + 3 * flops_loggits) / seq_length
diff --git a/paddlenlp/trl/llm_utils.py b/paddlenlp/trl/llm_utils.py
index d5fa8dc76354..c19496909295 100644
--- a/paddlenlp/trl/llm_utils.py
+++ b/paddlenlp/trl/llm_utils.py
@@ -34,9 +34,12 @@
 from paddlenlp.transformers import (
     AutoTokenizer,
     ChatGLMv2Tokenizer,
+    DeepseekV2ForCausalLMPipe,
+    DeepseekV3ForCausalLMPipe,
     LlamaForCausalLMPipe,
     PretrainedConfig,
     Qwen2ForCausalLMPipe,
+    Qwen2MoeForCausalLMPipe,
 )
 from paddlenlp.transformers.tokenizer_utils import PretrainedTokenizer
 from paddlenlp.utils.log import logger
@@ -210,7 +213,7 @@ def get_lora_target_modules(model):
             ".*w2.*",
             ".*w3.*",
         ]
-    elif model.base_model_prefix == "qwen2_moe":
+    elif model.base_model_prefix == "qwen2_moe" or isinstance(model, Qwen2MoeForCausalLMPipe):
         target_modules = [
             ".*q_proj.*",
             ".*k_proj.*",
@@ -221,6 +224,21 @@ def get_lora_target_modules(model):
             ".*up_proj.*",
             ".*down_proj.*",
         ]
+    elif model.base_model_prefix in ["deepseek_v2", "deepseek_v3"] or isinstance(
+        model, (DeepseekV2ForCausalLMPipe, DeepseekV3ForCausalLMPipe)
+    ):
+        target_modules = [
+            ".*q_proj.*",
+            ".*q_a_proj.*",
+            ".*q_b_proj.*",
+            ".*kv_a_proj_with_mqa.*",
+            ".*kv_b_proj.*",
+            ".*kv_b_proj.*",
+            ".*o_proj.*",
+            ".*mlp.gate_proj.*",
+            ".*mlp.up_proj.*",
+            ".*mlp.down_proj.*",
+        ]
     elif model.base_model_prefix == "yuan":
         target_modules = [
             ".*q_proj.*",
@@ -597,7 +615,9 @@ def get_model_max_position_embeddings(config: PretrainedConfig) -> Optional[int]
 
 
 def read_res(model_name_or_path: str, tensor_queue: mp.Queue, result_queue: mp.Queue, done_event: mp.Event):
-    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+    from paddlenlp.utils.env import USE_FAST_TOKENIZER
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="left", use_fast=USE_FAST_TOKENIZER)
 
     paddle.device.set_device("cpu")
     paddle.disable_static()
@@ -628,7 +648,9 @@ def read_res(model_name_or_path: str, tensor_queue: mp.Queue, result_queue: mp.Q
 
 
 def speculate_read_res(model_name_or_path: str, tensor_queue: mp.Queue, result_queue: mp.Queue, done_event: mp.Event):
-    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+    from paddlenlp.utils.env import USE_FAST_TOKENIZER
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=USE_FAST_TOKENIZER)
     paddle.device.set_device("cpu")
     paddle.disable_static()
     outputs = []
diff --git a/paddlenlp/utils/env.py b/paddlenlp/utils/env.py
index c139327b9ebd..ac7396a48828 100644
--- a/paddlenlp/utils/env.py
+++ b/paddlenlp/utils/env.py
@@ -20,6 +20,13 @@
 """
 import os
 
+try:
+    from paddle.base.framework import use_pir_api
+
+    pir_enabled = use_pir_api()
+except ImportError:
+    pir_enabled = False
+
 
 def _get_user_home():
     return os.path.expanduser("~")
@@ -132,3 +139,12 @@ def _get_bool_env(env_key: str, default_value: str) -> bool:
 MAX_BSZ = 512
 SPECULATE_MAX_BSZ = 256
 MAX_DRAFT_TOKENS = 6
+
+if pir_enabled:
+    PADDLE_INFERENCE_MODEL_SUFFIX = ".json"
+    PADDLE_INFERENCE_WEIGHTS_SUFFIX = ".pdiparams"
+else:
+    PADDLE_INFERENCE_MODEL_SUFFIX = ".pdmodel"
+    PADDLE_INFERENCE_WEIGHTS_SUFFIX = ".pdiparams"
+
+USE_FAST_TOKENIZER: bool = _get_bool_env("USE_FAST_TOKENIZER", "false")
diff --git a/scripts/codestyle/check_dead_links.py b/scripts/codestyle/check_dead_links.py
index 1bf7ea85f8f0..343d69384429 100644
--- a/scripts/codestyle/check_dead_links.py
+++ b/scripts/codestyle/check_dead_links.py
@@ -35,6 +35,8 @@ def find_dead_links(directory):
     dead_links = []
 
     for root, dirs, files in os.walk(directory):
+        if "third_party" in root:
+            continue
         for file in files:
             if file.endswith((".md", ".rst")):
                 file_path = os.path.join(root, file)
diff --git a/scripts/distribute/run_ci.sh b/scripts/distribute/run_ci.sh
index c5d5de02c06b..04dfd54b3222 100644
--- a/scripts/distribute/run_ci.sh
+++ b/scripts/distribute/run_ci.sh
@@ -176,7 +176,7 @@ function execute_func_list(){
                 echo -e "\033[31m verification failed!"
                 let verification_fail_count++
                 galobal_verification_fail_arr+=("$func_name")
-            elif [ $result -eq 250 ]; then
+            elif [ $result -eq 250 ] || [ $result -eq 1 ]; then
                 if [ $execute_num -eq 1 ]; then
                     echo -e "\033[31m fist time execute failed, try again!"
                     let execute_num++
diff --git a/scripts/regression/ci_case.sh b/scripts/regression/ci_case.sh
index d9c233c31a1f..8baefaae4232 100644
--- a/scripts/regression/ci_case.sh
+++ b/scripts/regression/ci_case.sh
@@ -22,6 +22,17 @@ export CXX_COMPILER_PATH=$(which g++)
 export CC=$(which gcc)
 export CXX=$(which g++)
 
+export PADDLE_INFERENCE_MODEL_SUFFIX=$(python -c "
+import paddle
+try:
+    from paddle.base.framework import use_pir_api
+    pir_enabled = use_pir_api()
+except ImportError:
+    pir_enabled = False
+model_suffix = '.json' if pir_enabled else '.pdmodel'
+print(model_suffix)
+")
+
 if [ ! -d "model_logs" ]; then
     mkdir model_logs
 fi
@@ -32,17 +43,22 @@ fi
 print_info() {
     if [ $1 -ne 0 ]; then
         if [[ $2 =~ 'tests' ]]; then
-            mv ${nlp_dir}/unittest_logs/$3.log ${nlp_dir}/unittest_logs/$3_FAIL.log
+            cp ${nlp_dir}/unittest_logs/$3.log ${nlp_dir}/unittest_logs/$3_FAIL.log
             echo -e "\033[31m ${nlp_dir}/unittest_logs/$3_FAIL \033[0m"
             cat ${nlp_dir}/unittest_logs/$3_FAIL.log
         else
-            mv ${log_path}/$2 ${log_path}/$2_FAIL.log
+            cat ${log_path}/$2.log | grep -v "SKIPPED" | grep -v "PASSED" > ${log_path}/$2_FAIL.log
             echo -e "\033[31m ${log_path}/$2_FAIL \033[0m"
             cat ${log_path}/$2_FAIL.log
         fi
+        cp ${log_path}/$2_FAIL.log ${PPNLP_HOME}/upload/$2_FAIL.log.${AGILE_PIPELINE_BUILD_ID}.${AGILE_JOB_BUILD_ID}
+        cd ${PPNLP_HOME} && python upload.py ${PPNLP_HOME}/upload 'paddlenlp/PaddleNLP_CI/PaddleNLP_CI'
+        rm -rf upload/*
     elif [[ $2 =~ 'tests' ]]; then
+        tail -n 1 ${log_path}/$3.log
         echo -e "\033[32m ${log_path}/$3_SUCCESS \033[0m"
     else
+        tail -n 1 ${log_path}/$2.log
         echo -e "\033[32m ${log_path}/$2_SUCCESS \033[0m"
     fi
 }
@@ -363,7 +379,7 @@ lexical_analysis(){
     print_info $? lexical_analysis_predict
     # deploy
     time (python deploy/predict.py \
-        --model_file=infer_model/static_graph_params.pdmodel \
+        --model_file=infer_model/static_graph_params${PADDLE_INFERENCE_MODEL_SUFFIX} \
         --params_file=infer_model/static_graph_params.pdiparams \
         --data_dir lexical_analysis_dataset_tiny >${log_path}/lexical_analysis_deploy) >>${log_path}/lexical_analysis_deploy 2>&1
     print_info $? lexical_analysis_deploy
@@ -467,7 +483,7 @@ ernie-csc() {
     python export_model.py --params_path ./checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params >${log_path}/ernie-csc_export >>${log_path}/ernie-csc_export 2>&1
     print_info $? ernie-csc_export
     #python deploy
-    python predict.py --model_file infer_model/static_graph_params.pdmodel --params_file infer_model/static_graph_params.pdiparams >${log_path}/ernie-csc_deploy >>${log_path}/ernie-csc_deploy 2>&1
+    python predict.py --model_file infer_model/static_graph_params${PADDLE_INFERENCE_MODEL_SUFFIX} --params_file infer_model/static_graph_params.pdiparams >${log_path}/ernie-csc_deploy >>${log_path}/ernie-csc_deploy 2>&1
     print_info $? ernie-csc_deploy
 }
 
@@ -540,15 +556,25 @@ taskflow (){
     print_info $? taskflow
 }
 llm(){
-    cd ${nlp_dir}/csrc
-    echo "build paddlenlp_op"
-    python setup_cuda.py install
+    if git diff --numstat "$AGILE_COMPILE_BRANCH" | awk '{print $NF}' | grep -q '^csrc/'; then
+        echo "Found modifications in csrc, running setup_cuda.py install and uploading it to bos."
+        cd ${nlp_dir}/csrc
+        # python setup_cuda.py install
+        bash tools/build_wheel.sh python3.10 80
+        cp ./dist/p****.whl ${PPNLP_HOME}/upload/
+        cd ${PPNLP_HOME}
+        python upload.py ${PPNLP_HOME}/upload 'paddlenlp/wheels'
+        rm -rf upload/*
+    else
+        echo "No modifications in csrc, installing paddlenlp_ops wheel file..."
+        python -m pip install https://paddlenlp.bj.bcebos.com/wheels/paddlenlp_ops-0.0.0-py3-none-any.whl
+    fi
 
     sleep 5
     
     echo ' Testing all LLMs '
     cd ${nlp_dir}
-    python -m pytest tests/llm/test_*.py -vv --timeout=300 --alluredir=result >${log_path}/llm >>${log_path}/llm 2>&1
+    python -m pytest tests/llm/test_*.py -vv --timeout=300 --alluredir=result >${log_path}/llm.log >>${log_path}/llm.log 2>&1
     print_info $? llm
 }
 
diff --git a/scripts/regression/run_ci.sh b/scripts/regression/run_ci.sh
index d0bd15d9c67a..ed667d3fc436 100644
--- a/scripts/regression/run_ci.sh
+++ b/scripts/regression/run_ci.sh
@@ -141,11 +141,11 @@ for file_name in `git diff --numstat ${AGILE_COMPILE_BRANCH} |awk '{print $NF}'`
         done
         if [[ ${dir2} =~ "__init__" ]];then # 针对发版mini test
             P0case_list[${#P0case_list[*]}]=bert
-        elif [[ ${!all_P0case_dic[*]} =~ ${dir2} ]];then
+        elif [[ ${!all_P0case_dic[*]} == ${dir2} ]];then
             P0case_list[${#P0case_list[*]}]=${dir2}
         elif [[ ${dir2} =~ "transformers" ]];then
             P0case_list[${#P0case_list[*]}]=llm
-            if [[ ${!all_P0case_dic[*]} =~ ${dir3} ]];then
+            if [[ ${!all_P0case_dic[*]} == ${dir3} ]];then
                 P0case_list[${#P0case_list[*]}]=${dir3}
             fi
         elif [[ ${dir2} =~ "taskflow" ]];then
@@ -215,7 +215,7 @@ if [[ ${#Build_list[*]} -ne 0 ]];then
     cd /workspace
     rm -rf PaddleNLP_dev/build/*
     cd PaddleNLP_dev && git submodule update --init --recursive
-    cd /workspace && tar -zcvf PaddleNLP.tar.gz PaddleNLP_dev/
+    cd /workspace && tar -zcf PaddleNLP.tar.gz PaddleNLP_dev/
     mv PaddleNLP.tar.gz ${PPNLP_HOME}/upload
     cd ${PPNLP_HOME}
     python upload.py ${PPNLP_HOME}/upload 'paddlenlp/wheels'
@@ -306,8 +306,9 @@ if [[ ${#P0case_list[*]} -ne 0 ]] || [[ ${#APIcase_list[*]} -ne 0 ]];then
     fi
     cd ${nlp_dir}
     echo -e "\033[35m ---- Genrate Allure Report  \033[0m"
+    unset http_proxy && unset https_proxy
     cp scripts/regression/gen_allure_report.py ./
-    python gen_allure_report.py
+    python gen_allure_report.py > ${nlp_dir}/coverage_logs/gen_allure_report.log 2>&1
     echo -e "\033[35m ---- Report: https://xly.bce.baidu.com/ipipe/ipipe-report/report/${AGILE_JOB_BUILD_ID}/report/  \033[0m"
     ####################################
     # run coverage
diff --git a/scripts/regression/test_taskflow.py b/scripts/regression/test_taskflow.py
index e8e6c69e4461..686bedca5efe 100644
--- a/scripts/regression/test_taskflow.py
+++ b/scripts/regression/test_taskflow.py
@@ -13,6 +13,7 @@
 # limitations under the License.
 """Test taskflow."""
 import os
+import unittest
 
 from paddlenlp import Taskflow
 
@@ -68,6 +69,7 @@ def test_corrector():
     corrector("遇到逆竟时，我们必须勇于面对，而且要愈挫愈勇，这样我们才能朝著成功之路前进。")
 
 
+@unittest.skip("dependency_parsing is not support for Paddle >= 2.6.1")
 def test_dependency_parsing():
     """
     test_dependency_parsing
diff --git a/scripts/unit_test/ci_unit.sh b/scripts/unit_test/ci_unit.sh
index 763fe47e5b72..d48622629465 100644
--- a/scripts/unit_test/ci_unit.sh
+++ b/scripts/unit_test/ci_unit.sh
@@ -16,6 +16,8 @@
 
 export paddle=$1
 export nlp_dir=/workspace/PaddleNLP
+export log_path=/workspace/PaddleNLP/unittest_logs
+mkdir -p /workspace/PaddleNLP/coverage_report
 cd $nlp_dir
 
 if [ ! -d "unittest_logs" ];then
@@ -32,15 +34,27 @@ install_requirements() {
     python -m pip install -r paddlenlp/experimental/autonlp/requirements.txt 
     python -m pip uninstall paddlepaddle paddlepaddle_gpu -y
     python -m pip install pillow -y
+    python -m pip install allure-pytest -y
     python -m pip install --no-cache-dir ${paddle}
-    python -c "import paddle;print('paddle')print(paddle.__version__);print(paddle.version.show())" >> ${log_path}/commit_info.txt
+    python -c "import paddle;print('paddle');print(paddle.__version__);print(paddle.version.show())" >> ${log_path}/commit_info.txt
 
     python setup.py bdist_wheel > /dev/null
     python -m pip install  dist/p****.whl
     python -c "import paddlenlp; print('paddlenlp commit:',paddlenlp.version.commit)" >> ${log_path}/commit_info.txt
-    cd csrc/
-    python setup_cuda.py install
-    cd ../
+    
+    if git diff --numstat "$AGILE_COMPILE_BRANCH" | awk '{print $NF}' | grep -q '^csrc/'; then
+        echo "Found modifications in csrc, running setup_cuda.py install and uploading it to bos."
+        cd ${nlp_dir}/csrc
+        # python setup_cuda.py install
+        bash tools/build_wheel.sh python3.10 80
+        # cp ./dist/p****.whl ${PPNLP_HOME}/upload/
+        # cd ${PPNLP_HOME}
+        # python upload.py ${PPNLP_HOME}/upload 'paddlenlp/wheels'
+        # rm -rf upload/*
+    else
+        echo "No modifications in csrc, installing paddlenlp_ops wheel file..."
+        python -m pip install https://paddlenlp.bj.bcebos.com/wheels/paddlenlp_ops-0.0.0-py3-none-any.whl
+    fi
 
     pip list 
 }
@@ -52,10 +66,39 @@ set_env() {
     export FLAGS_use_cuda_managed_memory=true
 }
 
+print_info() {
+    if [ $1 -ne 0 ]; then
+        cat ${log_path}/unittest.log | grep -v "Fail to fscanf: Success" \
+            | grep -v "SKIPPED" | grep -v "warning" > ${log_path}/unittest_FAIL.log
+        tail -n 1 ${log_path}/unittest.log >> ${log_path}/unittest_FAIL.log
+        echo -e "\033[31m ${log_path}/unittest_FAIL \033[0m"
+        cat ${log_path}/unittest_FAIL.log
+        cp ${log_path}/unittest_FAIL.log ${PPNLP_HOME}/upload/unittest_FAIL.log.${AGILE_PIPELINE_BUILD_ID}.${AGILE_JOB_BUILD_ID}
+        cd ${PPNLP_HOME} && python upload.py ${PPNLP_HOME}/upload 'paddlenlp/PaddleNLP_CI/PaddleNLP-CI-Unittest-GPU'
+        rm -rf upload/*
+    else
+        tail -n 1 ${log_path}/unittest.log
+        echo -e "\033[32m ${log_path}/unittest_SUCCESS \033[0m"
+    fi
+}
+
 install_requirements
 set_env
+cd ${nlp_dir}
+echo ' Testing all unittest cases '
 pytest -v -n 8 \
   --dist loadgroup \
   --retries 1 --retry-delay 1 \
-  --timeout 200 --durations 20 \
-  --cov paddlenlp --cov-report xml:coverage.xml
+  --timeout 200 --durations 20 --alluredir=result \
+  --cov paddlenlp --cov-report xml:coverage.xml > ${log_path}/unittest.log 2>&1
+exit_code=$?
+print_info $exit_code unittest
+
+cd ${nlp_dir}
+echo -e "\033[35m ---- Genrate Allure Report  \033[0m"
+unset http_proxy && unset https_proxy
+cp scripts/regression/gen_allure_report.py ./
+python gen_allure_report.py > ${nlp_dir}/coverage_report/gen_allure_report.log 2>&1
+echo -e "\033[35m ---- Report: https://xly.bce.baidu.com/ipipe/ipipe-report/report/${AGILE_JOB_BUILD_ID}/report/  \033[0m"
+
+exit $exit_code
\ No newline at end of file
diff --git a/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py b/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py
index cd60ae8ec765..0b1e9b6017a9 100644
--- a/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py
+++ b/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py
@@ -14,16 +14,12 @@
 
 import argparse
 import logging
-import sys
 
 import mteb
-import paddle
-from models.modeling import BiEncoderModel
-from models.modeling_nv import NVEncodeModel
 from mteb import MTEB
 
 from paddlenlp.peft import LoRAConfig, LoRAModel
-from paddlenlp.transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
+from paddlenlp.transformers import AutoTokenizer, BiEncoderModel, NVEncodeModel
 
 
 class MTEB_EvalModel:
diff --git a/slm/pipelines/examples/contrastive_training/train.py b/slm/pipelines/examples/contrastive_training/train.py
index d9d27e5fa01b..c18263dff95d 100644
--- a/slm/pipelines/examples/contrastive_training/train.py
+++ b/slm/pipelines/examples/contrastive_training/train.py
@@ -17,12 +17,10 @@
 from arguments import DataArguments, ModelArguments
 from arguments import RetrieverTrainingArguments as TrainingArguments
 from data import EmbedCollator, TrainDatasetForEmbedding
-from models.modeling import BiEncoderModel
-from models.modeling_nv import NVEncodeModel
 
 from paddlenlp.peft import LoRAConfig, LoRAModel
 from paddlenlp.trainer import PdArgumentParser, Trainer, get_last_checkpoint, set_seed
-from paddlenlp.transformers import AutoTokenizer
+from paddlenlp.transformers import AutoTokenizer, BiEncoderModel, NVEncodeModel
 from paddlenlp.utils.log import logger
 
 
diff --git a/tests/llm/test_gradio.py b/tests/llm/test_gradio.py
index 731c5f9bf6d3..2ab830583b25 100644
--- a/tests/llm/test_gradio.py
+++ b/tests/llm/test_gradio.py
@@ -1,3 +1,4 @@
+#!/usr/bin/env python
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -46,10 +47,9 @@ def setUp(self):
         self.model_path = "__internal_testing__/micro-random-llama"
         command = (
             "cd ./llm && PYTHONPATH=../:$PYTHONPATH"
-            + ' {python} predict/flask_server.py --model_name_or_path {model_path} --port {port} --flask_port {flask_port} --src_length 1024 --dtype "float16"'.format(
-                flask_port=self.flask_port, port=self.port, model_path=self.model_path, python=sys.executable
-            )
-        )
+            + " {python} predict/flask_server.py --model_name_or_path {model_path} "
+            + '--port {port} --flask_port {flask_port} --src_length 1024 --dtype "float16"'
+        ).format(flask_port=self.flask_port, port=self.port, model_path=self.model_path, python=sys.executable)
         current_env = copy.copy(os.environ.copy())
         current_env.pop("http_proxy", None)
         current_env.pop("https_proxy", None)
@@ -58,7 +58,6 @@ def setUp(self):
 
         self.ui_process = subprocess.Popen(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=current_env)
         self.tokenizer = LlamaTokenizer.from_pretrained(self.model_path)
-
         return super().setUp()
 
     def tearDown(self):
@@ -79,13 +78,11 @@ def wait_until_server_is_ready(self):
         while True:
             if is_port_in_use(self.flask_port) and is_port_in_use(self.port):
                 break
-
             print("waiting for server ...")
             time.sleep(1)
 
     def get_gradio_ui_result(self, *args, **kwargs):
         _, _, file = self.client.predict(*args, **kwargs)
-
         with open(file, "r", encoding="utf-8") as f:
             content = json.load(f)
         return content[-1]["utterance"]
@@ -95,64 +92,57 @@ def test_argument(self):
         self.wait_until_server_is_ready()
 
         def get_response(data):
-            res = requests.post(f"http://localhost:{self.flask_port}/api/chat", json=data, stream=True)
+            res = requests.post(f"http://localhost:{self.flask_port}/v1/chat/completions", json=data, stream=True)
             result_ = ""
             for line in res.iter_lines():
-                print(line)
-                result = json.loads(line)
-                bot_response = result["result"]["response"]
-
-                if bot_response["utterance"].endswith("[END]"):
-                    bot_response["utterance"] = bot_response["utterance"][:-5]
-
-                result_ += bot_response["utterance"]
-
+                if not line:
+                    continue
+                decoded_line = line.decode("utf-8").strip()
+                # 如果返回行以 "data:" 开头，则去除该前缀
+                if decoded_line.startswith("data:"):
+                    data_str = decoded_line[len("data:") :].strip()
+                else:
+                    data_str = decoded_line
+                if data_str == "[DONE]":
+                    break
+                chunk = json.loads(data_str)
+                # 根据 OpenAI 的流式返回，每个 chunk 在 choices[0]["delta"] 中包含回复增量
+                delta = chunk["choices"][0]["delta"].get("content", "")
+                result_ += delta
             return result_
 
+        # 测试用例1：greedy search 模式（top_p 为1.0）
         data = {
-            "context": "你好",
-            "top_k": 1,
-            "top_p": 1.0,
+            "messages": [{"role": "user", "content": "你好"}],
             "temperature": 1.0,
-            "repetition_penalty": 1.0,
-            "max_length": 20,
-            "min_length": 1,
+            "max_tokens": 20,
+            "top_p": 1.0,
+            "stream": True,
         }
-        # Case 1: greedy search
-        # result_0 = get_response(data)
         result_1 = get_response(data)
 
-        # TODO(wj-Mcat): enable logit-comparision later
-        # assert result_0 == result_1
-
+        # 测试用例2：采样模式（top_p 为 0.7）
         data = {
-            "context": "你好",
-            "top_k": 0,
-            "top_p": 0.7,
+            "messages": [{"role": "user", "content": "你好"}],
             "temperature": 1.0,
-            "repetition_penalty": 1.0,
-            "max_length": 20,
-            "min_length": 1,
+            "max_tokens": 20,
+            "top_p": 0.7,
+            "stream": True,
         }
-
-        # Case 2: sampling
         result_2 = get_response(data)
-        # assert result_1 != result_2
 
-        # 测试长度应该保持一致
+        # 对生成文本的长度进行简单检测
         assert 10 <= len(self.tokenizer.tokenize(result_1)) <= 50
         assert 10 <= len(self.tokenizer.tokenize(result_2)) <= 50
 
+        # 测试用例3：更长的 max_tokens 参数
         data = {
-            "context": "你好",
-            "top_k": 1,
-            "top_p": 0.7,
+            "messages": [{"role": "user", "content": "你好"}],
             "temperature": 1.0,
-            "repetition_penalty": 1.0,
-            "max_length": 100,
-            "min_length": 1,
+            "max_tokens": 100,
+            "top_p": 0.7,
+            "stream": True,
         }
-        # Case 3: max_length
         result_3 = get_response(data)
         assert result_3 != result_2
         assert 70 <= len(self.tokenizer.tokenize(result_3)) <= 150
diff --git a/tests/mergekit/test_merge_model.py b/tests/mergekit/test_merge_model.py
index ce1692ec9369..9deb25a20292 100644
--- a/tests/mergekit/test_merge_model.py
+++ b/tests/mergekit/test_merge_model.py
@@ -20,12 +20,11 @@
 
 from paddlenlp.mergekit import MergeConfig, MergeModel
 from paddlenlp.transformers import AutoModel
-from tests.testing_utils import require_gpu
 
 
 class TestMergeModel(unittest.TestCase):
     @parameterized.expand([("slerp",), ("della",), ("dare_linear",), ("ties",)])
-    def test_merge_model(self, merge_method):
+    def test_merge_model_np(self, merge_method):
         with TemporaryDirectory() as tempdir:
             model = AutoModel.from_pretrained("__internal_testing__/tiny-random-bert", dtype="bfloat16")
             pd_path = os.path.join(tempdir, "pd_model")
@@ -68,34 +67,50 @@ def test_merge_model(self, merge_method):
             mergekit = MergeModel(merge_config)
             mergekit.merge_model()
 
-            # test safetensor only with pd
+    @parameterized.expand([("slerp",), ("della",), ("dare_linear",), ("ties",)])
+    def test_merge_model_pd(self, merge_method):
+        with TemporaryDirectory() as tempdir:
+            model = AutoModel.from_pretrained("__internal_testing__/tiny-random-bert", dtype="bfloat16")
+            pd_path = os.path.join(tempdir, "pd_model")
+            model.save_pretrained(pd_path)
+            safe_path = os.path.join(tempdir, "safe_model")
+            model.save_pretrained(safe_path, safe_serialization="safetensors")
+
+            # test mix
+            merge_config = MergeConfig(
+                merge_method=merge_method, model_path_list=[safe_path, pd_path], output_path=tempdir, tensor_type="pd"
+            )
+            mergekit = MergeModel(merge_config)
+            mergekit.merge_model()
+
+            # test mix with base model
             merge_config = MergeConfig(
                 merge_method=merge_method,
-                model_path_list=[safe_path, safe_path],
+                model_path_list=[safe_path, pd_path],
                 output_path=tempdir,
-                n_process=2,
+                base_model_path=safe_path,
                 tensor_type="pd",
             )
             mergekit = MergeModel(merge_config)
             mergekit.merge_model()
 
-    @parameterized.expand([("slerp",), ("della",), ("dare_linear",), ("ties",)])
-    @require_gpu(2)
-    def test_merge_model_gpu(self, merge_method):
-        with TemporaryDirectory() as tempdir:
-            model = AutoModel.from_pretrained("__internal_testing__/tiny-random-bert", dtype="bfloat16")
-            pd_path = os.path.join(tempdir, "pd_model")
-            model.save_pretrained(pd_path)
-            safe_path = os.path.join(tempdir, "safe_model")
-            model.save_pretrained(safe_path, safe_serialization="safetensors")
+            # test safetensor only
+            merge_config = MergeConfig(
+                merge_method=merge_method,
+                model_path_list=[safe_path, safe_path],
+                output_path=tempdir,
+                tensor_type="pd",
+            )
+            mergekit = MergeModel(merge_config)
+            mergekit.merge_model()
 
-            # test safetensor only with pd and gpu
+            # test safetensor only with base model
             merge_config = MergeConfig(
                 merge_method=merge_method,
                 model_path_list=[safe_path, safe_path],
                 output_path=tempdir,
-                device="gpu",
                 tensor_type="pd",
+                base_model_path=safe_path,
             )
             mergekit = MergeModel(merge_config)
             mergekit.merge_model()
diff --git a/tests/taskflow/test_text_classification.py b/tests/taskflow/test_text_classification.py
index eb2469d6b099..43ac3c0c361c 100644
--- a/tests/taskflow/test_text_classification.py
+++ b/tests/taskflow/test_text_classification.py
@@ -24,7 +24,6 @@
     PromptModelForSequenceClassification,
     SoftVerbalizer,
 )
-from paddlenlp.taskflow import Taskflow
 from paddlenlp.taskflow.text_classification import TextClassificationTask
 from paddlenlp.transformers import (
     AutoModelForMaskedLM,
@@ -145,60 +144,60 @@ def test_classification_task(self, batch_size, problem_type, model):
                 if model == "multi_label":
                     self.assertGreater(dygraph_pred["score"], dygraph_taskflow.multilabel_threshold)
 
-    @unittest.skip("numerical error")
-    @parameterized.expand(
-        [
-            (1, "multi_class", "finetune"),
-            (1, "multi_class", "prompt"),
-            (1, "multi_label", "finetune"),
-            (1, "multi_label", "prompt"),
-        ]
-    )
-    def test_taskflow_task(self, batch_size, problem_type, mode):
-        input_text = ["百度", "深度学习框架", "飞桨", "PaddleNLP"]
-        id2label = {
-            0: "negative",
-            1: "positive",
-        }
-        if mode == "finetune":
-            dygraph_model_path = self.finetune_dygraph_model_path
-            static_model_path = self.finetune_static_model_path
-        else:
-            dygraph_model_path = self.prompt_dygraph_model_path
-            static_model_path = self.prompt_static_model_path
-
-        dygraph_taskflow = Taskflow(
-            mode=mode,
-            task="text_classification",
-            task_path=dygraph_model_path,
-            id2label=id2label,
-            batch_size=batch_size,
-            device_id=0,
-            problem_type=problem_type,
-        )
-
-        dygraph_results = dygraph_taskflow(input_text)
-
-        self.assertEqual(len(dygraph_results), len(input_text))
-
-        static_taskflow = Taskflow(
-            mode=mode,
-            task="text_classification",
-            is_static_model=True,
-            task_path=static_model_path,
-            id2label=id2label,
-            batch_size=batch_size,
-            device_id=0,
-            problem_type=problem_type,
-        )
-
-        static_results = static_taskflow(input_text)
-        self.assertEqual(len(static_results), len(input_text))
-
-        for dygraph_result, static_result in zip(dygraph_results, static_results):
-            for dygraph_pred, static_pred in zip(dygraph_result["predictions"], static_result["predictions"]):
-                self.assertEqual(dygraph_pred["label"], static_pred["label"])
-                self.assertAlmostEqual(dygraph_pred["score"], static_pred["score"], delta=1e-6)
-                # if multi_label, all predictions should be greater than the threshold
-                if mode == "multi_label":
-                    self.assertGreater(dygraph_pred["score"], dygraph_taskflow.task_instance.multilabel_threshold)
+    # @unittest.skip("numerical error")
+    # @parameterized.expand(
+    #     [
+    #         (1, "multi_class", "finetune"),
+    #         (1, "multi_class", "prompt"),
+    #         (1, "multi_label", "finetune"),
+    #         (1, "multi_label", "prompt"),
+    #     ]
+    # )
+    # def test_taskflow_task(self, batch_size, problem_type, mode):
+    #     input_text = ["百度", "深度学习框架", "飞桨", "PaddleNLP"]
+    #     id2label = {
+    #         0: "negative",
+    #         1: "positive",
+    #     }
+    #     if mode == "finetune":
+    #         dygraph_model_path = self.finetune_dygraph_model_path
+    #         static_model_path = self.finetune_static_model_path
+    #     else:
+    #         dygraph_model_path = self.prompt_dygraph_model_path
+    #         static_model_path = self.prompt_static_model_path
+
+    #     dygraph_taskflow = Taskflow(
+    #         mode=mode,
+    #         task="text_classification",
+    #         task_path=dygraph_model_path,
+    #         id2label=id2label,
+    #         batch_size=batch_size,
+    #         device_id=0,
+    #         problem_type=problem_type,
+    #     )
+
+    #     dygraph_results = dygraph_taskflow(input_text)
+
+    #     self.assertEqual(len(dygraph_results), len(input_text))
+
+    #     static_taskflow = Taskflow(
+    #         mode=mode,
+    #         task="text_classification",
+    #         is_static_model=True,
+    #         task_path=static_model_path,
+    #         id2label=id2label,
+    #         batch_size=batch_size,
+    #         device_id=0,
+    #         problem_type=problem_type,
+    #     )
+
+    #     static_results = static_taskflow(input_text)
+    #     self.assertEqual(len(static_results), len(input_text))
+
+    #     for dygraph_result, static_result in zip(dygraph_results, static_results):
+    #         for dygraph_pred, static_pred in zip(dygraph_result["predictions"], static_result["predictions"]):
+    #             self.assertEqual(dygraph_pred["label"], static_pred["label"])
+    #             self.assertAlmostEqual(dygraph_pred["score"], static_pred["score"], delta=1e-6)
+    #             # if multi_label, all predictions should be greater than the threshold
+    #             if mode == "multi_label":
+    #                 self.assertGreater(dygraph_pred["score"], dygraph_taskflow.task_instance.multilabel_threshold)
diff --git a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP1_MP4_PP1_VPP5_Sharding8_Stage1.sh b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP4_MP2_PP4.sh
similarity index 94%
rename from tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP1_MP4_PP1_VPP5_Sharding8_Stage1.sh
rename to tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP4_MP2_PP4.sh
index 674336e66fbc..4de325035360 100644
--- a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP1_MP4_PP1_VPP5_Sharding8_Stage1.sh
+++ b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP4_MP2_PP4.sh
@@ -13,7 +13,7 @@
 # limitations under the License.
 
 param="model_item=gpt-3-13b_pretrain "
-param+="run_mode=DP1_MP4_PP1_VPP5_Sharding8_Stage1 "
+param+="run_mode=DP4_MP2_PP4 "
 param+="device_num=N4C32 "
 param+="global_batch_size=128 "
 param+="nnodes=4 "
diff --git a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json
index 4bae68aa098c..8a354b585224 100644
--- a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json
+++ b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json
@@ -6,17 +6,20 @@
     "split": "949,50,1",
     "max_seq_length": 4096,
     "gradient_accumulation_steps": 32,
-    "tensor_parallel_degree": 4,
-    "pipeline_parallel_degree": 1,
-    "virtual_pp_degree": 5,
-    "sequence_parallel": 0,
+    "tensor_parallel_degree": 2,
+    "pipeline_parallel_degree": 4,
+    "virtual_pp_degree": 1,
+    "sequence_parallel": 1,
     "sharding": "stage1",
     "pipeline_parallel_config": "enable_sharding_comm_overlap enable_release_grads ",
     "tensor_parallel_config": "enable_mp_async_allreduce enable_sp_async_reduce_scatter enable_mp_skip_c_identity enable_mp_fused_linear_param_grad_add",
     "per_device_train_batch_size": 1,
     "use_flash_attention": true,
+    "use_fast_layer_norm": true,
     "use_fused_rms_norm": true,
     "fuse_attention_qkv": true,
+    "use_fused_linear": true,
+    "use_fused_dropout_add": true,
     "use_fused_rope": true,
     "fuse_attention_ffn": true,
     "enable_linear_fused_grad_add": true,
@@ -25,12 +28,12 @@
     "scale_loss": 1024,
     "learning_rate": 1e-05,
     "min_learning_rate": 5e-06,
-    "max_steps": 200,
+    "max_steps": 500,
     "save_steps": 5000,
     "weight_decay": 0.01,
     "warmup_ratio": 0.01,
     "max_grad_norm": 1.0,
-    "logging_steps": 2,
+    "logging_steps": 5,
     "dataloader_num_workers": 1,
     "eval_steps": 1000,
     "disable_tqdm": true,
diff --git a/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP2_1F1B_Sharding4_Stage1.sh b/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP1_Sharding8_Stage1.sh
similarity index 91%
rename from tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP2_1F1B_Sharding4_Stage1.sh
rename to tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP1_Sharding8_Stage1.sh
index ee1ebfa5c8de..80b151dc2658 100644
--- a/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP2_1F1B_Sharding4_Stage1.sh
+++ b/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP1_Sharding8_Stage1.sh
@@ -13,9 +13,9 @@
 # limitations under the License.
 
 param="model_item=baichuan-inc-baichuan-2-13b_pretrain_dy2st "
-param+="run_mode=DP1_MP4_PP2_1F1B_Sharding4_Stage1 "
+param+="run_mode=DP1_MP4_PP1_Sharding8_Stage1 "
 param+="device_num=N4C32 "
-param+="global_batch_size=128 "
+param+="global_batch_size=32 "
 param+="nnodes=4 "
 param+="model_type=baichuan2_13b "
 
diff --git a/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json b/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json
index 47112a243e62..1c95ef419434 100644
--- a/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json
+++ b/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json
@@ -5,17 +5,16 @@
     "output_dir": "./checkpoints/baichuan2_13b_ckpts",
     "split": "949,50,1",
     "to_static": true,
-    "pipeline_parallel_degree": 2,
+    "pipeline_parallel_degree": 1,
     "tensor_parallel_degree": 4,
-    "virtual_pp_degree": 2,
-    "pipeline_schedule_mode": "1F1B",
+    "virtual_pp_degree": 1,
     "weight_decay": 0.01,
     "warmup_ratio": 0.01,
-    "max_grad_norm": 0.0,
+    "max_grad_norm": 1.0,
     "learning_rate": 0.00003,
     "min_learning_rate": 0.000003,
-    "max_steps": 100,
-    "logging_steps": 1,
+    "max_steps": 200,
+    "logging_steps": 5,
     "eval_steps": 10000,
     "save_steps": 1000,
     "continue_training": 0,
@@ -25,11 +24,11 @@
     "disable_tqdm": true,
     "save_total_limit": 2,
     "device": "gpu",
-    "dataloader_num_workers": 4,
+    "dataloader_num_workers": 1,
     "distributed_dataloader": 0,
     "enable_auto_parallel": 1,
-    "per_device_train_batch_size": 1,
-    "gradient_accumulation_steps": 32,
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 2,
     "per_device_eval_batch_size": 1,
     "recompute": false,
     "recompute_use_reentrant": true,
@@ -46,8 +45,9 @@
     "use_fused_rope": true,
     "use_fused_rms_norm": true,
     "max_seq_length": 4096,
-    "sequence_parallel": false,
+    "sequence_parallel": 1,
     "sharding": "stage1",
+    "sharding_parallel_degree": 8,
     "sharding_parallel_config": "enable_tensor_fusion enable_overlap",
     "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy",
     "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate",
diff --git a/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP1_1F1B_Sharding8_Stage1.sh b/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP4_MP2_PP4.sh
similarity index 94%
rename from tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP1_1F1B_Sharding8_Stage1.sh
rename to tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP4_MP2_PP4.sh
index 19095d700cfe..0728d03885a4 100644
--- a/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP1_1F1B_Sharding8_Stage1.sh
+++ b/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP4_MP2_PP4.sh
@@ -13,7 +13,7 @@
 # limitations under the License.
 
 param="model_item=gpt-3-13b_pretrain_dy2st "
-param+="run_mode=DP1_MP4_PP1_1F1B_Sharding8_Stage1 "
+param+="run_mode=DP4_MP2_PP4 "
 param+="device_num=N4C32 "
 param+="global_batch_size=128 "
 param+="nnodes=4 "
diff --git a/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json b/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json
index fed29487dc88..c80f7403fe59 100644
--- a/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json
+++ b/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json
@@ -7,18 +7,20 @@
     "output_dir": "./checkpoints/gpt_pretrain_ckpts",
     "split": "949,50,1",
     "max_seq_length": 4096,
+    "tensor_parallel_degree": 2,
+    "pipeline_parallel_degree": 4,
     "per_device_train_batch_size": 1,
     "per_device_eval_batch_size": 1,
     "scale_loss": 1024,
     "learning_rate": 0.00001,
     "min_learning_rate": 0.000001,
-    "max_steps": 100,
+    "max_steps": 500,
     "save_steps": 50000,
     "weight_decay": 0.01,
     "warmup_ratio": 0.01,
-    "logging_steps": 1,
+    "logging_steps": 5,
     "continue_training": 0,
-    "dataloader_num_workers": 4,
+    "dataloader_num_workers": 1,
     "eval_steps": 100000,
     "report_to": "visualdl",
     "disable_tqdm": true,
@@ -26,14 +28,9 @@
     "do_eval": true,
     "device": "gpu",
     "model_type": "gpt",
-    "sharding": "stage1",
-    "tensor_parallel_degree": 4,
-    "pipeline_parallel_degree": 1,
-    "virtual_pp_degree": 2,
-    "pipeline_schedule_mode": "1F1B",
-    "virtual_pipeline_seg_method": "GPTDecoderLayerAuto",
-    "sequence_parallel": 0,
+    "sequence_parallel": 1,
     "use_flash_attention": 1,
+    "use_fast_layer_norm": 1,
     "fused_linear": 1,
     "fuse_attention_ffn": 1,
     "fuse_attention_qkv": 1,
@@ -45,13 +42,12 @@
     "recompute_granularity": "full",
     "pp_recompute_interval": 1,
     "gradient_accumulation_steps": 32,
-    "max_grad_norm": 0.1,
+    "max_grad_norm": 1.0,
     "bf16": 1,
     "fp16_opt_level": "O2",
     "amp_master_grad": true,
     "attention_probs_dropout_prob": 0.1,
     "hidden_dropout_prob": 0.1,
-    "sharding_parallel_config": "enable_tensor_fusion enable_overlap",
     "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy",
     "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate",
     "pipeline_parallel_config": "enable_send_recv_overlap enable_split_backward"
diff --git a/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP1_PP4_VPP5_Sharding8_Stage2.sh b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP1_PP4_VPP5_Sharding8_Stage2.sh
new file mode 100644
index 000000000000..5844f90fd419
--- /dev/null
+++ b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP1_PP4_VPP5_Sharding8_Stage2.sh
@@ -0,0 +1,27 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+param="model_item=intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st "
+param+="run_mode=DP1_MP1_PP4_VPP5_Sharding8_Stage2 "
+param+="device_num=N4C32 "
+param+="global_batch_size=32 "
+param+="nnodes=4 "
+param+="model_type=llama2_13b "
+param+='intermediate_api=intermediate_api_ '
+
+
+cd ./tests
+bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/prepare.sh
+
+bash -c "${param} bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh"
diff --git a/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh
new file mode 100644
index 000000000000..4ae528040fbb
--- /dev/null
+++ b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh
@@ -0,0 +1,28 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+param="model_item=intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st "
+param+="run_mode=DP1_MP8_PP4_VPP5 "
+param+="device_num=N4C32 "
+param+="global_batch_size=32 "
+param+="nnodes=4 "
+param+="model_type=llama2_70b "
+param+='intermediate_api=intermediate_api_ '
+
+
+cd ./tests
+bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/prepare.sh
+
+bash -c "${param} bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh"
+
diff --git a/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st_bs32_bf16_Sharding32_Stage2.sh b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st_bs32_bf16_Sharding32_Stage2.sh
new file mode 100644
index 000000000000..1cdc5ebb4992
--- /dev/null
+++ b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st_bs32_bf16_Sharding32_Stage2.sh
@@ -0,0 +1,26 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+param="model_item=intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st "
+param+="run_mode=Sharding32_Stage2 "
+param+="device_num=N4C32 "
+param+="global_batch_size=32 "
+param+="nnodes=4 "
+param+="model_type=llama2_7b "
+param+='intermediate_api=intermediate_api_ '
+
+cd ./tests
+bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/prepare.sh
+
+bash -c "${param} bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh"
diff --git a/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh b/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh
index 88b326057402..0a69e3cf54d9 100644
--- a/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh
+++ b/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh
@@ -24,6 +24,10 @@ function _set_params(){
     fp_item="bf16"
     MODEL_TYPE=${model_type:-"llama2_7b"}
 
+    # for intermediate api
+    intermediate_api=${intermediate_api:-""}
+
+
     ip_lists=($(echo $TRAINER_INSTANCES | tr ',' ' '))
     master_ip=${ip_lists[0]}
     nnodes=${nnodes:-1}
@@ -174,17 +178,17 @@ function _train(){
         train_cmd="python -u -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 \
             --nnodes 1 --nproc_per_node 8 \
             --log_dir mylog run_pretrain_auto.py \
-            ./pretrain_config_${MODEL_TYPE}/pretrain-${MODEL_TYPE}.json"
+            ./pretrain_config_${MODEL_TYPE}/${intermediate_api}pretrain-${MODEL_TYPE}.json"
         ;;
     N4C32) echo "Run with: device_num=${device_num} run_mode=${run_mode}"
         train_cmd="python -u -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 \
             --log_dir mylog run_pretrain_auto.py \
-            ./pretrain_config_${MODEL_TYPE}/pretrain-${MODEL_TYPE}.json"
+            ./pretrain_config_${MODEL_TYPE}/${intermediate_api}pretrain-${MODEL_TYPE}.json"
         ;;
     *) echo "Run with: device_num=${device_num}, run_mode=${run_mode}"
         train_cmd="python -u -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 \
             --log_dir mylog run_pretrain_auto.py \
-            ./pretrain_config_${MODEL_TYPE}/pretrain-${MODEL_TYPE}.json"
+            ./pretrain_config_${MODEL_TYPE}/${intermediate_api}pretrain-${MODEL_TYPE}.json"
         ;;
     esac
     cd ../llm/auto_parallel/llama
diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api-llama2_13b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api_pretrain-llama2_13b.json
similarity index 98%
rename from tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api-llama2_13b.json
rename to tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api_pretrain-llama2_13b.json
index 8c2d29feeb29..1582a6d30404 100644
--- a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api-llama2_13b.json
+++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api_pretrain-llama2_13b.json
@@ -28,7 +28,7 @@
   "min_learning_rate": 3e-06,
   "warmup_steps": 30,
   "logging_steps": 10,
-  "max_steps": 100,
+  "max_steps": 500,
   "save_steps": 5000,
   "eval_steps": 1000,
   "weight_decay": 0.01,
diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/intermediate_api_pretrain-llama2_70b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/intermediate_api_pretrain-llama2_70b.json
new file mode 100644
index 000000000000..f927c165788b
--- /dev/null
+++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/intermediate_api_pretrain-llama2_70b.json
@@ -0,0 +1,64 @@
+{
+    "model_name_or_path": "meta-llama/Llama-2-70b",
+    "tokenizer_name_or_path": "meta-llama/Llama-2-70b",
+    "input_dir": "./data",
+    "output_dir": "./checkpoints/llama2_pretrain_ckpts",
+    "weight_decay": 0.01,
+    "warmup_ratio": 0.01,
+    "max_grad_norm": 1.0,
+    "learning_rate": 3e-05,
+    "min_learning_rate": 3e-06,
+    "warmup_steps": 30,
+    "logging_steps": 10,
+    "max_steps": 500,
+    "save_steps": 5000,
+    "eval_steps": 1000,
+    "continue_training": 0,
+    "do_train": true,
+    "do_eval": false,
+    "do_predict": false,
+    "disable_tqdm": true,
+    "skip_profile_timer": true,
+    "save_total_limit": 2,
+    "device": "gpu",
+    "dataloader_num_workers": 1,
+    "distributed_dataloader": 0,
+    "enable_auto_parallel": true,
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 32,
+    "per_device_eval_batch_size": 32,
+    "recompute": false,
+    "recompute_use_reentrant": true,
+    "recompute_granularity": "full",
+    "pp_recompute_interval": 0,
+    "bf16": true,
+    "fp16_opt_level": "O2",
+    "amp_master_grad": true,
+    "amp_custom_black_list": ["reduce_sum", "c_softmax_with_cross_entropy"],
+    "amp_custom_white_list": ["lookup_table", "lookup_table_v2"],
+    "fuse_attention_ffn": true,
+    "fuse_attention_qkv": true,
+    "use_fused_rope": true,
+    "fused_linear_param_grad_add": true,
+    "fuse_sequence_parallel_allreduce": false,
+    "use_flash_attention": true,
+    "use_fused_rms_norm": true,
+    "sep_parallel_degree": 1,
+    "sequence_parallel": true,   
+    "pipeline_parallel_degree": 4,
+    "sharding_parallel_degree": 1,
+    "sharding": "stage1",
+    "tensor_parallel_degree": 8,
+    "virtual_pp_degree": 5,
+    "pipeline_schedule_mode": "VPP", 
+    "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate",
+    "sharding_parallel_config": "enable_overlap enable_tensor_fusion",
+    "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy",
+    "max_seq_length": 4096,
+    "to_static": true,
+    "eliminate_transpose": 1,
+    "fuse_allreduce_split_to_reducescatter": 1,
+    "sequence_parallel_config": "enable_allreduce_avg_in_gradinent_scale",
+    "model_type": "llama_network",
+    "use_intermediate_api": true
+}
diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api-llama2_7b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api_pretrain-llama2_7b.json
similarity index 100%
rename from tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api-llama2_7b.json
rename to tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api_pretrain-llama2_7b.json
diff --git a/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh b/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs32_bf16_DP1_MP2_Sharding16_Stage1.sh
similarity index 91%
rename from tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh
rename to tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs32_bf16_DP1_MP2_Sharding16_Stage1.sh
index 50d990884957..b2e775139346 100644
--- a/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh
+++ b/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs32_bf16_DP1_MP2_Sharding16_Stage1.sh
@@ -13,9 +13,9 @@
 # limitations under the License.
 
 param="model_item=qwen-14b_pretrain_dy2st "
-param+="run_mode=DP1_MP2_PP4_1F1B_Sharding4_Stage1 "
+param+="run_mode=DP1_MP2_Sharding16_Stage1 "
 param+="device_num=N4C32 "
-param+="global_batch_size=128 "
+param+="global_batch_size=32 "
 param+="nnodes=4 "
 param+="model_type=qwen_14b "
 
diff --git a/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json b/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json
index 8525e89b413b..1950b15933ec 100644
--- a/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json
+++ b/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json
@@ -4,28 +4,29 @@
     "input_dir": "./data",
     "output_dir": "./checkpoints/qwen_pretrain_ckpts",
     "per_device_train_batch_size": 1,
-    "gradient_accumulation_steps": 32,
-    "per_device_eval_batch_size": 16,
+    "gradient_accumulation_steps": 2,
+    "per_device_eval_batch_size": 1,
     "sharding": "stage1",
     "tensor_parallel_degree": 2,
-    "pipeline_parallel_degree": 4,
-    "virtual_pp_degree": 5,
-    "pipeline_schedule_mode": "1F1B",
+    "pipeline_parallel_degree": 1,
+    "sharding_parallel_degree": 16,
+    "sequence_parallel": 1,
+    "virtual_pp_degree": 1,
     "virtual_pipeline_seg_method": "QWenBlockAuto",
     "use_flash_attention": true,
-    "use_fused_rms_norm": false,
+    "use_fused_rms_norm": true,
     "use_fused_rope": true,
     "fused_linear": 1,
     "fuse_attention_ffn": 1,
     "fuse_attention_qkv": 1,
     "fused_linear_param_grad_add": 1,
     "max_seq_length": 4096,
-    "learning_rate": 0.00003,
-    "min_learning_rate": 0.000003,
+    "learning_rate": 1e-05,
+    "min_learning_rate": 5e-06,
     "scale_loss": 1024,
     "warmup_steps": 30,
-    "logging_steps": 1,
-    "max_steps": 100,
+    "logging_steps": 5,
+    "max_steps": 200,
     "save_steps": 1000,
     "eval_steps": 10000,
     "weight_decay": 0.01,
@@ -33,8 +34,8 @@
     "fp16_opt_level": "O2",
     "amp_master_grad": true,
     "warmup_ratio": 0.01,
-    "max_grad_norm": 0.0,
-    "dataloader_num_workers": 4,
+    "max_grad_norm": 1.0,
+    "dataloader_num_workers": 1,
     "continue_training": 0,
     "do_train": true,
     "do_eval": false,
@@ -47,7 +48,6 @@
     "save_total_limit": 2,
     "enable_auto_parallel": 1,
     "to_static": 1,
-    "auto_parallel_resume_form_hybrid_parallel": true,
     "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate",
     "sharding_parallel_config": "enable_overlap enable_tensor_fusion",
     "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy",
diff --git a/tests/testing_utils.py b/tests/testing_utils.py
index dbcfda9fe38b..59007307133b 100644
--- a/tests/testing_utils.py
+++ b/tests/testing_utils.py
@@ -511,7 +511,7 @@ def require_paddle_up_to_2_gpus(test_case):
 def require_gpu(min_gpus: int = 1):
     def actual_decorator(func):
         gpu_count = paddle.device.cuda.device_count()
-
+        print("gpu count: ", gpu_count)
         if gpu_count < min_gpus:
             return unittest.skip(f"test requires {min_gpus} GPUs")(func)
 
diff --git a/tests/trainer/test_moe_unified_checkpoint.py b/tests/trainer/test_moe_unified_checkpoint.py
new file mode 100644
index 000000000000..618e2b2f3daf
--- /dev/null
+++ b/tests/trainer/test_moe_unified_checkpoint.py
@@ -0,0 +1,176 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import numpy as np
+import pytest
+
+from paddlenlp.utils.downloader import get_path_from_url_with_filelock
+from tests.parallel_launch import TestMultipleGpus
+from tests.testing_utils import require_paddle_at_least_8_gpu, skip_for_none_ce_case
+from tests.trainer.test_unified_checkpoint import remove_ckpt, remove_logs
+from tests.trainer.trainer_utils import get_pretrain_arguments
+
+environment_variables = {
+    "NCCL_ALGO": "Tree",
+    "NVIDIA_TF32_OVERRIDE": "0",
+    "NCCL_IB_TIMEOUT": "22",
+    "NCCL_DEBUG": "INFO",
+    "FLAGS_embedding_deterministic": "1",
+    "FLAGS_cudnn_deterministic": "1",
+    "Flags_mp_aysnc_allreduce": "1",
+    "Flags_skip_mp_c_identity": "1",
+    "FLAGS_shard_norm_align_dp": "0",
+    "FLAGS_shard_use_reduce": "1",
+    "test_ci_no_save_model": "1",
+}
+
+moe_arguments = {
+    "model_name_or_path": "__internal_testing__/unified-ckpt-qwen2moe",
+    "dataset_name_or_path": "./unified_checkpoint/peft_input/data/",
+    "output_dir": "./unified_checkpoint/checkpoints/qwen2moe_sft_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 8,
+    "eval_accumulation_steps": 16,
+    "learning_rate": 3e-04,
+    "max_steps": 10,
+    "save_steps": 6,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "no",
+    "save_strategy": "steps",
+    "src_length": 1024,
+    "max_length": 2048,
+    "bf16": "true",
+    "fp16_opt_level": "O2",
+    "do_train": "true",
+    "do_eval": "false",
+    "disable_tqdm": "true",
+    "eval_with_do_generation": "false",
+    "recompute": "true",
+    "recompute_granularity": "full",
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "sharding": "",
+    "lora": "false",
+    "zero_padding": "false",
+    "use_flash_attention": "false",
+    "unified_checkpoint": 1,
+    "continue_training": 0,
+    "sequence_parallel": 0,
+}
+
+
+def check_acc(log_dir="log"):
+    file_path = os.path.join(log_dir, "workerlog.n0.c0")
+    cmd = "grep -a 'global_step: 10' " + file_path + " | awk -F ','  '{print $2}' | awk  '{print $6}'"
+    import subprocess
+
+    res = subprocess.check_output(cmd, shell=True, text=True)
+    res = [float(x) for x in res.split()]
+
+    return res
+
+
+seed = 2024
+
+rng = np.random.default_rng(seed=seed)
+
+
+@pytest.mark.xdist_group(name="UC")
+class TestUnifiedCheckpointBase(TestMultipleGpus):
+    @classmethod
+    @property
+    def __test__(cls):
+        return cls != TestUnifiedCheckpointBase
+
+    def setUp(self):
+        """
+        1. update runfirst and rerun to run defined different config
+        2. update need_allclose to True if you want to check the result
+        3. update rtol to the relative value you want to check
+        """
+
+        self.configs = get_pretrain_arguments(moe_arguments)
+        os.environ.update(environment_variables)
+
+        file_ = "https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz"
+        input_dir = "unified_checkpoint/peft_input/"
+        os.makedirs(input_dir, exist_ok=True)
+        file_path = os.path.join(input_dir, "AdvertiseGen.tar.gz")
+        if not os.path.exists(file_path):
+            get_path_from_url_with_filelock(file_, root_dir=input_dir)
+
+        self.need_allclose = True
+        self.rtol = 1e-7
+
+        self.run_file = "llm/run_finetune.py"
+
+    def runfirst(self, train_args):
+        self.run_n1c8(self.run_file, **train_args)
+
+    def rerun(self, train_args):
+        self.run_n1c8(self.run_file, **train_args)
+
+    @require_paddle_at_least_8_gpu
+    def testTP4DP2(self):
+        remove_logs()
+        remove_ckpt(moe_arguments["output_dir"])
+
+        train_args = self.configs["TP4DP2"]
+        self.runfirst(train_args)
+        self.rerun(train_args)
+
+        if self.need_allclose:
+            res = check_acc()
+            assert len(res) == 2
+            np.testing.assert_allclose(res[0], res[1], self.rtol)
+
+    @skip_for_none_ce_case
+    @require_paddle_at_least_8_gpu
+    def testTP2Sharding4(self):
+        remove_logs()
+        remove_ckpt(moe_arguments["output_dir"])
+
+        train_args = self.configs["TP2Sharding4"]
+        self.runfirst(train_args)
+        self.rerun(train_args)
+
+        if self.need_allclose:
+            res = check_acc()
+            assert len(res) == 2
+            np.testing.assert_allclose(res[0], res[1], self.rtol)
+
+
+@pytest.mark.xdist_group(name="UC")
+class TestUnifiedCheckpointFull(TestUnifiedCheckpointBase):
+    @skip_for_none_ce_case
+    @require_paddle_at_least_8_gpu
+    def testTP2Sharding4V2(self):
+        remove_logs()
+        remove_ckpt(moe_arguments["output_dir"])
+
+        train_args = self.configs["TP2Sharding4"]
+        train_args.update({"sharding_parallel_config": "split_param"})
+        train_args.update({"amp_master_grad": True})
+        self.runfirst(train_args)
+        self.rerun(train_args)
+
+        if self.need_allclose:
+            res = check_acc()
+            assert len(res) == 2
+            np.testing.assert_allclose(res[0], res[1], self.rtol)
diff --git a/tests/trainer/trainer_utils.py b/tests/trainer/trainer_utils.py
index ae9a40e61d59..cda374ce1c6a 100644
--- a/tests/trainer/trainer_utils.py
+++ b/tests/trainer/trainer_utils.py
@@ -141,6 +141,14 @@ def get_pretrain_arguments(pretrain_arguments):
     train_args["gradient_accumulation_steps"] = train_args["gradient_accumulation_steps"] // 8
     configs["DP8"] = train_args
 
+    train_args = copy.deepcopy(pretrain_arguments)
+    train_args["tensor_parallel_degree"] = 2
+    train_args["pipeline_parallel_degree"] = 1
+    train_args["sharding_parallel_degree"] = 2
+    train_args["sharding"] = "stage1"
+    train_args["gradient_accumulation_steps"] = train_args["gradient_accumulation_steps"] // 4
+    configs["TP2DP2Sharding2"] = train_args
+
     return configs
 
 
diff --git a/tests/transformers/auto/test_tokenizer.py b/tests/transformers/auto/test_tokenizer.py
index 1e47267f91a3..e36aebd072b6 100644
--- a/tests/transformers/auto/test_tokenizer.py
+++ b/tests/transformers/auto/test_tokenizer.py
@@ -124,8 +124,9 @@ def test_new_tokenizer_fast_registration(self):
                 new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir, use_fast=True)
                 self.assertIsInstance(new_tokenizer, CustomTokenizerFast)
 
-                new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir, use_fast=False)
-                self.assertIsInstance(new_tokenizer, CustomTokenizer)
+                # TODO: fix this test. Now keep loaded tokenizer type
+                # new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir, use_fast=False)
+                # self.assertIsInstance(new_tokenizer, CustomTokenizer)
         finally:
             if "custom" in CONFIG_MAPPING._extra_content:
                 del CONFIG_MAPPING._extra_content["custom"]
diff --git a/tests/transformers/llm_embed/__init__.py b/tests/transformers/llm_embed/__init__.py
new file mode 100644
index 000000000000..a9cc79cc9d7f
--- /dev/null
+++ b/tests/transformers/llm_embed/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/tests/transformers/llm_embed/test_modeling.py b/tests/transformers/llm_embed/test_modeling.py
new file mode 100644
index 000000000000..80ca5542ee37
--- /dev/null
+++ b/tests/transformers/llm_embed/test_modeling.py
@@ -0,0 +1,47 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import paddle
+
+from paddlenlp.transformers import AutoTokenizer, BiEncoderModel
+
+from ...testing_utils import require_gpu
+
+
+class BiEncoderModelIntegrationTest(unittest.TestCase):
+    @require_gpu(1)
+    def test_model_tiny_logits(self):
+        input_texts = [
+            "This is a test",
+            "This is another test",
+        ]
+
+        model_name_or_path = "BAAI/bge-large-en-v1.5"
+        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+        model = BiEncoderModel(model_name_or_path=model_name_or_path, dtype="float16", tokenizer=tokenizer)
+        with paddle.no_grad():
+            out = model.encode_sentences(sentences=input_texts)
+
+        print(out)
+        """
+        [[ 0.00674057  0.03396606  0.00722122 ...  0.01176453  0.00311279 -0.02825928]
+         [ 0.00708771  0.03982544 -0.00155735 ...  0.00658417  0.01318359 -0.03259277]]
+        """
+
+        del model
+        paddle.device.cuda.empty_cache()
+        gc.collect()
diff --git a/tests/transformers/nv_embed/__init__.py b/tests/transformers/nv_embed/__init__.py
new file mode 100644
index 000000000000..a9cc79cc9d7f
--- /dev/null
+++ b/tests/transformers/nv_embed/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/tests/transformers/nv_embed/test_modeling.py b/tests/transformers/nv_embed/test_modeling.py
new file mode 100644
index 000000000000..0718389f156d
--- /dev/null
+++ b/tests/transformers/nv_embed/test_modeling.py
@@ -0,0 +1,69 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import paddle
+
+from paddlenlp.transformers import NVEncodeModel, PretrainedConfig
+
+from ...testing_utils import require_gpu
+
+
+class NVEncodeModelIntegrationTest(unittest.TestCase):
+    @require_gpu(1)
+    def test_model_tiny_logits(self):
+        input_texts = [
+            "This is a test",
+            "This is another test",
+        ]
+
+        config = PretrainedConfig(
+            attention_dropout=0.0,
+            bos_token_id=1,
+            dtype="float16",
+            eos_token_id=2,
+            hidden_act="silu",
+            hidden_size=4096,
+            initializer_range=0.02,
+            intermediate_size=14336,
+            max_position_embeddings=32768,
+            num_attention_heads=32,
+            num_hidden_layers=32,
+            num_key_value_heads=8,
+            rms_norm_eps=1e-05,
+            rope_theta=10000.0,
+            sliding_window=4096,
+            tie_word_embeddings=False,
+            vocab_size=32000,
+        )
+        model = NVEncodeModel(
+            config=config,
+            tokenizer_path="BAAI/bge-large-en-v1.5",
+            query_instruction="",
+            document_instruction="",
+        )
+        with paddle.no_grad():
+            out = model.encode_sentences(input_texts, instruction_len=0)
+
+        print(out)
+        """
+        [[-0.00473404  0.00711441  0.01237488 ... -0.00228691 -0.01416779 -0.00429535]
+         [-0.00343323  0.00911713  0.00894928 ... -0.00637054 -0.0165863 -0.00852966]]
+        """
+
+        del model
+        paddle.device.cuda.empty_cache()
+        gc.collect()
diff --git a/tests/transformers/test_modeling_common.py b/tests/transformers/test_modeling_common.py
index 51e8745fcb33..a32cc3bfcf26 100644
--- a/tests/transformers/test_modeling_common.py
+++ b/tests/transformers/test_modeling_common.py
@@ -38,7 +38,13 @@
 from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer
 from paddlenlp.transformers.configuration_utils import PretrainedConfig
 from paddlenlp.transformers.model_utils import PretrainedModel
-from paddlenlp.utils.env import CONFIG_NAME, LEGACY_CONFIG_NAME, MODEL_HOME
+from paddlenlp.utils.env import (
+    CONFIG_NAME,
+    LEGACY_CONFIG_NAME,
+    MODEL_HOME,
+    PADDLE_INFERENCE_MODEL_SUFFIX,
+    PADDLE_INFERENCE_WEIGHTS_SUFFIX,
+)
 
 from ..testing_utils import slow
 
@@ -968,11 +974,8 @@ def test_to_static_use_top_k(self):
                         use_top_p=False,
                     ),
                 )
-                if paddle.framework.use_pir_api():
-                    model_path = os.path.join(tempdir, "model.json")
-                else:
-                    model_path = os.path.join(tempdir, "model.pdmodel")
-                params_path = os.path.join(tempdir, "model.pdiparams")
+                model_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_MODEL_SUFFIX}")
+                params_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}")
                 config = paddle.inference.Config(model_path, params_path)
 
                 config.disable_gpu()
@@ -1040,11 +1043,8 @@ def test_to_static_use_top_p(self):
                     ),
                 )
 
-                if paddle.framework.use_pir_api():
-                    model_path = os.path.join(tempdir, "model.json")
-                else:
-                    model_path = os.path.join(tempdir, "model.pdmodel")
-                params_path = os.path.join(tempdir, "model.pdiparams")
+                model_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_MODEL_SUFFIX}")
+                params_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}")
                 config = paddle.inference.Config(model_path, params_path)
 
                 config.disable_gpu()

模型名称	数据集名称	CMeEE-V2	Boson	CLUENER	CCIR2021-NER	任务对话2018-NER	银行借贷2021-NER	SKE2019	Avg
	数据集领域	医疗领域	通用领域	通用领域	新闻领域	对话领域	金融领域	金融领域
PP-UIE-0.5B	F1(0-shot)	0.479	0.638	0.593	0.773	0.723	0.361	0.782	0.621
PP-UIE-1.5B	F1(0-shot)	0.485	0.688	0.61	0.799	0.768	0.444	0.803	0.657
	F1(5-shot)	0.52	0.694	0.625	0.812	0.812	0.466	0.801	0.676
PP-UIE-7B	F1(0-shot)	0.521	0.696	0.615	0.826	0.807	0.434	0.812	0.673
	F1(5-shot)	0.527	0.705	0.626	0.826	0.861	0.483	0.801	0.69
PP-UIE-14B	F1(0-shot)	0.556	0.712	0.637	0.841	0.843	0.488	0.832	0.701
	F1(5-shot)	0.588	0.729	0.67	0.837	0.865	0.576	0.832	0.728