diff --git a/README.md b/README.md index 7cf797492ffb..f7fc6111ac6a 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ ------------------------------------------------------------------------------------------


@@ -31,17 +32,21 @@ PaddlePaddle%2FPaddleNLP | Trendshift ## News 📢 + +* **2025.02.20 🔥🔥《PP-UIE信息抽取智能引擎全新升级》** 强化零样本学习能力,支持极少甚至零标注数据实现高效冷启动与迁移学习,显著降低数据标注成本;具备处理长文本能力,支持 8192 个Token长度文档信息抽取,实现跨段落识别关键信息,形成完整理解;提供完整可定制化的训练和推理全流程,训练效率相较于LLaMA-Factory实现了1.8倍的提升。 +2月26日(周三)19:00为您深度解析全新PP-UIE技术方案及在部署方面的功能、优势与技巧。报名链接:https://www.wjx.top/vm/mBKC6pb.aspx?udsid=606418 + +* **2025.02.10 PaddleNLP 现已支持 DeepSeek-R1系列模型,[在线使用](https://aistudio.baidu.com/projectdetail/8775758)**:依托全新的 PaddleNLP 3.0套件,DeepSeek-R1系列模型现已全面支持。凭借数据并行、数据分组切分并行、模型并行、流水线并行以及专家并行等一系列先进的分布式训练能力,结合 Paddle 框架独有的列稀疏注意力掩码表示技术——FlashMask 方法,DeepSeek-R1系列模型在训练过程中显著降低了显存消耗,同时取得了卓越的训练性能提升。 + * **2024.12.16 [PaddleNLP v3.0 Beta3](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v3.0.0-beta3)**:大模型功能全新升级,新增了 Llama-3.2、DeepSeekV2模型,升级了 TokenizerFast,快速分词,重构了 SFTTrainer,一键开启 SFT 训练。此外,PaddleNLP 还支持了优化器状态的卸载和重载功能,实现了精细化的重新计算,训练性能提升7%。在 Unified Checkpoint 方面,进一步优化了异步保存逻辑,新增 Checkpoint 压缩功能,可节省78.5%存储空间。 最后,在大模型推理方面,升级 Append Attention,支持了 FP8量化,支持投机解码。 +
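To make the PP-UIE upgrade described above concrete, here is a minimal Python sketch of zero-shot information extraction with PaddleNLP's `Taskflow` API. The `paddlenlp/PP-UIE-0.5B` model name comes from the support table later in this diff, but the exact `Taskflow` arguments for loading PP-UIE are an assumption and may differ between releases.

```python
from paddlenlp import Taskflow

# Minimal sketch (assumed usage, not part of this PR): define a zero-shot schema
# and extract the matching spans from raw text. PP-UIE accepts arbitrary
# user-defined schema keys, so no task-specific labeled data is required.
schema = ["时间", "赛事名称", "选手"]
ie = Taskflow(
    "information_extraction",
    schema=schema,
    model="paddlenlp/PP-UIE-0.5B",  # model name taken from the support table below
)
print(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
```

The same call pattern should extend to long documents, which is where the 8192-token context mentioned above matters.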
点击展开
+ * **2024.12.13 📚《飞桨大模型套件 Unified Checkpoint 技术》**,加速模型存储95%,节省空间78%。支持全分布式策略调整自适应转换,提升模型训练的灵活性与可扩展性。训练-压缩-推理统一存储协议,无需手动转换提升全流程体验。Checkpoint 无损压缩结合异步保存,实现秒级存储并降低模型存储成本。适用于智能制造、指挥交通、医疗健康、金融服务等产业实际场景。12月24日(周二)19:00直播为您详细解读该技术如何优化大模型训练流程。报名链接:https://www.wjx.top/vm/huZkHn9.aspx?udsid=787976 * **2024.11.28 📚《FlashRAG-Paddle | 基于 PaddleNLP 的高效开发与评测 RAG 框架》**,为文本更快更好构建准确嵌入表示、加速推理生成速度。PaddleNLP 支持超大 Batch 嵌入表示学习与多硬件高性能推理,涵盖 INT8/INT4量化技术及多种高效注意力机制优化与 TensorCore 深度优化。内置全环节算子融合技术,使得 FlashRAG 推理性能相比 transformers 动态图提升70%以上,结合检索增强知识输出结果更加准确,带来敏捷高效的使用体验。直播时间:12月3日(周二)19:00。报名链接:https://www.wjx.top/vm/eaBa1vA.aspx?udsid=682361 - - -
点击展开
- * **2024.08.08 📚《飞桨产业级大语言模型开发利器 PaddleNLP 3.0 重磅发布》**,训压推全流程贯通,主流模型全覆盖。大模型自动并行,千亿模型训推全流程开箱即用。提供产业级高性能精调与对齐解决方案,压缩推理领先,多硬件适配。覆盖产业级智能助手、内容创作、知识问答、关键信息抽取等应用场景。直播时间:8月22日(周四)19:00。报名链接:https://www.wjx.top/vm/Y2f7FFY.aspx?udsid=143844 * **2024.06.27 [PaddleNLP v3.0 Beta](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v3.0.0-beta0)**:拥抱大模型,体验全升级。统一大模型套件,实现国产计算芯片全流程接入;全面支持飞桨4D 并行配置、高效精调策略、高效对齐算法、高性能推理等大模型产业级应用流程;自研极致收敛的 RsLoRA+算法、自动扩缩容存储机制 Unified Checkpoint 和通用化支持的 FastFFN、FusedQKV 助力大模型训推;主流模型持续支持更新,提供高效解决方案。 @@ -69,39 +74,45 @@ 大模型套件高性能推理模块内置动态插入和全环节算子融合策略,极大加快并行推理速度。底层实现细节封装化,实现开箱即用的高性能并行推理能力。 +## 文档 +更多详细文档, 请访问 [PaddleNLP Documentation](https://paddlenlp.readthedocs.io/). + ------------------------------------------------------------------------------------------ ## 模型支持 * 模型参数已支持 LLaMA 系列、Baichuan 系列、Bloom 系列、ChatGLM 系列、Gemma 系列、Mistral 系列、OPT 系列和 Qwen 系列,详细列表👉【LLM】模型参数支持列表如下: -| 模型系列 | 模型名称 | -|:-------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | facebook/llama-7b, facebook/llama-13b, facebook/llama-30b, facebook/llama-65b | -| [Llama2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Llama-2-7b, meta-llama/Llama-2-7b-chat, meta-llama/Llama-2-13b, meta-llama/Llama-2-13b-chat, meta-llama/Llama-2-70b, meta-llama/Llama-2-70b-chat | -| [Llama3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B, meta-llama/Meta-Llama-3-70B-Instruct | -| [Llama3.1](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Meta-Llama-3.1-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Meta-Llama-3.1-405B, meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Llama-Guard-3-8B | -| [Llama3.2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Llama-3.2-1B, meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-Guard-3-1B | -| [Llama3.3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Llama-3.3-70B-Instruct | -| [Baichuan](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan) | baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Base, baichuan-inc/Baichuan-13B-Chat | -| [Baichuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan) | baichuan-inc/Baichuan2-7B-Base, baichuan-inc/Baichuan2-7B-Chat, baichuan-inc/Baichuan2-13B-Base, baichuan-inc/Baichuan2-13B-Chat | -| [Bloom](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/bloom) | bigscience/bloom-560m, bigscience/bloom-560m-bf16, bigscience/bloom-1b1, bigscience/bloom-3b, bigscience/bloom-7b1, bigscience/bloomz-560m, bigscience/bloomz-1b1, bigscience/bloomz-3b, bigscience/bloomz-7b1-mt, bigscience/bloomz-7b1-p3, bigscience/bloomz-7b1, bellegroup/belle-7b-2m | -| 
[ChatGLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm/) | THUDM/chatglm-6b, THUDM/chatglm-6b-v1.1 | -| [ChatGLM2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2) | THUDM/chatglm2-6b | -| [ChatGLM3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2) | THUDM/chatglm3-6b | -| [DeepSeekV2](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2) | deepseek-ai/DeepSeek-V2, deepseek-ai/DeepSeek-V2-Chat, deepseek-ai/DeepSeek-V2-Lite, deepseek-ai/DeepSeek-V2-Lite-Chat, deepseek-ai/DeepSeek-Coder-V2-Base, deepseek-ai/DeepSeek-Coder-V2-Instruct, deepseek-ai/DeepSeek-Coder-V2-Lite-Base, deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | -| [Gemma](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/gemma) | google/gemma-7b, google/gemma-7b-it, google/gemma-2b, google/gemma-2b-it | -| [Mistral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mistral) | mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-7B-v0.1 | -| [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mixtral) | mistralai/Mixtral-8x7B-Instruct-v0.1 | -| [OPT](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/opt) | facebook/opt-125m, facebook/opt-350m, facebook/opt-1.3b, facebook/opt-2.7b, facebook/opt-6.7b, facebook/opt-13b, facebook/opt-30b, facebook/opt-66b, facebook/opt-iml-1.3b, opt-iml-max-1.3b | -| [Qwen](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | qwen/qwen-7b, qwen/qwen-7b-chat, qwen/qwen-14b, qwen/qwen-14b-chat, qwen/qwen-72b, qwen/qwen-72b-chat, | -| [Qwen1.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen1.5-0.5B, Qwen/Qwen1.5-0.5B-Chat, Qwen/Qwen1.5-1.8B, Qwen/Qwen1.5-1.8B-Chat, Qwen/Qwen1.5-4B, Qwen/Qwen1.5-4B-Chat, Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, Qwen/Qwen1.5-14B, Qwen/Qwen1.5-14B-Chat, Qwen/Qwen1.5-32B, Qwen/Qwen1.5-32B-Chat, Qwen/Qwen1.5-72B, Qwen/Qwen1.5-72B-Chat, Qwen/Qwen1.5-110B, Qwen/Qwen1.5-110B-Chat, Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat | -| [Qwen2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2-0.5B, Qwen/Qwen2-0.5B-Instruct, Qwen/Qwen2-1.5B, Qwen/Qwen2-1.5B-Instruct, Qwen/Qwen2-7B, Qwen/Qwen2-7B-Instruct, Qwen/Qwen2-72B, Qwen/Qwen2-72B-Instruct, Qwen/Qwen2-57B-A14B, Qwen/Qwen2-57B-A14B-Instruct | -| [Qwen2-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2-Math-1.5B, Qwen/Qwen2-Math-1.5B-Instruct, Qwen/Qwen2-Math-7B, Qwen/Qwen2-Math-7B-Instruct, Qwen/Qwen2-Math-72B, Qwen/Qwen2-Math-72B-Instruct, Qwen/Qwen2-Math-RM-72B | -| [Qwen2.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-32B, Qwen/Qwen2.5-32B-Instruct, Qwen/Qwen2.5-72B, Qwen/Qwen2.5-72B-Instruct | -| [Qwen2.5-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen2.5-Math-7B, Qwen/Qwen2.5-Math-7B-Instruct, Qwen/Qwen2.5-Math-72B, Qwen/Qwen2.5-Math-72B-Instruct, Qwen/Qwen2.5-Math-RM-72B | -| [Qwen2.5-Coder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Coder-1.5B, Qwen/Qwen2.5-Coder-1.5B-Instruct, Qwen/Qwen2.5-Coder-7B, 
Qwen/Qwen2.5-Coder-7B-Instruct | -| [Yuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/yuan/) | IEITYuan/Yuan2-2B, IEITYuan/Yuan2-51B, IEITYuan/Yuan2-102B | +| 模型系列 | 模型名称 | +|:-------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [PP-UIE](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/application/information_extraction) | paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B | +| [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | facebook/llama-7b, facebook/llama-13b, facebook/llama-30b, facebook/llama-65b | +| [Llama2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Llama-2-7b, meta-llama/Llama-2-7b-chat, meta-llama/Llama-2-13b, meta-llama/Llama-2-13b-chat, meta-llama/Llama-2-70b, meta-llama/Llama-2-70b-chat | +| [Llama3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B, meta-llama/Meta-Llama-3-70B-Instruct | +| [Llama3.1](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Meta-Llama-3.1-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Meta-Llama-3.1-405B, meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Llama-Guard-3-8B | +| [Llama3.2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Llama-3.2-1B, meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-Guard-3-1B | +| [Llama3.3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/llama) | meta-llama/Llama-3.3-70B-Instruct | +| [Baichuan](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan) | baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Base, baichuan-inc/Baichuan-13B-Chat | +| [Baichuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/baichuan) | baichuan-inc/Baichuan2-7B-Base, baichuan-inc/Baichuan2-7B-Chat, baichuan-inc/Baichuan2-13B-Base, baichuan-inc/Baichuan2-13B-Chat | +| [Bloom](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/bloom) | bigscience/bloom-560m, bigscience/bloom-560m-bf16, bigscience/bloom-1b1, bigscience/bloom-3b, bigscience/bloom-7b1, bigscience/bloomz-560m, bigscience/bloomz-1b1, bigscience/bloomz-3b, bigscience/bloomz-7b1-mt, bigscience/bloomz-7b1-p3, bigscience/bloomz-7b1, bellegroup/belle-7b-2m | +| [ChatGLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm/) | THUDM/chatglm-6b, THUDM/chatglm-6b-v1.1 | +| [ChatGLM2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2) | THUDM/chatglm2-6b | +| [ChatGLM3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/chatglm2) | THUDM/chatglm3-6b | +| [DeepSeekV2](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2) | deepseek-ai/DeepSeek-V2, deepseek-ai/DeepSeek-V2-Chat, deepseek-ai/DeepSeek-V2-Lite, 
deepseek-ai/DeepSeek-V2-Lite-Chat, deepseek-ai/DeepSeek-Coder-V2-Base, deepseek-ai/DeepSeek-Coder-V2-Instruct, deepseek-ai/DeepSeek-Coder-V2-Lite-Base, deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | +| [DeepSeekV3](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2) | deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base | +| [DeepSeek-R1](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/deepseek-v2) | deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero, deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | +| [Gemma](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/gemma) | google/gemma-7b, google/gemma-7b-it, google/gemma-2b, google/gemma-2b-it | +| [Mistral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mistral) | mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-7B-v0.1 | +| [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/mixtral) | mistralai/Mixtral-8x7B-Instruct-v0.1 | +| [OPT](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/opt) | facebook/opt-125m, facebook/opt-350m, facebook/opt-1.3b, facebook/opt-2.7b, facebook/opt-6.7b, facebook/opt-13b, facebook/opt-30b, facebook/opt-66b, facebook/opt-iml-1.3b, opt-iml-max-1.3b | +| [Qwen](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | qwen/qwen-7b, qwen/qwen-7b-chat, qwen/qwen-14b, qwen/qwen-14b-chat, qwen/qwen-72b, qwen/qwen-72b-chat, | +| [Qwen1.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen1.5-0.5B, Qwen/Qwen1.5-0.5B-Chat, Qwen/Qwen1.5-1.8B, Qwen/Qwen1.5-1.8B-Chat, Qwen/Qwen1.5-4B, Qwen/Qwen1.5-4B-Chat, Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, Qwen/Qwen1.5-14B, Qwen/Qwen1.5-14B-Chat, Qwen/Qwen1.5-32B, Qwen/Qwen1.5-32B-Chat, Qwen/Qwen1.5-72B, Qwen/Qwen1.5-72B-Chat, Qwen/Qwen1.5-110B, Qwen/Qwen1.5-110B-Chat, Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat | +| [Qwen2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2-0.5B, Qwen/Qwen2-0.5B-Instruct, Qwen/Qwen2-1.5B, Qwen/Qwen2-1.5B-Instruct, Qwen/Qwen2-7B, Qwen/Qwen2-7B-Instruct, Qwen/Qwen2-72B, Qwen/Qwen2-72B-Instruct, Qwen/Qwen2-57B-A14B, Qwen/Qwen2-57B-A14B-Instruct | +| [Qwen2-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2-Math-1.5B, Qwen/Qwen2-Math-1.5B-Instruct, Qwen/Qwen2-Math-7B, Qwen/Qwen2-Math-7B-Instruct, Qwen/Qwen2-Math-72B, Qwen/Qwen2-Math-72B-Instruct, Qwen/Qwen2-Math-RM-72B | +| [Qwen2.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-32B, Qwen/Qwen2.5-32B-Instruct, Qwen/Qwen2.5-72B, Qwen/Qwen2.5-72B-Instruct | +| [Qwen2.5-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen2.5-Math-7B, Qwen/Qwen2.5-Math-7B-Instruct, Qwen/Qwen2.5-Math-72B, Qwen/Qwen2.5-Math-72B-Instruct, Qwen/Qwen2.5-Math-RM-72B | +| [Qwen2.5-Coder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Coder-1.5B, Qwen/Qwen2.5-Coder-1.5B-Instruct, 
Qwen/Qwen2.5-Coder-7B, Qwen/Qwen2.5-Coder-7B-Instruct | +| [Yuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/yuan/) | IEITYuan/Yuan2-2B, IEITYuan/Yuan2-51B, IEITYuan/Yuan2-102B | * 4D 并行和算子优化已支持 LLaMA 系列、Baichuan 系列、Bloom 系列、ChatGLM 系列、Gemma 系列、Mistral 系列、OPT 系列和 Qwen 系列,【LLM】模型4D 并行和算子支持列表如下: @@ -130,19 +141,19 @@ | Model | Pretrain | SFT | LoRA | FlashMask | Prefix Tuning | DPO/SimPO/ORPO/KTO | RLHF | Mergekit | Quantization | -|--------------------------------------------|:--------:|:---:|:----:|:---------:|:-------------:|:--------------:|:----:|:-----:|:------------:| -| [Llama](./llm/config/llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| [Qwen](./llm/config/qwen) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | 🚧 | -| [Mixtral](./llm/config/mixtral) | ✅ | ✅ | ✅ | 🚧 | 🚧 | ✅ | 🚧 | ✅ | 🚧 | -| [Mistral](./llm/config/mistral) | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ | 🚧 | -| [Baichuan/Baichuan2](./llm/config/llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | -| [ChatGLM-6B](./llm/config/chatglm) | ✅ | ✅ | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | ✅ | -| [ChatGLM2/ChatGLM3](./llm/config/chatglm2) | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ | ✅ | -| [Bloom](./llm/config/bloom) | ✅ | ✅ | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | ✅ | -| [GPT-3](./llm/config/gpt-3) | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 | -| [OPT](./llm/config/opt) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 | -| [Gemma](./llm/config/gemma) | ✅ | ✅ | ✅ | 🚧 | 🚧 | ✅ | 🚧 | ✅ | 🚧 | -| [Yuan](./llm/config/yuan) | ✅ | ✅ | ✅ | 🚧 | 🚧 | ✅ | 🚧 | ✅ | 🚧 | +|--------------------------------------------|:--------:|:---:|:----:|:---------:|:-------------:|:------------------:|:----:|:--------:|:------------:| +| [Llama](./llm/config/llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| [Qwen](./llm/config/qwen) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | 🚧 | +| [Mixtral](./llm/config/mixtral) | ✅ | ✅ | ✅ | 🚧 | 🚧 | ✅ | 🚧 | ✅ | 🚧 | +| [Mistral](./llm/config/mistral) | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ | 🚧 | +| [Baichuan/Baichuan2](./llm/config/llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | +| [ChatGLM-6B](./llm/config/chatglm) | ✅ | ✅ | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | ✅ | +| [ChatGLM2/ChatGLM3](./llm/config/chatglm2) | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ | ✅ | +| [Bloom](./llm/config/bloom) | ✅ | ✅ | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | ✅ | +| [GPT-3](./llm/config/gpt-3) | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 | +| [OPT](./llm/config/opt) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 | +| [Gemma](./llm/config/gemma) | ✅ | ✅ | ✅ | 🚧 | 🚧 | ✅ | 🚧 | ✅ | 🚧 | +| [Yuan](./llm/config/yuan) | ✅ | ✅ | ✅ | 🚧 | 🚧 | ✅ | 🚧 | ✅ | 🚧 | * [大模型推理](./llm/docs/predict/inference.md)已支持 LLaMA 系列、Qwen 系列、Mistral 系列、ChatGLM 系列、Bloom 系列和 Baichuan 系列,支持 Weight Only INT8及 INT4推理,支持 WAC(权重、激活、Cache KV)进行 INT8、FP8量化的推理,【LLM】模型推理支持列表如下: | 模型名称/量化类型支持 | FP16/BF16 | WINT8 | WINT4 | INT8-A8W8 | FP8-A8W8 | INT8-A8W8C8 | @@ -160,7 +171,7 @@ ### 环境依赖 * python >= 3.8 -* paddlepaddle >= 3.0.0b0 +* paddlepaddle >= 3.0.0rc0 如果您尚未安装 PaddlePaddle,请参考 [飞桨官网](https://www.paddlepaddle.org.cn/) 进行安装。 @@ -205,7 +216,7 @@ wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwe wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx cd .. 
# change folder to PaddleNLP/llm # 如需使用use_fused_rms_norm=true,需要前往slm/model_zoo/gpt-3/external_ops安装fused_ln -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json --use_fused_rms_norm false +python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json ``` ### 大模型 SFT 精调 @@ -215,7 +226,7 @@ git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP # 如已 mkdir -p llm/data && cd llm/data wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz cd .. # change folder to PaddleNLP/llm -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json +python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json ``` 更多大模型全流程步骤,请参考[飞桨大模型套件](./llm)介绍。 @@ -230,7 +241,7 @@ dataset = load_dataset("ZHUI/alpaca_demo", split="train") training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", device="gpu") trainer = SFTTrainer( args=training_args, - model="Qwen/Qwen2.5-0.5B", + model="Qwen/Qwen2.5-0.5B-Instruct", train_dataset=dataset, ) trainer.train() diff --git a/README_en.md b/README_en.md index db0edf1ceeaa..d7934748379d 100644 --- a/README_en.md +++ b/README_en.md @@ -7,7 +7,7 @@ ------------------------------------------------------------------------------------------
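As a follow-up to the SFT quick-start snippet above, the sketch below shows one way to load the checkpoint written to the `output_dir` used there (`Qwen/Qwen2.5-0.5B-SFT`) and run generation with PaddleNLP's Auto classes. It is an illustration under assumed defaults (dtype, generation arguments), not code from this PR.

```python
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Load the directory produced by SFTTrainer above; dtype is illustrative.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-SFT")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-SFT", dtype="bfloat16")

# Tokenize a prompt into Paddle ("pd") tensors and generate a short reply.
input_ids = tokenizer("请写一句有创意的广告词:", return_tensors="pd")["input_ids"]
generated_ids, _ = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```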


@@ -52,6 +53,9 @@ The fine-tuning algorithms are deeply integrated with zero-padding data streams The high-performance inference module of the large model toolkit incorporates dynamic insertion and operator fusion strategies throughout the entire process, greatly accelerating parallel inference speed. The underlying implementation details are encapsulated, enabling out-of-the-box high-performance parallel inference capabilities. +## Documentation +For detailed documentation, visit the [PaddleNLP Documentation](https://paddlenlp.readthedocs.io/). + ------------------------------------------------------------------------------------------ ## Support Models @@ -68,7 +72,7 @@ Detailed list 👉 [Supported Model List](https://github.com/PaddlePaddle/Paddle ### Pip Installation ```shell -pip install --upgrade paddlenlp==3.0.0b2 +pip install --upgrade paddlenlp==3.0.0b3 ``` or you can install the latest develop branch code with the following command: diff --git a/csrc/README.md b/csrc/README.md index 02bd4a372e46..24fe14da6756 100644 --- a/csrc/README.md +++ b/csrc/README.md @@ -1,6 +1,9 @@ -# PaddleNLP 自定义 OP +# PaddleNLP 大模型高性能自定义推理算子 -此文档介绍如何编译安装 PaddleNLP 自定义 OP。 +此文档介绍如何编译安装 PaddleNLP 大模型高性能自定义推理算子的安装教程。 + +使用这些高性能算子,可以大幅提升大模型推理速度。 +大模型推理相关教程详见[此处](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/README.md#6-%E6%8E%A8%E7%90%86)。 ## 安装 C++ 依赖 diff --git a/csrc/gpu/append_attention.cu b/csrc/gpu/append_attention.cu index d24a20e48d11..d6f3efbbf3df 100644 --- a/csrc/gpu/append_attention.cu +++ b/csrc/gpu/append_attention.cu @@ -56,6 +56,7 @@ std::vector AppendAttentionKernel( const std::string& cache_quant_type_str, const bool use_neox_rotary_style, const int max_input_length, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float out_linear_in_scale, @@ -97,13 +98,13 @@ std::vector AppendAttentionKernel( if (out_linear_in_scale > 0.0) { if (fabs(quant_max_bound - 127.0f) < 0.000001) { fmha_out = GetEmptyTensor( - {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims}, + {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v}, paddle::DataType::INT8, qkv.place()); } else if (fabs(quant_max_bound - 448.0f) < 0.000001) { fmha_out = GetEmptyTensor( - {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims}, + {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v}, paddle::DataType::FLOAT8_E4M3FN, qkv.place()); }else{ @@ -111,7 +112,7 @@ std::vector AppendAttentionKernel( } } else { fmha_out = GetEmptyTensor( - {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims}, + {meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v}, D, qkv.place()); } @@ -203,6 +204,7 @@ std::vector AppendAttentionKernel( encoder_block_shape_q, max_input_length, max_enc_len_this_time_data, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -240,6 +242,7 @@ std::vector AppendAttentionKernel( encoder_block_shape_q, max_input_length, max_enc_len_this_time_data, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -282,6 +285,7 @@ std::vector AppendAttentionKernel( encoder_block_shape_q, max_input_length, max_enc_len_this_time_data, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -428,6 +432,7 @@ std::vector AppendAttentionKernel( decoder_block_shape_q, max_input_length, max_len_kv_data, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -465,6 +470,7 @@ std::vector AppendAttentionKernel( 
decoder_block_shape_q, max_input_length, max_len_kv_data, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -508,6 +514,7 @@ std::vector AppendAttentionKernel( decoder_block_shape_q, max_input_length, max_len_kv_data, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -565,6 +572,7 @@ std::vector AppendAttention( const std::string& cache_quant_type_str, const bool use_neox_rotary_style, const int max_input_length, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float out_linear_in_scale, @@ -578,9 +586,10 @@ std::vector AppendAttention( meta_data.token_nums = qkv_dims[0]; meta_data.kv_num_heads = key_cache_dims[1]; meta_data.head_dims = key_cache_dims[3]; - const int total_num_head = - qkv_dims[qkv_dims.size() - 1] / meta_data.head_dims; - meta_data.q_num_heads = total_num_head - 2 * meta_data.kv_num_heads; + meta_data.head_dims_v = value_cache.dims()[3]; + const int q_hidden_size = + qkv_dims[qkv_dims.size() - 1] - meta_data.kv_num_heads * (meta_data.head_dims + meta_data.head_dims_v); + meta_data.q_num_heads = q_hidden_size / meta_data.head_dims; meta_data.max_blocks_per_seq = block_tables.dims()[1]; meta_data.block_size = key_cache.dims()[2]; @@ -626,6 +635,7 @@ std::vector AppendAttention( cache_quant_type_str, use_neox_rotary_style, max_input_length, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -672,6 +682,7 @@ std::vector AppendAttention( cache_quant_type_str, use_neox_rotary_style, max_input_length, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -719,6 +730,7 @@ std::vector AppendAttention( cache_quant_type_str, use_neox_rotary_style, max_input_length, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -764,6 +776,7 @@ std::vector AppendAttention( cache_quant_type_str, use_neox_rotary_style, max_input_length, + softmax_scale, quant_max_bound, quant_min_bound, out_linear_in_scale, @@ -821,10 +834,12 @@ std::vector> AppendAttentionInferShape( const paddle::optional>& out_linear_smooths_shape) { const int token_num = qkv_shape[0]; const int kv_num_heads = key_cache_shape[1]; - const int head_dim = key_cache_shape[3]; - const int total_num_head = qkv_shape[qkv_shape.size() - 1] / head_dim; - const int num_heads = total_num_head - 2 * kv_num_heads; - return {{token_num, num_heads * head_dim}, qkv_shape}; + const int head_dim_qk = key_cache_shape[3]; + const int head_dim_v = value_cache_shape[3]; + const int q_hidden_size = + qkv_shape[qkv_shape.size() - 1] - kv_num_heads * (head_dim_qk + head_dim_v); + const int num_heads = q_hidden_size / head_dim_qk; + return {{token_num, num_heads * head_dim_v}, qkv_shape}; } std::vector AppendAttentionInferDtype( @@ -865,6 +880,7 @@ std::vector AppendAttentionInferDtype( const std::string& cache_quant_type_str, const bool use_neox_rotary_style, const int max_input_length, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float out_linear_in_scale, @@ -941,6 +957,7 @@ PD_BUILD_OP(append_attention) "cache_quant_type: std::string", "use_neox_rotary_style: bool", "max_input_length: int", + "softmax_scale: float", "quant_max_bound: float", "quant_min_bound: float", "out_linear_in_scale: float", diff --git a/csrc/gpu/append_attn/append_attention_c16_impl.cuh b/csrc/gpu/append_attn/append_attention_c16_impl.cuh index 3b08d0a85dbc..8b75fa13cdca 100644 --- a/csrc/gpu/append_attn/append_attention_c16_impl.cuh +++ 
b/csrc/gpu/append_attn/append_attention_c16_impl.cuh @@ -23,15 +23,17 @@ template __global__ void multi_query_append_attention_kernel( - T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] + T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] T *__restrict__ cache_k, // [max_block_num, num_heads, block_size, // head_dim] T *__restrict__ cache_v, @@ -46,7 +48,7 @@ __global__ void multi_query_append_attention_kernel( const int max_seq_len, const int max_dec_len, const int max_block_num_per_seq, - const float scale, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -57,7 +59,9 @@ __global__ void multi_query_append_attention_kernel( float *__restrict__ tmp_d, // [token_num, num_chunks, num_heads] OutT *__restrict__ out, const int speculate_max_draft_token_num = 5) { - constexpr uint32_t num_vecs_per_head = HEAD_DIM / num_elems_per_128b(); + constexpr uint32_t num_vecs_per_head_qk = + HEAD_DIM_QK / num_elems_per_128b(); + constexpr uint32_t num_vecs_per_head_v = HEAD_DIM_V / num_elems_per_128b(); const uint32_t btid = blockIdx.x, kv_head_idx = blockIdx.z; const uint32_t kv_num_heads = gridDim.z; const uint32_t q_num_heads = kv_num_heads * GROUP_SIZE; @@ -104,25 +108,30 @@ __global__ void multi_query_append_attention_kernel( extern __shared__ uint8_t smem[]; float s_frag[num_frags_x][num_frags_z][8]; - float o_frag[num_frags_x][num_frags_y][8]; + float o_frag[num_frags_x][num_frags_y_v][8]; float m_frag[num_frags_x][2]; float d_frag[num_frags_x][2]; - init_states(o_frag, m_frag, d_frag); - - const uint32_t q_n_stride = q_num_heads * HEAD_DIM; - const uint32_t q_ori_n_stride = (q_num_heads + kv_num_heads * 2) * HEAD_DIM; - const uint32_t kv_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM; - const uint32_t kv_h_stride = BLOCK_SIZE * HEAD_DIM; - const uint32_t kv_b_stride = HEAD_DIM; + init_states(o_frag, m_frag, d_frag); + + const uint32_t q_n_stride = q_num_heads * HEAD_DIM_V; + const uint32_t q_ori_n_stride = q_num_heads * HEAD_DIM_QK + + kv_num_heads * HEAD_DIM_QK + + kv_num_heads * HEAD_DIM_V; + const uint32_t k_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_QK; + const uint32_t k_h_stride = BLOCK_SIZE * HEAD_DIM_QK; + const uint32_t k_b_stride = HEAD_DIM_QK; + const uint32_t v_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_V; + const uint32_t v_h_stride = BLOCK_SIZE * HEAD_DIM_V; + const uint32_t v_b_stride = HEAD_DIM_V; const uint32_t q_start_seq_id = batch_id * max_seq_len - __ldg(&cum_offsets[batch_id]); const uint32_t q_base_seq_id_this_block = (tile_id * NUM_WARPS + wid) * num_frags_x * 16; const uint32_t q_offset = q_start_seq_id * q_ori_n_stride + - q_head_idx * HEAD_DIM + + q_head_idx * HEAD_DIM_QK + tid % 8 * num_elems_per_128b(); const uint32_t o_offset = q_start_seq_id * q_n_stride + - q_head_idx * HEAD_DIM + + q_head_idx * HEAD_DIM_V + tid % 8 * num_elems_per_128b(); T *q_base_ptr = q + q_offset; T *o_base_ptr_T = nullptr; @@ -130,13 +139,13 @@ __global__ void multi_query_append_attention_kernel( if constexpr (partition_kv) { if (ENABLE_PREFILL) { o_base_ptr_T = tmp_workspace + q_start_seq_id * num_chunks * q_n_stride + - chunk_idx * q_n_stride + q_head_idx * HEAD_DIM + + chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V + tid % 8 * num_elems_per_128b(); } else { o_base_ptr_T = tmp_workspace + batch_id * speculate_max_draft_token_num * num_chunks * q_n_stride + - chunk_idx * q_n_stride + q_head_idx * HEAD_DIM + + chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V + tid % 8 * 
num_elems_per_128b(); } } else { @@ -144,24 +153,42 @@ __global__ void multi_query_append_attention_kernel( } smem_t qo_smem(smem); - uint32_t q_smem_offset_r = smem_t::get_permuted_offset( + uint32_t q_smem_offset_r = smem_t::get_permuted_offset( wid * num_frags_x * 16 + tid % 16, tid / 16); // 16 * 16 - load_q_global_smem( + load_q_global_smem( q_base_ptr, &qo_smem, q_base_seq_id_this_block, q_end, q_ori_n_stride, - HEAD_DIM); + HEAD_DIM_QK); commit_group(); wait_group<0>(); __syncthreads(); +#ifdef DEBUG_PERCISION + if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) { + printf("q_smem(%d * 192个bfloat16):\n", 4 * num_frags_x * 16); + // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * sizeof(CacheT); + T *q_smem_t = reinterpret_cast(qo_smem.base); + for (uint32_t i = 0; i < 4 * num_frags_x * 16; ++i) { + printf("q_smem[%d]:", (int)i); + for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) { + printf("["); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.2f ", (float)q_smem_t[i * HEAD_DIM_QK + j * 8 + k]); + } + printf("]"); + } + printf("\n"); + } + } + __syncthreads(); +#endif + q_smem_inplace_multiply_sm_scale( + &qo_smem, softmax_scale); - q_smem_inplace_multiply_sm_scale(&qo_smem, - scale); - - smem_t k_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM * sizeof(T)), - v_smem(smem + (NUM_WARPS * num_frags_x + num_frags_z) * 16 * HEAD_DIM * + smem_t k_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM_QK * sizeof(T)), + v_smem(smem + (NUM_WARPS * num_frags_x + num_frags_z) * 16 * HEAD_DIM_QK * sizeof(T)); @@ -182,50 +209,55 @@ __global__ void multi_query_append_attention_kernel( chunk_start))) : chunk_len) / (num_frags_z * 16); - uint32_t k_smem_offset_r = smem_t::get_permuted_offset( + uint32_t k_smem_offset_r = smem_t::get_permuted_offset( 8 * (tid / 16) + tid % 8, (tid % 16) / 8); uint32_t v_smem_offset_r = - smem_t::get_permuted_offset(tid % 16, tid / 16); + smem_t::get_permuted_offset(tid % 16, tid / 16); - uint32_t kv_smem_offset_w = smem_t::get_permuted_offset( + uint32_t k_smem_offset_w = smem_t::get_permuted_offset( + wid * 4 + tid / 8, tid % 8); + uint32_t v_smem_offset_w = smem_t::get_permuted_offset( wid * 4 + tid / 8, tid % 8); uint32_t kv_idx_base = chunk_start; int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]); - const uint32_t const_offset = kv_head_idx * kv_h_stride + - (wid * 4 + tid / 8) * kv_b_stride + - tid % 8 * num_elems_per_128b(); - T *cache_k_now = cache_k + block_id * kv_n_stride + const_offset; - T *cache_v_now = cache_v + block_id * kv_n_stride + const_offset; + const uint32_t const_offset_k = kv_head_idx * k_h_stride + + (wid * 4 + tid / 8) * k_b_stride + + tid % 8 * num_elems_per_128b(); + const uint32_t const_offset_v = kv_head_idx * v_h_stride + + (wid * 4 + tid / 8) * v_b_stride + + tid % 8 * num_elems_per_128b(); + T *cache_k_now = cache_k + block_id * k_n_stride + const_offset_k; + T *cache_v_now = cache_v + block_id * v_n_stride + const_offset_v; produce_kv_blockwise(k_smem, - &kv_smem_offset_w, + &k_smem_offset_w, &cache_k_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + k_n_stride, + k_h_stride, + k_b_stride, kv_idx_base, chunk_end); commit_group(); produce_kv_blockwise(v_smem, - &kv_smem_offset_w, + &v_smem_offset_w, &cache_v_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + v_n_stride, + v_h_stride, + v_b_stride, kv_idx_base, chunk_end); commit_group(); @@ -233,10 +265,45 @@ __global__ void multi_query_append_attention_kernel( for (uint32_t iter = 0; iter < num_iterations; 
++iter) { wait_group<1>(); __syncthreads(); - +#ifdef DEBUG_PERCISION + if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) { + printf("k_smem(%d * 192个bfloat16):\n", num_frags_z * 16); + // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * + // sizeof(CacheT); + T *k_smem_t = reinterpret_cast(k_smem.base); + for (uint32_t i = 0; i < num_frags_z * 16; ++i) { + printf("k_smem[%d]:", (int)i); + for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) { + printf("["); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.2f ", (float)k_smem_t[i * HEAD_DIM_QK + j * 8 + k]); + } + printf("]"); + } + printf("\n"); + } + } + __syncthreads(); +#endif // s = qk - compute_qk( + compute_qk( &qo_smem, &q_smem_offset_r, &k_smem, &k_smem_offset_r, s_frag); +#ifdef DEBUG_PERCISION + __syncthreads(); + if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 && + blockIdx.x == 0) { + for (uint32_t i = 0; i < num_frags_x; ++i) { + for (uint32_t j = 0; j < num_frags_z; ++j) { + printf("s_frag[%d][%d]:\n", i, j); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.4f ", s_frag[i][j][k]); + } + printf("\n"); + } + } + } + __syncthreads(); +#endif // mask according to kv_idx and q_idx if (iter >= mask_check_iteration) { mask_s(q_base_seq_id_this_block, kv_idx_base, q_len, @@ -255,7 +322,7 @@ __global__ void multi_query_append_attention_kernel( } // update m,d - update_mdo_states( + update_mdo_states( s_frag, o_frag, m_frag, d_frag); __syncthreads(); @@ -264,43 +331,77 @@ __global__ void multi_query_append_attention_kernel( if (block_id < 0) { block_id = 0; } - cache_k_now = cache_k + block_id * kv_n_stride + const_offset; + cache_k_now = cache_k + block_id * k_n_stride + const_offset_k; produce_kv_blockwise(k_smem, - &kv_smem_offset_w, + &k_smem_offset_w, &cache_k_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + k_n_stride, + k_h_stride, + k_b_stride, kv_idx_base, chunk_end); commit_group(); wait_group<1>(); __syncthreads(); - +#ifdef DEBUG_PERCISION + if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) { + printf("v_smem(%d * 128个bfloat16):\n", num_frags_z * 16); + // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * + // sizeof(CacheT); + T *v_smem_t = reinterpret_cast(v_smem.base); + for (uint32_t i = 0; i < num_frags_z * 16; ++i) { + printf("v_smem[%d]:", (int)i); + for (uint32_t j = 0; j < HEAD_DIM_V / 8; ++j) { + printf("["); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.2f ", (float)v_smem_t[i * HEAD_DIM_V + j * 8 + k]); + } + printf("]"); + } + printf("\n"); + } + } + __syncthreads(); +#endif // compute sfm*v - compute_sfm_v( + compute_sfm_v( &v_smem, &v_smem_offset_r, s_frag, o_frag, d_frag); - +#ifdef DEBUG_PERCISION + __syncthreads(); + if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 && + blockIdx.x == 0) { + for (uint32_t i = 0; i < num_frags_x; ++i) { + for (uint32_t j = 0; j < num_frags_y_v; ++j) { + printf("o_frag[%d][%d]:\n", i, j); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.4f ", s_frag[i][j][k]); + } + printf("\n"); + } + } + } __syncthreads(); - cache_v_now = cache_v + block_id * kv_n_stride + const_offset; +#endif + __syncthreads(); + cache_v_now = cache_v + block_id * v_n_stride + const_offset_v; produce_kv_blockwise(v_smem, - &kv_smem_offset_w, + &v_smem_offset_w, &cache_v_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + v_n_stride, + v_h_stride, + v_b_stride, kv_idx_base, chunk_end); commit_group(); @@ -309,12 +410,28 @@ __global__ void multi_query_append_attention_kernel( __syncthreads(); if 
constexpr (!partition_kv) { - normalize_d(o_frag, d_frag); + normalize_d(o_frag, d_frag); + } +#ifdef DEBUG_PERCISION + __syncthreads(); + if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 && + blockIdx.x == 0) { + for (uint32_t i = 0; i < num_frags_x; ++i) { + for (uint32_t j = 0; j < num_frags_y_v; ++j) { + printf("o_frag[%d][%d]:\n", i, j); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.4f ", s_frag[i][j][k]); + } + printf("\n"); + } + } } + __syncthreads(); +#endif if constexpr (partition_kv) { write_o_reg_gmem_shift_smooth_quant( o_frag, &qo_smem, @@ -328,11 +445,11 @@ __global__ void multi_query_append_attention_kernel( in_scale, q_len, partition_kv ? q_n_stride * num_chunks : q_n_stride, - HEAD_DIM); + HEAD_DIM_V); } else { write_o_reg_gmem_shift_smooth_quant( o_frag, &qo_smem, @@ -346,7 +463,7 @@ __global__ void multi_query_append_attention_kernel( in_scale, q_len, partition_kv ? q_n_stride * num_chunks : q_n_stride, - HEAD_DIM); + HEAD_DIM_V); } @@ -387,15 +504,17 @@ template __global__ void multi_query_append_attention_warp1_4_kernel( - T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] + T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] T *__restrict__ cache_k, // [max_block_num, num_heads, block_size, // head_dim] T *__restrict__ cache_v, @@ -410,7 +529,7 @@ __global__ void multi_query_append_attention_warp1_4_kernel( const int max_seq_len, const int max_dec_len, const int max_block_num_per_seq, - const float scale, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -421,7 +540,9 @@ __global__ void multi_query_append_attention_warp1_4_kernel( float *__restrict__ tmp_d, // [token_num, num_chunks, num_heads] OutT *__restrict__ out, const int speculate_max_draft_token_num = 5) { - constexpr uint32_t num_vecs_per_head = HEAD_DIM / num_elems_per_128b(); + constexpr uint32_t num_vecs_per_head_qk = + HEAD_DIM_QK / num_elems_per_128b(); + constexpr uint32_t num_vecs_per_head_v = HEAD_DIM_V / num_elems_per_128b(); static_assert(NUM_WARP_Q == 1, "NUM_WARP_Q must be 1"); static_assert(NUM_WARP_KV == 4, "NUM_WARP_KV must be 4"); const uint32_t btid = blockIdx.x, kv_head_idx = blockIdx.z; @@ -467,24 +588,29 @@ __global__ void multi_query_append_attention_warp1_4_kernel( extern __shared__ uint8_t smem[]; float s_frag[num_frags_x][num_frags_z][8]; - float o_frag[num_frags_x][num_frags_y][8]; + float o_frag[num_frags_x][num_frags_y_v][8]; float m_frag[num_frags_x][2]; float d_frag[num_frags_x][2]; - init_states(o_frag, m_frag, d_frag); - - const uint32_t q_n_stride = q_num_heads * HEAD_DIM; - const uint32_t q_ori_n_stride = (q_num_heads + kv_num_heads * 2) * HEAD_DIM; - const uint32_t kv_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM; - const uint32_t kv_h_stride = BLOCK_SIZE * HEAD_DIM; - const uint32_t kv_b_stride = HEAD_DIM; + init_states(o_frag, m_frag, d_frag); + + const uint32_t q_n_stride = q_num_heads * HEAD_DIM_V; + const uint32_t q_ori_n_stride = q_num_heads * HEAD_DIM_QK + + kv_num_heads * HEAD_DIM_QK + + kv_num_heads * HEAD_DIM_V; + const uint32_t k_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_QK; + const uint32_t k_h_stride = BLOCK_SIZE * HEAD_DIM_QK; + const uint32_t k_b_stride = HEAD_DIM_QK; + const uint32_t v_n_stride = kv_num_heads * BLOCK_SIZE * HEAD_DIM_V; + const uint32_t v_h_stride = BLOCK_SIZE * HEAD_DIM_V; + const uint32_t v_b_stride = HEAD_DIM_V; const uint32_t q_start_seq_id = batch_id * max_seq_len - __ldg(&cum_offsets[batch_id]); const 
uint32_t q_base_seq_id_this_block = tile_id * num_frags_x * 16; const uint32_t q_offset = q_start_seq_id * q_ori_n_stride + - q_head_idx * HEAD_DIM + + q_head_idx * HEAD_DIM_QK + tid % 8 * num_elems_per_128b(); const uint32_t o_offset = q_start_seq_id * q_n_stride + - q_head_idx * HEAD_DIM + + q_head_idx * HEAD_DIM_V + tid % 8 * num_elems_per_128b(); T *q_base_ptr = q + q_offset; T *o_base_ptr_T = nullptr; @@ -494,41 +620,59 @@ __global__ void multi_query_append_attention_warp1_4_kernel( } else { if (ENABLE_PREFILL) { o_base_ptr_T = tmp_workspace + batch_id * num_chunks * q_n_stride + - chunk_idx * q_n_stride + q_head_idx * HEAD_DIM + + chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V + tid % 8 * num_elems_per_128b(); } else { o_base_ptr_T = tmp_workspace + batch_id * speculate_max_draft_token_num * num_chunks * q_n_stride + - chunk_idx * q_n_stride + q_head_idx * HEAD_DIM + + chunk_idx * q_n_stride + q_head_idx * HEAD_DIM_V + tid % 8 * num_elems_per_128b(); } } smem_t qo_smem(smem); - uint32_t q_smem_offset_r = smem_t::get_permuted_offset( + uint32_t q_smem_offset_r = smem_t::get_permuted_offset( tid % 16, tid / 16); // 16 * 16 load_q_global_smem_multi_warps(q_base_ptr, &qo_smem, q_base_seq_id_this_block, q_end, q_ori_n_stride, - HEAD_DIM); + HEAD_DIM_QK); commit_group(); wait_group<0>(); __syncthreads(); +#ifdef DEBUG_PERCISION_DEC + if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) { + printf("q_smem(%d * 192个bfloat16):\n", num_frags_x * 16); + // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * sizeof(CacheT); + T *q_smem_t = reinterpret_cast(qo_smem.base); + for (uint32_t i = 0; i < 4 * num_frags_x * 16; ++i) { + printf("q_smem[%d]:", (int)i); + for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) { + printf("["); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.2f ", (float)q_smem_t[i * HEAD_DIM_QK + j * 8 + k]); + } + printf("]"); + } + printf("\n"); + } + } + __syncthreads(); +#endif + q_smem_inplace_multiply_sm_scale_multi_warps( + &qo_smem, softmax_scale); - q_smem_inplace_multiply_sm_scale_multi_warps( - &qo_smem, scale); - - smem_t k_smem(smem + num_frags_x * 16 * HEAD_DIM * sizeof(T)), - v_smem(smem + (num_frags_x + NUM_WARP_KV * num_frags_z) * 16 * HEAD_DIM * - sizeof(T)); + smem_t k_smem(smem + num_frags_x * 16 * HEAD_DIM_QK * sizeof(T)), + v_smem(smem + (num_frags_x + NUM_WARP_KV * num_frags_z) * 16 * + HEAD_DIM_QK * sizeof(T)); const uint32_t num_iterations = div_up( CAUSAL @@ -548,34 +692,39 @@ __global__ void multi_query_append_attention_warp1_4_kernel( : chunk_len) / (NUM_WARP_KV * num_frags_z * 16); - uint32_t k_smem_offset_r = smem_t::get_permuted_offset( + uint32_t k_smem_offset_r = smem_t::get_permuted_offset( wid * num_frags_z * 16 + 8 * (tid / 16) + tid % 8, (tid % 16) / 8); - uint32_t v_smem_offset_r = smem_t::get_permuted_offset( + uint32_t v_smem_offset_r = smem_t::get_permuted_offset( wid * num_frags_z * 16 + tid % 16, tid / 16); - uint32_t kv_smem_offset_w = smem_t::get_permuted_offset( + uint32_t k_smem_offset_w = smem_t::get_permuted_offset( + wid * 4 + tid / 8, tid % 8); + uint32_t v_smem_offset_w = smem_t::get_permuted_offset( wid * 4 + tid / 8, tid % 8); uint32_t kv_idx_base = chunk_start; int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]); - const uint32_t const_offset = kv_head_idx * kv_h_stride + - (wid * 4 + tid / 8) * kv_b_stride + - tid % 8 * num_elems_per_128b(); - T *cache_k_now = cache_k + block_id * kv_n_stride + const_offset; - T *cache_v_now = cache_v + block_id * kv_n_stride + const_offset; + const 
uint32_t const_offset_k = kv_head_idx * k_h_stride + + (wid * 4 + tid / 8) * k_b_stride + + tid % 8 * num_elems_per_128b(); + const uint32_t const_offset_v = kv_head_idx * v_h_stride + + (wid * 4 + tid / 8) * v_b_stride + + tid % 8 * num_elems_per_128b(); + T *cache_k_now = cache_k + block_id * k_n_stride + const_offset_k; + T *cache_v_now = cache_v + block_id * v_n_stride + const_offset_v; produce_kv_blockwise(k_smem, - &kv_smem_offset_w, + &k_smem_offset_w, &cache_k_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + k_n_stride, + k_h_stride, + k_b_stride, kv_idx_base, chunk_end); commit_group(); @@ -583,15 +732,15 @@ __global__ void multi_query_append_attention_warp1_4_kernel( produce_kv_blockwise(v_smem, - &kv_smem_offset_w, + &v_smem_offset_w, &cache_v_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + v_n_stride, + v_h_stride, + v_b_stride, kv_idx_base, chunk_end); commit_group(); @@ -600,10 +749,45 @@ __global__ void multi_query_append_attention_warp1_4_kernel( for (uint32_t iter = 0; iter < num_iterations; ++iter) { wait_group<1>(); __syncthreads(); - +#ifdef DEBUG_PERCISION_DEC + if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) { + printf("k_smem(%d * 192个bfloat16):\n", 4 * num_frags_z * 16); + // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * + // sizeof(CacheT); + T *k_smem_t = reinterpret_cast(k_smem.base); + for (uint32_t i = 0; i < num_frags_z * 16; ++i) { + printf("k_smem[%d]:", (int)i); + for (uint32_t j = 0; j < HEAD_DIM_QK / 8; ++j) { + printf("["); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.2f ", (float)k_smem_t[i * HEAD_DIM_QK + j * 8 + k]); + } + printf("]"); + } + printf("\n"); + } + } + __syncthreads(); +#endif // s = qk - compute_qk( + compute_qk( &qo_smem, &q_smem_offset_r, &k_smem, &k_smem_offset_r, s_frag); +#ifdef DEBUG_PERCISION_DEC + __syncthreads(); + if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 && + blockIdx.x == 0) { + for (uint32_t i = 0; i < num_frags_x; ++i) { + for (uint32_t j = 0; j < num_frags_z; ++j) { + printf("s_frag[%d][%d]:\n", i, j); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.4f ", s_frag[i][j][k]); + } + printf("\n"); + } + } + } + __syncthreads(); +#endif // mask according to kv_idx and q_idx if (iter >= mask_check_iteration) { mask_s(q_base_seq_id_this_block, kv_idx_base + wid * num_frags_z * 16, q_len, @@ -622,7 +806,7 @@ __global__ void multi_query_append_attention_warp1_4_kernel( } // update m,d - update_mdo_states( + update_mdo_states( s_frag, o_frag, m_frag, d_frag); __syncthreads(); @@ -631,43 +815,77 @@ __global__ void multi_query_append_attention_warp1_4_kernel( if (block_id < 0) { block_id = 0; } - cache_k_now = cache_k + block_id * kv_n_stride + const_offset; + cache_k_now = cache_k + block_id * k_n_stride + const_offset_k; produce_kv_blockwise(k_smem, - &kv_smem_offset_w, + &k_smem_offset_w, &cache_k_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + k_n_stride, + k_h_stride, + k_b_stride, kv_idx_base, chunk_end); commit_group(); wait_group<1>(); __syncthreads(); - +#ifdef DEBUG_PERCISION_DEC + if (tid == 0 && threadIdx.y == 0 && blockIdx.z == 0 && blockIdx.x == 0) { + printf("v_smem(%d * 128个bfloat16):\n", 4 * num_frags_z * 16); + // const uint32_t k_num = num_frags_z * 64 * HEAD_DIM / 2 * + // sizeof(CacheT); + T *v_smem_t = reinterpret_cast(v_smem.base); + for (uint32_t i = 0; i < num_frags_z * 16; ++i) { + printf("v_smem[%d]:", (int)i); + for (uint32_t j = 0; j < HEAD_DIM_V / 8; ++j) { + printf("["); + for (uint32_t k = 0; k < 
8; ++k) { + printf("%.2f ", (float)v_smem_t[i * HEAD_DIM_V + j * 8 + k]); + } + printf("]"); + } + printf("\n"); + } + } + __syncthreads(); +#endif // compute sfm*v - compute_sfm_v( + compute_sfm_v( &v_smem, &v_smem_offset_r, s_frag, o_frag, d_frag); __syncthreads(); - - cache_v_now = cache_v + block_id * kv_n_stride + const_offset; +#ifdef DEBUG_PERCISION_DEC + __syncthreads(); + if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 && + blockIdx.x == 0) { + for (uint32_t i = 0; i < num_frags_x; ++i) { + for (uint32_t j = 0; j < num_frags_y_v; ++j) { + printf("o_frag[%d][%d]:\n", i, j); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.4f ", s_frag[i][j][k]); + } + printf("\n"); + } + } + } + __syncthreads(); +#endif + cache_v_now = cache_v + block_id * v_n_stride + const_offset_v; produce_kv_blockwise(v_smem, - &kv_smem_offset_w, + &v_smem_offset_w, &cache_v_now, kv_head_idx, - kv_n_stride, - kv_h_stride, - kv_b_stride, + v_n_stride, + v_h_stride, + v_b_stride, kv_idx_base, chunk_end); commit_group(); @@ -675,19 +893,34 @@ __global__ void multi_query_append_attention_warp1_4_kernel( wait_group<0>(); __syncthreads(); - merge_block_res_v2( + merge_block_res_v2( o_frag, reinterpret_cast(smem), m_frag, d_frag, wid, tid); if (num_chunks_this_seq <= 1) { - normalize_d(o_frag, d_frag); + normalize_d(o_frag, d_frag); } - +#ifdef DEBUG_PERCISION_DEC + __syncthreads(); + if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.z == 0 && + blockIdx.x == 0) { + for (uint32_t i = 0; i < num_frags_x; ++i) { + for (uint32_t j = 0; j < num_frags_y_v; ++j) { + printf("o_frag[%d][%d]:\n", i, j); + for (uint32_t k = 0; k < 8; ++k) { + printf("%.4f ", s_frag[i][j][k]); + } + printf("\n"); + } + } + } + __syncthreads(); +#endif // write o // [num_frags_x, 16, num_frags_y, 16] if (num_chunks_this_seq <= 1) { write_o_reg_gmem_multi_warps_shift_smooth_quant( o_frag, &qo_smem, @@ -701,11 +934,11 @@ __global__ void multi_query_append_attention_warp1_4_kernel( in_scale, q_len, q_n_stride, - HEAD_DIM); + HEAD_DIM_V); } else { write_o_reg_gmem_multi_warps_shift_smooth_quant( o_frag, &qo_smem, @@ -719,7 +952,7 @@ __global__ void multi_query_append_attention_warp1_4_kernel( in_scale, q_len, q_n_stride * num_chunks, - HEAD_DIM); + HEAD_DIM_V); } if (num_chunks_this_seq > 1) { @@ -757,7 +990,8 @@ __global__ void multi_query_append_attention_warp1_4_kernel( template ; if (smem_size >= 48 * 1024) { @@ -853,11 +1090,13 @@ void MultiQueryAppendAttention( num_warps, NUM_WARP_Q, NUM_WARP_KV, - HEAD_DIM, + HEAD_DIM_QK, + HEAD_DIM_V, BLOCK_SIZE, num_frags_x, num_frags_z, - num_frags_y, + num_frags_y_qk, + num_frags_y_v, OUT_NV_TYPE, ENABLE_PREFILL>; if (smem_size >= 48 * 1024) { @@ -885,7 +1124,7 @@ void MultiQueryAppendAttention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -899,9 +1138,10 @@ void MultiQueryAppendAttention( } else { phi::Allocator::AllocationPtr tmp_workspace, tmp_m, tmp_d; if (ENABLE_PREFILL) { - tmp_workspace = allocator->Allocate( - phi::SizeOf(qkv.dtype()) * - static_cast(token_num * num_chunks * num_heads * HEAD_DIM)); + tmp_workspace = + allocator->Allocate(phi::SizeOf(qkv.dtype()) * + static_cast(token_num * num_chunks * + num_heads * HEAD_DIM_V)); tmp_m = allocator->Allocate( phi::SizeOf(paddle::DataType::FLOAT32) * static_cast(token_num * num_chunks * num_heads)); @@ -912,7 +1152,7 @@ void MultiQueryAppendAttention( tmp_workspace = allocator->Allocate( phi::SizeOf(qkv.dtype()) * static_cast(speculate_max_draft_token_num * 
bsz * - num_chunks * num_heads * HEAD_DIM)); + num_chunks * num_heads * HEAD_DIM_V)); tmp_m = allocator->Allocate( phi::SizeOf(paddle::DataType::FLOAT32) * static_cast(speculate_max_draft_token_num * bsz * @@ -942,7 +1182,7 @@ void MultiQueryAppendAttention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -955,14 +1195,14 @@ void MultiQueryAppendAttention( // merge constexpr int vec_size = num_elems_per_128b(); if (is_decoder) { - constexpr int blockx = HEAD_DIM / vec_size; + constexpr int blockx = HEAD_DIM_V / vec_size; constexpr int blocky = (128 + blockx - 1) / blockx; dim3 grids_merge(bsz, num_heads); dim3 blocks_merge(blockx, blocky); merge_multi_chunks_decoder_kernel <<>>( @@ -987,9 +1227,9 @@ void MultiQueryAppendAttention( num_chunks, num_heads, chunk_size, - HEAD_DIM); + HEAD_DIM_V); } else { - constexpr int blockx = HEAD_DIM / vec_size; + constexpr int blockx = HEAD_DIM_V / vec_size; constexpr int blocky = (128 + blockx - 1) / blockx; dim3 grids_merge(min(sm_count * 4, token_num), num_heads); // 128k is too large @@ -997,7 +1237,7 @@ void MultiQueryAppendAttention( merge_multi_chunks_v2_kernel <<>>( @@ -1022,7 +1262,7 @@ void MultiQueryAppendAttention( num_chunks, num_heads, chunk_size, - HEAD_DIM, + HEAD_DIM_V, token_num, speculate_max_draft_token_num); } @@ -1030,8 +1270,9 @@ void MultiQueryAppendAttention( } else { constexpr uint32_t num_frags_z = BLOCK_SIZE / 16 / NUM_WARP_KV; constexpr uint32_t smem_size = - (num_frags_x + NUM_WARP_KV * num_frags_z * 2) * 16 * HEAD_DIM * - sizeof(T); + ((num_frags_x + NUM_WARP_KV * num_frags_z) * HEAD_DIM_QK + + NUM_WARP_KV * num_frags_z * HEAD_DIM_V) * + 16 * sizeof(T); auto split_kv_kernel = multi_query_append_attention_warp1_4_kernel; if (smem_size >= 48 * 1024) { @@ -1074,11 +1317,13 @@ void MultiQueryAppendAttention( num_warps, NUM_WARP_Q, NUM_WARP_KV, - HEAD_DIM, + HEAD_DIM_QK, + HEAD_DIM_V, BLOCK_SIZE, num_frags_x, num_frags_z, - num_frags_y, + num_frags_y_qk, + num_frags_y_v, OUT_NV_TYPE, ENABLE_PREFILL>; if (smem_size >= 48 * 1024) { @@ -1106,7 +1351,7 @@ void MultiQueryAppendAttention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1121,7 +1366,7 @@ void MultiQueryAppendAttention( if (is_decoder) { tmp_workspace = allocator->Allocate( phi::SizeOf(qkv.dtype()) * - static_cast(bsz * num_chunks * num_heads * HEAD_DIM)); + static_cast(bsz * num_chunks * num_heads * HEAD_DIM_V)); tmp_m = allocator->Allocate( phi::SizeOf(paddle::DataType::FLOAT32) * static_cast(bsz * num_chunks * num_heads)); @@ -1133,7 +1378,7 @@ void MultiQueryAppendAttention( tmp_workspace = allocator->Allocate(phi::SizeOf(qkv.dtype()) * static_cast(token_num * num_chunks * - num_heads * HEAD_DIM)); + num_heads * HEAD_DIM_V)); tmp_m = allocator->Allocate( phi::SizeOf(paddle::DataType::FLOAT32) * static_cast(token_num * num_chunks * num_heads)); @@ -1144,7 +1389,7 @@ void MultiQueryAppendAttention( tmp_workspace = allocator->Allocate( phi::SizeOf(qkv.dtype()) * static_cast(speculate_max_draft_token_num * bsz * - num_chunks * num_heads * HEAD_DIM)); + num_chunks * num_heads * HEAD_DIM_V)); tmp_m = allocator->Allocate( phi::SizeOf(paddle::DataType::FLOAT32) * static_cast(speculate_max_draft_token_num * bsz * @@ -1174,7 +1419,7 @@ void MultiQueryAppendAttention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1188,14 +1433,14 @@ void 
MultiQueryAppendAttention( // merge constexpr int vec_size = num_elems_per_128b(); if (is_decoder) { - constexpr int blockx = HEAD_DIM / vec_size; + constexpr int blockx = HEAD_DIM_V / vec_size; constexpr int blocky = (128 + blockx - 1) / blockx; dim3 grids_merge(bsz, num_heads); dim3 blocks_merge(blockx, blocky); merge_multi_chunks_decoder_kernel <<>>( @@ -1220,17 +1465,16 @@ void MultiQueryAppendAttention( num_chunks, num_heads, chunk_size, - HEAD_DIM); + HEAD_DIM_V); } else { - constexpr int blockx = HEAD_DIM / vec_size; + constexpr int blockx = HEAD_DIM_V / vec_size; constexpr int blocky = (128 + blockx - 1) / blockx; - dim3 grids_merge(min(sm_count * 4, token_num), - num_heads); + dim3 grids_merge(min(sm_count * 4, token_num), num_heads); dim3 blocks_merge(blockx, blocky); merge_multi_chunks_v2_kernel <<>>( @@ -1255,7 +1499,7 @@ void MultiQueryAppendAttention( num_chunks, num_heads, chunk_size, - HEAD_DIM, + HEAD_DIM_V, token_num, speculate_max_draft_token_num); } @@ -1265,37 +1509,39 @@ void MultiQueryAppendAttention( template void CascadeAppendAttentionC16Kernel( - const AppendAttnMetaData& meta_data, - const paddle::Tensor& qkv, // [token_num, (num_heads + 2* kv_num_head) * head_dim] - const paddle::Tensor& - cache_k, // [max_block_num, num_heads, block_size, head_dim] - const paddle::Tensor& - cache_v, // [max_block_num, num_heads, head_dim, block_size] - const paddle::optional& attn_mask, - const paddle::optional& - cache_k_scale, // [num_kv_heads, head_dim] - const paddle::optional& - cache_v_scale, // [num_kv_heads, head_dim] - const paddle::optional& - cache_k_zp, // [num_kv_heads, head_dim] - const paddle::optional& - cache_v_zp, // [num_kv_heads, head_dim] - const paddle::optional& - shift_bias, // [num_kv_heads, head_dim] - const paddle::optional& - smooth_weight, // [num_kv_heads, head_dim] - const paddle::Tensor& seq_lens_q, - const paddle::Tensor& seq_lens_kv, - const paddle::Tensor& seq_lens_encoder, - const paddle::Tensor& padding_offsets, - const paddle::Tensor& cum_offsets, - const paddle::Tensor& block_table, - const paddle::Tensor& batch_ids, - const paddle::Tensor& tile_ids_per_batch, + const AppendAttnMetaData &meta_data, + const paddle::Tensor + &qkv, // [token_num, (num_heads + 2* kv_num_head) * head_dim] + const paddle::Tensor + &cache_k, // [max_block_num, num_heads, block_size, head_dim] + const paddle::Tensor + &cache_v, // [max_block_num, num_heads, head_dim, block_size] + const paddle::optional &attn_mask, + const paddle::optional + &cache_k_scale, // [num_kv_heads, head_dim] + const paddle::optional + &cache_v_scale, // [num_kv_heads, head_dim] + const paddle::optional + &cache_k_zp, // [num_kv_heads, head_dim] + const paddle::optional + &cache_v_zp, // [num_kv_heads, head_dim] + const paddle::optional + &shift_bias, // [num_kv_heads, head_dim] + const paddle::optional + &smooth_weight, // [num_kv_heads, head_dim] + const paddle::Tensor &seq_lens_q, + const paddle::Tensor &seq_lens_kv, + const paddle::Tensor &seq_lens_encoder, + const paddle::Tensor &padding_offsets, + const paddle::Tensor &cum_offsets, + const paddle::Tensor &block_table, + const paddle::Tensor &batch_ids, + const paddle::Tensor &tile_ids_per_batch, const int num_blocks, const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -1303,14 +1549,15 @@ void CascadeAppendAttentionC16Kernel( const bool causal, const bool is_decoder, const bool enable_prefill, - 
cudaStream_t& stream, - paddle::Tensor* out) { + cudaStream_t &stream, + paddle::Tensor *out) { const auto token_num = meta_data.token_nums; const auto block_size = meta_data.block_size; const auto bsz = meta_data.batch_size; const auto num_heads = meta_data.q_num_heads; const auto group_size = meta_data.q_num_heads / meta_data.kv_num_heads; - const auto head_dim = meta_data.head_dims; + const auto head_dim_qk = meta_data.head_dims; + const auto head_dim_v = meta_data.head_dims_v; DISPATCH_CAUSAL( causal, @@ -1322,46 +1569,51 @@ void CascadeAppendAttentionC16Kernel( group_size, GROUP_SIZE, {DISPATCH_HEAD_DIM( - head_dim, - HEAD_DIM, - {DISPATCH_BLOCK_SIZE( - block_size, - BLOCK_SIZE, - {DISPATCH_BLOCKSHAPE_Q( - block_shape_q, BLOCK_SHAPE_Q, NUM_WARP_Q, { - MultiQueryAppendAttention( - meta_data, - qkv, - cache_k, - cache_v, - attn_mask, - shift_bias, - smooth_weight, - seq_lens_q, - seq_lens_kv, - seq_lens_encoder, - padding_offsets, - cum_offsets, - block_table, - batch_ids, - tile_ids_per_batch, - num_blocks, - max_seq_len, - max_dec_len, - quant_max_bound, - quant_min_bound, - in_scale, - speculate_max_draft_token_num, - is_decoder, - stream, - out); - })})})})})}) + head_dim_qk, + HEAD_DIM_QK, + {DISPATCH_HEAD_DIM( + head_dim_v, + HEAD_DIM_V, + {DISPATCH_BLOCK_SIZE( + block_size, + BLOCK_SIZE, + {DISPATCH_BLOCKSHAPE_Q( + block_shape_q, BLOCK_SHAPE_Q, NUM_WARP_Q, { + MultiQueryAppendAttention( + meta_data, + qkv, + cache_k, + cache_v, + attn_mask, + shift_bias, + smooth_weight, + seq_lens_q, + seq_lens_kv, + seq_lens_encoder, + padding_offsets, + cum_offsets, + block_table, + batch_ids, + tile_ids_per_batch, + num_blocks, + max_seq_len, + max_dec_len, + softmax_scale, + quant_max_bound, + quant_min_bound, + in_scale, + speculate_max_draft_token_num, + is_decoder, + stream, + out); + })})})})})})}) } diff --git a/csrc/gpu/append_attn/append_attention_c4_impl.cuh b/csrc/gpu/append_attn/append_attention_c4_impl.cuh index 7d49de3966e0..fac1baf6f4c2 100644 --- a/csrc/gpu/append_attn/append_attention_c4_impl.cuh +++ b/csrc/gpu/append_attn/append_attention_c4_impl.cuh @@ -51,7 +51,7 @@ __global__ void multi_query_append_attention_c4_kernel( const int max_seq_len, const int max_dec_len, const int max_block_num_per_seq, - const float scale, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -189,7 +189,7 @@ __global__ void multi_query_append_attention_c4_kernel( __syncthreads(); q_smem_inplace_multiply_sm_scale(&qo_smem, - scale); + softmax_scale); T cache_k_scale_frag[num_frags_y][4]; T cache_k_zp_frag[num_frags_y][4]; @@ -509,7 +509,7 @@ __global__ void multi_query_append_attention_c4_warp1_4_kernel( const int max_seq_len, const int max_dec_len, const int max_block_num_per_seq, - const float scale, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -649,7 +649,7 @@ __global__ void multi_query_append_attention_c4_warp1_4_kernel( __syncthreads(); q_smem_inplace_multiply_sm_scale_multi_warps( - &qo_smem, scale); + &qo_smem, softmax_scale); T cache_k_scale_frag[num_frags_y][4]; T cache_k_zp_frag[num_frags_y][4]; @@ -970,6 +970,7 @@ void MultiQueryAppendC4Attention( const int num_blocks_x_cpu, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -994,8 +995,6 @@ void MultiQueryAppendC4Attention( auto *allocator = paddle::GetAllocator(qkv.place()); - const float scale = 1.f / 
sqrt(HEAD_DIM); - if constexpr (NUM_WARP_Q == 4) { constexpr uint32_t num_frags_z = BLOCK_SIZE / 16; constexpr uint32_t smem_size = @@ -1091,7 +1090,7 @@ void MultiQueryAppendC4Attention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1154,7 +1153,7 @@ void MultiQueryAppendC4Attention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1336,7 +1335,7 @@ void MultiQueryAppendC4Attention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1412,7 +1411,7 @@ void MultiQueryAppendC4Attention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1533,6 +1532,7 @@ void CascadeAppendAttentionC4Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -1597,6 +1597,7 @@ void CascadeAppendAttentionC4Kernel( num_blocks, max_seq_len, max_dec_len, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, diff --git a/csrc/gpu/append_attn/append_attention_c8_impl.cuh b/csrc/gpu/append_attn/append_attention_c8_impl.cuh index e0ede51a9c81..df2357bb192b 100644 --- a/csrc/gpu/append_attn/append_attention_c8_impl.cuh +++ b/csrc/gpu/append_attn/append_attention_c8_impl.cuh @@ -32,7 +32,7 @@ template __global__ void multi_query_append_attention_c8_kernel( - T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] + T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] CacheT *__restrict__ cache_k, // [max_block_num, num_heads, block_size, // head_dim] CacheT *__restrict__ cache_v, @@ -49,7 +49,7 @@ __global__ void multi_query_append_attention_c8_kernel( const int max_seq_len, const int max_dec_len, const int max_block_num_per_seq, - const float scale, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -172,7 +172,7 @@ __global__ void multi_query_append_attention_c8_kernel( __syncthreads(); q_smem_inplace_multiply_sm_scale(&qo_smem, - scale); + softmax_scale); smem_t k_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM * sizeof(T)), v_smem(smem + NUM_WARPS * num_frags_x * 16 * HEAD_DIM * sizeof(T) + num_frags_z * 16 * HEAD_DIM * sizeof(CacheT)); @@ -206,8 +206,7 @@ __global__ void multi_query_append_attention_c8_kernel( uint32_t k_smem_offset_w = smem_t::get_permuted_offset( - wid * 4 + tid / 8, - tid % 8); + wid * 4 + tid / 8, tid % 8); uint32_t v_smem_offset_w = smem_t::get_permuted_offset( wid * 8 + tid / 4, tid % 4); // 4 * 128 / 8 = 64 @@ -338,7 +337,6 @@ __global__ void multi_query_append_attention_c8_kernel( chunk_end, const_v_offset); commit_group(); - } wait_group<0>(); __syncthreads(); @@ -434,7 +432,7 @@ template __global__ void multi_query_append_attention_c8_warp1_4_kernel( - T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] + T *__restrict__ q, // [token_num, (num_heads + 2* kv_num_head) * head_dim] CacheT *__restrict__ cache_k, // [max_block_num, num_heads, block_size, // head_dim] CacheT *__restrict__ cache_v, @@ -451,7 +449,7 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel( const int max_seq_len, const int max_dec_len, const int max_block_num_per_seq, - const float scale, + const float softmax_scale, const float quant_max_bound, const 
float quant_min_bound, const float in_scale, @@ -575,7 +573,7 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel( __syncthreads(); q_smem_inplace_multiply_sm_scale_multi_warps( - &qo_smem, scale); + &qo_smem, softmax_scale); smem_t k_smem(smem + num_frags_x * 16 * HEAD_DIM * sizeof(T)), v_smem(smem + num_frags_x * 16 * HEAD_DIM * sizeof(T) + @@ -610,12 +608,10 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel( uint32_t k_smem_offset_w = smem_t::get_permuted_offset( - wid * 4 + tid / 8, - tid % - 8); + wid * 4 + tid / 8, tid % 8); uint32_t v_smem_offset_w = smem_t::get_permuted_offset( - wid * 8 + tid / 4, tid % 4); + wid * 8 + tid / 4, tid % 4); uint32_t kv_idx_base = chunk_start; const uint32_t const_k_offset = kv_head_idx * kv_h_stride + @@ -805,7 +801,6 @@ __global__ void multi_query_append_attention_c8_warp1_4_kernel( const uint32_t qo_head_idx = q_head_idx + qo_idx_now % GROUP_SIZE; const uint32_t qo_idx = q_start_seq_id + qo_idx_now / GROUP_SIZE; if (qo_idx - q_start_seq_id < q_len) { - uint32_t offset; if (ENABLE_PREFILL) { offset = (batch_id * num_chunks + chunk_idx) * q_num_heads + @@ -857,6 +852,7 @@ void MultiQueryAppendC8Attention( const int num_blocks_x_cpu, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -881,8 +877,6 @@ void MultiQueryAppendC8Attention( auto *allocator = paddle::GetAllocator(qkv.place()); - const float scale = 1.f / sqrt(HEAD_DIM); - if constexpr (NUM_WARP_Q == 4) { constexpr uint32_t num_frags_z = BLOCK_SIZE / 16; constexpr uint32_t smem_size = @@ -963,7 +957,7 @@ void MultiQueryAppendC8Attention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1020,7 +1014,7 @@ void MultiQueryAppendC8Attention( max_seq_len, max_dec_len, max_block_num_per_seq, - scale, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -1069,8 +1063,7 @@ void MultiQueryAppendC8Attention( } else { constexpr int blockx = HEAD_DIM / vec_size; constexpr int blocky = (128 + blockx - 1) / blockx; - dim3 grids_merge(min(sm_count * 4, token_num), - num_heads); + dim3 grids_merge(min(sm_count * 4, token_num), num_heads); dim3 blocks_merge(blockx, blocky); merge_multi_chunks_v2_kernel void CascadeAppendAttentionC8Kernel( - const AppendAttnMetaData& meta_data, - const paddle::Tensor& qkv, // [token_num, (num_heads + 2* kv_num_head) * head_dim] - const paddle::Tensor& - cache_k, // [max_block_num, num_heads, block_size, head_dim] - const paddle::Tensor& - cache_v, // [max_block_num, num_heads, head_dim, block_size] - const paddle::optional& attn_mask, - const paddle::optional& - cache_k_scale, // [num_kv_heads, head_dim] - const paddle::optional& - cache_v_scale, // [num_kv_heads, head_dim] - const paddle::optional& - cache_k_zp, // [num_kv_heads, head_dim] - const paddle::optional& - cache_v_zp, // [num_kv_heads, head_dim] - const paddle::optional& - shift_bias, // [num_kv_heads, head_dim] - const paddle::optional& - smooth_weight, // [num_kv_heads, head_dim] - const paddle::Tensor& seq_lens_q, - const paddle::Tensor& seq_lens_kv, - const paddle::Tensor& seq_lens_encoder, - const paddle::Tensor& padding_offsets, - const paddle::Tensor& cum_offsets, - const paddle::Tensor& block_table, - const paddle::Tensor& batch_ids, - const paddle::Tensor& tile_ids_per_batch, + const AppendAttnMetaData &meta_data, + const paddle::Tensor + &qkv, // [token_num, (num_heads + 2* kv_num_head) * 
head_dim] + const paddle::Tensor + &cache_k, // [max_block_num, num_heads, block_size, head_dim] + const paddle::Tensor + &cache_v, // [max_block_num, num_heads, head_dim, block_size] + const paddle::optional &attn_mask, + const paddle::optional + &cache_k_scale, // [num_kv_heads, head_dim] + const paddle::optional + &cache_v_scale, // [num_kv_heads, head_dim] + const paddle::optional + &cache_k_zp, // [num_kv_heads, head_dim] + const paddle::optional + &cache_v_zp, // [num_kv_heads, head_dim] + const paddle::optional + &shift_bias, // [num_kv_heads, head_dim] + const paddle::optional + &smooth_weight, // [num_kv_heads, head_dim] + const paddle::Tensor &seq_lens_q, + const paddle::Tensor &seq_lens_kv, + const paddle::Tensor &seq_lens_encoder, + const paddle::Tensor &padding_offsets, + const paddle::Tensor &cum_offsets, + const paddle::Tensor &block_table, + const paddle::Tensor &batch_ids, + const paddle::Tensor &tile_ids_per_batch, const int num_blocks, const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -1379,8 +1373,8 @@ void CascadeAppendAttentionC8Kernel( const bool causal, const bool is_decoder, const bool enable_prefill, - cudaStream_t& stream, - paddle::Tensor* out) { + cudaStream_t &stream, + paddle::Tensor *out) { const auto token_num = meta_data.token_nums; const auto block_size = meta_data.block_size; const auto bsz = meta_data.batch_size; @@ -1434,6 +1428,7 @@ void CascadeAppendAttentionC8Kernel( num_blocks, max_seq_len, max_dec_len, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, diff --git a/csrc/gpu/append_attn/append_attention_kernel.h b/csrc/gpu/append_attn/append_attention_kernel.h index b0fabcf893d3..b34c2a044733 100644 --- a/csrc/gpu/append_attn/append_attention_kernel.h +++ b/csrc/gpu/append_attn/append_attention_kernel.h @@ -49,6 +49,7 @@ void CascadeAppendAttentionC16Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -92,6 +93,7 @@ void CascadeAppendAttentionC8Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -135,6 +137,7 @@ void CascadeAppendAttentionC4Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -179,6 +182,7 @@ void CascadeAppendAttentionKernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, @@ -212,6 +216,7 @@ void CascadeAppendAttentionKernel( block_shape_q, max_seq_len, max_dec_len, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -245,6 +250,7 @@ void CascadeAppendAttentionKernel( block_shape_q, max_seq_len, max_dec_len, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, @@ -278,6 +284,7 @@ void CascadeAppendAttentionKernel( block_shape_q, max_seq_len, max_dec_len, + softmax_scale, quant_max_bound, quant_min_bound, in_scale, diff --git a/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh b/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh index 1a8e73759022..5fbb53f05801 100644 --- 
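// softmax_scale is now threaded through every CascadeAppendAttention*Kernel
// instead of being recomputed as 1.f / sqrt(HEAD_DIM) inside the launcher,
// which keeps the scaling well-defined when the QK and V head dims differ.
// A minimal sketch of how a caller might derive it; the actual call site is
// outside this patch, so treat this helper as an assumption:
#include <cmath>

inline float default_softmax_scale(int head_dim_qk) {
  // Standard scaled-dot-product-attention factor 1/sqrt(d_k).
  return 1.0f / std::sqrt(static_cast<float>(head_dim_qk));
}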
a/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh +++ b/csrc/gpu/append_attn/decoder_write_cache_with_rope_impl.cuh @@ -122,6 +122,91 @@ __global__ void append_decode_cache_T_rope_kernel( } } +template +__global__ void append_decode_cache_T_kernel( + const T* __restrict__ qkv, // [bsz, num_heads + 2 * kv_num_heads, + // head_size] + T* __restrict__ key_cache, // [num_blocks, kv_num_heads, block_size, + // head_size // 2] + T* __restrict__ value_cache, // [num_blocks, kv_num_heads, block_size, + // head_size // 2] + const int* __restrict__ block_tables, // [bsz, max_blocks_per_seq] + const int* __restrict__ padding_offsets, // [num_tokens] + const int* __restrict__ cum_offsets, + const int* __restrict__ seq_lens, // [bsz] + const int* __restrict__ seq_lens_encoder, // [bsz] + const int max_seq_len, + const int max_blocks_per_seq, + const int num_heads, + const int head_size_qk, + const int head_size_v, + const int block_size, + const uint32_t elem_cnt, + const int kv_num_heads) { + using LoadT = AlignedVector; + using LoadBiasT = AlignedVector; + using LoadKVT = AlignedVector; + constexpr int HalfVecSize = VecSize / 2; + using LoadEmbT = AlignedVector; + LoadT src_vec; + LoadBiasT out_vec; + LoadKVT cache_vec; + + int64_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x; + // const int64_t hidden_size = (num_heads + 2 * kv_num_heads) * head_size; + const uint32_t hidden_size_q = num_heads * head_size_qk; + const uint32_t hidden_size_k = kv_num_heads * head_size_qk; + const uint32_t hidden_size_v = kv_num_heads * head_size_v; + const int64_t hidden_size = hidden_size_q + hidden_size_k + hidden_size_v; + const uint32_t offset = kv_num_heads * (head_size_qk + head_size_v); + // const int64_t offset = 2 * hidden_size; + // const int half_head_size = head_size / 2; + for (int32_t linear_index = global_thread_idx * VecSize, + step = gridDim.x * blockDim.x * VecSize; + linear_index < elem_cnt; + linear_index += step) { + const int ori_bi = linear_index / offset; + const int bias = linear_index % offset; + const int start_token_idx = ori_bi * max_seq_len - cum_offsets[ori_bi]; + if (seq_lens_encoder[ori_bi] > 0) return; + const int write_seq_id = seq_lens[ori_bi]; + + if (write_seq_id == 0) continue; + + const int* block_table_now = nullptr; + + block_table_now = block_tables + ori_bi * max_blocks_per_seq; + const int block_idx = block_table_now[write_seq_id / block_size]; + const int block_offset = write_seq_id % block_size; + + if (bias < hidden_size_k) { + const uint32_t qkv_bias = bias; + const uint32_t hi = qkv_bias / head_size_qk; + const uint32_t h_bias = qkv_bias % head_size_qk; + const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_qk + + hi * block_size * head_size_qk + + block_offset * head_size_qk + h_bias; + const uint32_t ori_idx = + start_token_idx * hidden_size + + hidden_size_q + qkv_bias; + Load(&qkv[ori_idx], &src_vec); + Store(src_vec, &key_cache[tgt_idx]); + } else { + const uint32_t qkv_bias = bias - hidden_size_k; + const uint32_t hi = qkv_bias / head_size_v; + const uint32_t h_bias = qkv_bias % head_size_v; + const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_v + + hi * block_size * head_size_v + + block_offset * head_size_v + h_bias; + const uint32_t ori_idx = + start_token_idx * hidden_size + + hidden_size_q + hidden_size_k + qkv_bias; + Load(&qkv[ori_idx], &src_vec); + Store(src_vec, &value_cache[tgt_idx]); + } + } +} + template __global__ void append_decode_cache_T_rope_kernel( const int* __restrict__ 
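// A minimal host-side restatement of the index arithmetic used by
// append_decode_cache_T_kernel above, for one element of a decoding token's
// fused [q | k | v] row. All parameters are stand-ins for the kernel's
// arguments; the sketch only illustrates how the K/V split works when
// head_size_qk and head_size_v differ.
#include <cstdint>

struct CachePos { uint64_t src; uint64_t dst; bool is_key; };

inline CachePos decode_cache_pos(uint32_t bias,            // 0 .. hidden_k + hidden_v
                                 uint32_t hidden_size_q,   // num_heads * head_size_qk
                                 uint32_t hidden_size_k,   // kv_num_heads * head_size_qk
                                 uint32_t head_size_qk,
                                 uint32_t head_size_v,
                                 uint32_t kv_num_heads,
                                 uint32_t block_size,
                                 uint32_t block_idx,       // block_tables[bi][seq / block_size]
                                 uint32_t block_offset,    // seq % block_size
                                 uint64_t token_row) {     // start_token_idx * hidden_size
  if (bias < hidden_size_k) {  // K half of the row -> key_cache
    const uint32_t hi = bias / head_size_qk, hb = bias % head_size_qk;
    const uint64_t dst =
        ((uint64_t)block_idx * kv_num_heads + hi) * block_size * head_size_qk +
        (uint64_t)block_offset * head_size_qk + hb;
    return {token_row + hidden_size_q + bias, dst, true};
  }
  const uint32_t vbias = bias - hidden_size_k;  // V half -> value_cache
  const uint32_t hi = vbias / head_size_v, hb = vbias % head_size_v;
  const uint64_t dst =
      ((uint64_t)block_idx * kv_num_heads + hi) * block_size * head_size_v +
      (uint64_t)block_offset * head_size_v + hb;
  return {token_row + hidden_size_q + hidden_size_k + vbias, dst, false};
}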
quant_qkv, // [bsz, num_heads + 2 * kv_num_heads, diff --git a/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu b/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu index ee0cd57e307c..08483feb2a5c 100644 --- a/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu +++ b/csrc/gpu/append_attn/decoder_write_cache_with_rope_kernel.cu @@ -15,6 +15,54 @@ #include "decoder_write_cache_with_rope_kernel.h" #include "utils.cuh" + +template +void DecoderWriteCacheKV(const AppendAttnMetaData& meta_data, + const paddle::Tensor& qkv, + const paddle::Tensor& seq_lens, + const paddle::Tensor& seq_lens_encoder, + const paddle::Tensor& padding_offsets, + const paddle::Tensor& cum_offsets, + const paddle::Tensor& block_tables, + const int max_seq_len, + cudaStream_t& stream, + paddle::Tensor* key_cache_out, + paddle::Tensor* value_cache_out) { + auto max_blocks_per_seq = meta_data.max_blocks_per_seq; + auto bsz = meta_data.batch_size; + auto block_size = meta_data.block_size; + auto head_dim_qk = meta_data.head_dims; + auto head_dim_v = meta_data.head_dims_v; + auto num_heads = meta_data.q_num_heads; + auto kv_num_heads = meta_data.kv_num_heads; + const uint32_t elem_nums = bsz * kv_num_heads * (head_dim_qk + head_dim_v); + + constexpr int PackSize = 16 / sizeof(T); + const int pack_num = elem_nums / PackSize; + const int blocksize = 128; + int grid_size = 1; + GetNumBlocks<128>(pack_num, &grid_size); + + append_decode_cache_T_kernel + <<>>( + reinterpret_cast(const_cast(qkv.data())), + reinterpret_cast(key_cache_out->data()), + reinterpret_cast(value_cache_out->data()), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + seq_lens_encoder.data(), + max_seq_len, + max_blocks_per_seq, + num_heads, + head_dim_qk, + head_dim_v, + block_size, + elem_nums, + kv_num_heads); +} + template void append_decode_cache_rope(const QKV_TYPE* qkv, T* key_cache, @@ -449,115 +497,125 @@ void DecoderWriteCacheWithRoPEKernel( auto num_heads = meta_data.q_num_heads; auto kv_num_heads = meta_data.kv_num_heads; - const float* cos_emb = - rotary_embs ? rotary_embs.get().data() : nullptr; - const float* sin_emb; if (rotary_embs) { - sin_emb = + const float* cos_emb = rotary_embs.get().data(); + const float* sin_emb = use_neox_rotary_style ? rotary_embs.get().data() + max_seq_len * dim_head : rotary_embs.get().data() + max_seq_len * dim_head / 2; - } - if (cache_quant_type_str == "none") { - append_decode_cache_rope( - reinterpret_cast(qkv_ptr), - reinterpret_cast(key_cache_out->data()), - reinterpret_cast(value_cache_out->data()), - reinterpret_cast(qkv_out->data()), - block_tables.data(), - padding_offsets.data(), - cum_offsets.data(), - seq_lens.data(), - seq_lens_encoder.data(), - cos_emb, - sin_emb, - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? reinterpret_cast( - const_cast(qkv_biases.get().data())) - : nullptr, - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads, - dim_head, - block_size, - bsz, - stream, - use_neox_rotary_style); - } else if (cache_quant_type_str == "cache_int8") { - append_decode_cache_int8_rope( - reinterpret_cast(qkv_ptr), - key_cache_out->data(), - value_cache_out->data(), - reinterpret_cast(qkv_out->data()), - block_tables.data(), - padding_offsets.data(), - cum_offsets.data(), - seq_lens.data(), - seq_lens_encoder.data(), - cos_emb, - sin_emb, - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? 
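// DecoderWriteCacheKV above moves K/V in 128-bit packets: PackSize =
// 16 / sizeof(T) elements per vector, and the grid is derived from the
// packet count via GetNumBlocks<128>. A minimal sketch of that sizing,
// assuming a simple cap on the grid; GetNumBlocks' exact policy may differ.
#include <algorithm>
#include <cstdint>

inline int write_cache_grid_size(uint64_t elem_cnt, int elem_bytes,
                                 int block_threads = 128,
                                 int max_blocks = 65535) {
  const int pack_size = 16 / elem_bytes;        // elements per 128-bit packet
  const uint64_t packs = elem_cnt / pack_size;  // elem_cnt is a multiple of pack_size
  const uint64_t blocks = (packs + block_threads - 1) / block_threads;
  return static_cast<int>(std::min<uint64_t>(blocks, static_cast<uint64_t>(max_blocks)));
}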
reinterpret_cast( - const_cast(qkv_biases.get().data())) - : nullptr, - cache_k_scale ? reinterpret_cast( - const_cast(cache_k_scale.get().data())) - : nullptr, - cache_v_scale ? reinterpret_cast( - const_cast(cache_v_scale.get().data())) - : nullptr, - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads, - dim_head, - block_size, - bsz, - stream, - use_neox_rotary_style); - } else if (cache_quant_type_str == "cache_int4_zp") { - append_decode_cache_int4_rope( - reinterpret_cast(qkv_ptr), - key_cache_out->data(), - value_cache_out->data(), - reinterpret_cast(const_cast(qkv_out->data())), - block_tables.data(), - padding_offsets.data(), - cum_offsets.data(), - seq_lens.data(), - seq_lens_encoder.data(), - cos_emb, - sin_emb, - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? reinterpret_cast( - const_cast(qkv_biases.get().data())) - : nullptr, - cache_k_scale ? reinterpret_cast( - const_cast(cache_k_scale.get().data())) - : nullptr, - cache_v_scale ? reinterpret_cast( - const_cast(cache_v_scale.get().data())) - : nullptr, - cache_k_zp ? reinterpret_cast( - const_cast(cache_k_zp.get().data())) - : nullptr, - cache_v_zp ? reinterpret_cast( - const_cast(cache_v_zp.get().data())) - : nullptr, - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads, - dim_head, - block_size, - bsz, - stream, - use_neox_rotary_style); + if (cache_quant_type_str == "none") { + append_decode_cache_rope( + reinterpret_cast(qkv_ptr), + reinterpret_cast(key_cache_out->data()), + reinterpret_cast(value_cache_out->data()), + reinterpret_cast(qkv_out->data()), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + seq_lens_encoder.data(), + cos_emb, + sin_emb, + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? reinterpret_cast( + const_cast(qkv_biases.get().data())) + : nullptr, + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads, + dim_head, + block_size, + bsz, + stream, + use_neox_rotary_style); + } else if (cache_quant_type_str == "cache_int8") { + append_decode_cache_int8_rope( + reinterpret_cast(qkv_ptr), + key_cache_out->data(), + value_cache_out->data(), + reinterpret_cast(qkv_out->data()), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + seq_lens_encoder.data(), + cos_emb, + sin_emb, + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? reinterpret_cast( + const_cast(qkv_biases.get().data())) + : nullptr, + cache_k_scale ? reinterpret_cast( + const_cast(cache_k_scale.get().data())) + : nullptr, + cache_v_scale ? reinterpret_cast( + const_cast(cache_v_scale.get().data())) + : nullptr, + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads, + dim_head, + block_size, + bsz, + stream, + use_neox_rotary_style); + } else if (cache_quant_type_str == "cache_int4_zp") { + append_decode_cache_int4_rope( + reinterpret_cast(qkv_ptr), + key_cache_out->data(), + value_cache_out->data(), + reinterpret_cast(const_cast(qkv_out->data())), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + seq_lens_encoder.data(), + cos_emb, + sin_emb, + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? reinterpret_cast( + const_cast(qkv_biases.get().data())) + : nullptr, + cache_k_scale ? reinterpret_cast( + const_cast(cache_k_scale.get().data())) + : nullptr, + cache_v_scale ? reinterpret_cast( + const_cast(cache_v_scale.get().data())) + : nullptr, + cache_k_zp ? 
reinterpret_cast( + const_cast(cache_k_zp.get().data())) + : nullptr, + cache_v_zp ? reinterpret_cast( + const_cast(cache_v_zp.get().data())) + : nullptr, + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads, + dim_head, + block_size, + bsz, + stream, + use_neox_rotary_style); + } else { + PD_THROW( + "cache_quant_type_str should be one of [none, cache_int8, " + "cache_int4_zp]"); + } } else { - PD_THROW( - "cache_quant_type_str should be one of [none, cache_int8, " - "cache_int4_zp]"); + DecoderWriteCacheKV(meta_data, + qkv, + seq_lens, + seq_lens_encoder, + padding_offsets, + cum_offsets, + block_tables, + max_seq_len, + stream, + key_cache_out, + value_cache_out); } } diff --git a/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh b/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh index eef8bcf2038d..c1dd09d4b3de 100644 --- a/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh +++ b/csrc/gpu/append_attn/encoder_write_cache_with_rope_impl.cuh @@ -405,19 +405,18 @@ __global__ void GQAVariableLengthRotaryKernel( } template -__global__ void GQAVariableLengthRotaryKernel( - const T *qkv, - const float *cos_emb, - const float *sin_emb, - const int *padding_offsets, - const int *seq_lens, - const int *seq_lens_decoder, - T *qkv_out, - const int64_t elem_cnt, - const int q_num_head, - const int kv_num_head, - const int seq_len, - const int last_dim) { +__global__ void GQAVariableLengthRotaryKernel(const T *qkv, + const float *cos_emb, + const float *sin_emb, + const int *padding_offsets, + const int *seq_lens, + const int *seq_lens_decoder, + T *qkv_out, + const int64_t elem_cnt, + const int q_num_head, + const int kv_num_head, + const int seq_len, + const int last_dim) { using LoadT = AlignedVector; constexpr int HalfVecSize = VecSize / 2; using LoadEmbT = AlignedVector; @@ -555,21 +554,20 @@ __global__ void GQANeoxVariableLengthRotaryKernel( } template -__global__ void GQANeoxVariableLengthRotaryKernel( - const T *qkv, - const float *cos_emb, - const float *sin_emb, - const int *padding_offsets, - const int *seq_lens, - const int *seq_lens_decoder, - const float *qkv_out_scales, - const T *qkv_biases, - T *qkv_out, - const int64_t elem_cnt, - const int q_num_head, - const int kv_num_head, - const int seq_len, - const int last_dim) { +__global__ void GQANeoxVariableLengthRotaryKernel(const T *qkv, + const float *cos_emb, + const float *sin_emb, + const int *padding_offsets, + const int *seq_lens, + const int *seq_lens_decoder, + const float *qkv_out_scales, + const T *qkv_biases, + T *qkv_out, + const int64_t elem_cnt, + const int q_num_head, + const int kv_num_head, + const int seq_len, + const int last_dim) { using LoadT = AlignedVector; using LoadEmbT = AlignedVector; LoadT left_vec; @@ -634,7 +632,8 @@ __global__ void cache_kernel( const int max_seq_len, const int max_blocks_per_seq, const int num_heads, - const int head_size, + const int head_size_qk, + const int head_size_v, const int block_size, const uint32_t elem_cnt, const int kv_num_heads) { @@ -642,24 +641,21 @@ __global__ void cache_kernel( LoadT src_vec; uint32_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x; - const uint32_t hidden_size = kv_num_heads * head_size; - const uint32_t offset = 2 * hidden_size; + const uint32_t hidden_size_q = num_heads * head_size_qk; + const uint32_t hidden_size_k = kv_num_heads * head_size_qk; + const uint32_t hidden_size_v = kv_num_heads * head_size_v; + const uint32_t offset = hidden_size_k + hidden_size_v; for (uint32_t linear_index = 
global_thread_idx * VecSize, step = gridDim.x * blockDim.x * VecSize; linear_index < elem_cnt; linear_index += step) { const uint32_t token_idx = linear_index / offset; const uint32_t bias = linear_index % offset; - const uint32_t qkv_id = bias / hidden_size; // skip q - const uint32_t qkv_bias = bias % hidden_size; - const uint32_t hi = qkv_bias / head_size; - const uint32_t h_bias = qkv_bias % head_size; const uint32_t ori_token_idx = token_idx + padding_offsets[token_idx]; const uint32_t ori_bi = ori_token_idx / max_seq_len; if (seq_lens[ori_bi] == 0) continue; const uint32_t ori_seq_id = ori_token_idx % max_seq_len + seq_lens_decoder[ori_bi]; - const int32_t *block_table_now = nullptr; block_table_now = block_tables + ori_bi * max_blocks_per_seq; @@ -667,16 +663,29 @@ __global__ void cache_kernel( const uint32_t block_idx = block_table_now[ori_seq_id / block_size]; const uint32_t block_offset = ori_seq_id % block_size; - const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size + - hi * block_size * head_size + - block_offset * head_size + h_bias; - const uint32_t ori_idx = - token_idx * (num_heads + 2 * kv_num_heads) * head_size + - num_heads * head_size + qkv_id * hidden_size + hi * head_size + h_bias; - Load(&qkv[ori_idx], &src_vec); - if (qkv_id == 0) { + if (bias < hidden_size_k) { + const uint32_t qkv_bias = bias; + const uint32_t hi = qkv_bias / head_size_qk; + const uint32_t h_bias = qkv_bias % head_size_qk; + const uint32_t tgt_idx = + block_idx * kv_num_heads * block_size * head_size_qk + + hi * block_size * head_size_qk + block_offset * head_size_qk + h_bias; + const uint32_t ori_idx = + token_idx * (hidden_size_q + hidden_size_k + hidden_size_v) + + hidden_size_q + qkv_bias; + Load(&qkv[ori_idx], &src_vec); Store(src_vec, &key_cache[tgt_idx]); } else { + const uint32_t qkv_bias = bias - hidden_size_k; + const uint32_t hi = qkv_bias / head_size_v; + const uint32_t h_bias = qkv_bias % head_size_v; + const uint32_t tgt_idx = + block_idx * kv_num_heads * block_size * head_size_v + + hi * block_size * head_size_v + block_offset * head_size_v + h_bias; + const uint32_t ori_idx = + token_idx * (hidden_size_q + hidden_size_k + hidden_size_v) + + hidden_size_q + hidden_size_k + qkv_bias; + Load(&qkv[ori_idx], &src_vec); Store(src_vec, &value_cache[tgt_idx]); } } @@ -736,8 +745,11 @@ __global__ void append_write_cache_kv_c8_qkv( batch_id * max_seq_len - cum_offsets[batch_id]; const uint32_t kv_batch_stride = (num_heads + 2 * kv_num_heads) * HEAD_DIM; const uint32_t kv_h_stride = HEAD_DIM; - __shared__ T k_smem_ori[num_rows_per_block * HEAD_DIM]; - __shared__ T v_smem_ori[num_rows_per_block * HEAD_DIM]; + extern __shared__ uint8_t smem[]; + T *k_smem_ori = (T *)smem; // [num_rows_per_block * HEAD_DIM]; + T *v_smem_ori = + (T *)(smem + num_rows_per_block * HEAD_DIM * + sizeof(T)); // [num_rows_per_block * HEAD_DIM]; smem_t k_smem(k_smem_ori); smem_t v_smem(v_smem_ori); @@ -983,12 +995,22 @@ __global__ void append_write_cache_kv_c4_qkv( batch_id * max_seq_len - cum_offsets[batch_id]; const uint32_t kv_batch_stride = (num_heads + 2 * kv_num_heads) * HEAD_DIM; const uint32_t kv_h_stride = HEAD_DIM; - __shared__ T k_smem_ori[num_rows_per_block * HEAD_DIM]; - __shared__ T v_smem_ori[num_rows_per_block * HEAD_DIM]; - __shared__ T k_scale_smem[HEAD_DIM]; - __shared__ T v_scale_smem[HEAD_DIM]; - __shared__ T k_zero_point_smem[HEAD_DIM]; - __shared__ T v_zero_point_smem[HEAD_DIM]; + extern __shared__ uint8_t smem[]; + T *k_smem_ori = (T *)smem; // [num_rows_per_block * 
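// The fixed __shared__ arrays in the write-cache kernels are replaced here
// by a single extern __shared__ byte buffer carved into typed sub-buffers at
// successive offsets, so the tile footprint becomes a launch-time quantity.
// A minimal CUDA sketch of the carving pattern (sizes are illustrative):
#include <cstdint>

template <typename T, int ROWS, int HEAD_DIM>
__global__ void carve_dynamic_smem_example() {
  extern __shared__ uint8_t smem[];
  T *k_tile  = reinterpret_cast<T *>(smem);                                   // ROWS * HEAD_DIM
  T *v_tile  = reinterpret_cast<T *>(smem + ROWS * HEAD_DIM * sizeof(T));     // ROWS * HEAD_DIM
  T *k_scale = reinterpret_cast<T *>(smem + 2 * ROWS * HEAD_DIM * sizeof(T)); // HEAD_DIM
  T *v_scale = k_scale + HEAD_DIM;
  T *k_zp    = v_scale + HEAD_DIM;
  T *v_zp    = k_zp + HEAD_DIM;
  (void)k_tile; (void)v_tile; (void)v_zp;  // the real kernels fill these tiles
}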
HEAD_DIM]; + T *v_smem_ori = + (T *)(smem + num_rows_per_block * HEAD_DIM * + sizeof(T)); // [num_rows_per_block * HEAD_DIM]; + T *k_scale_smem = (T *)(smem + num_rows_per_block * HEAD_DIM * 2 * + sizeof(T)); // [HEAD_DIM]; + T *v_scale_smem = + (T *)(smem + (num_rows_per_block * HEAD_DIM * 2 + HEAD_DIM) * + sizeof(T)); // [HEAD_DIM]; + T *k_zero_point_smem = + (T *)(smem + (num_rows_per_block * HEAD_DIM * 2 + HEAD_DIM * 2) * + sizeof(T)); // [HEAD_DIM]; + T *v_zero_point_smem = + (T *)(smem + (num_rows_per_block * HEAD_DIM * 2 + HEAD_DIM * 3) * + sizeof(T)); // [HEAD_DIM]; const T *cache_k_scale_now = cache_k_scales + kv_head_idx * HEAD_DIM; const T *cache_k_zp_now = cache_k_zero_points + kv_head_idx * HEAD_DIM; const T *cache_v_scale_now = cache_v_scales + kv_head_idx * HEAD_DIM; @@ -1033,16 +1055,10 @@ __global__ void append_write_cache_kv_c4_qkv( for (uint32_t fy = 0; fy < num_frags_y / 4; ++fy) { // (num_frags_y * 16) / (8 * num_elems_per_128b()) if (chunk_start >= start_len && chunk_start < end_len) { - k_smem - .load_128b_async( - kv_smem_offset_w, - qkv_input + k_read_idx, - chunk_start < end_len); - v_smem - .load_128b_async( - kv_smem_offset_w, - qkv_input + v_read_idx, - chunk_start < end_len); + k_smem.load_128b_async( + kv_smem_offset_w, qkv_input + k_read_idx, chunk_start < end_len); + v_smem.load_128b_async( + kv_smem_offset_w, qkv_input + v_read_idx, chunk_start < end_len); } kv_smem_offset_w = k_smem.advance_offset_by_column<8>(kv_smem_offset_w, fy); @@ -1248,9 +1264,8 @@ void rotary_qk_variable( const int dim_head, const cudaStream_t &stream, bool use_neox_style = false) { - int64_t elem_nums = - qkv_out_scales ? token_num * 3 * head_num * dim_head - : token_num * 2 * head_num * dim_head; + int64_t elem_nums = qkv_out_scales ? token_num * 3 * head_num * dim_head + : token_num * 2 * head_num * dim_head; if (use_neox_style) { elem_nums /= 2; } @@ -1458,11 +1473,12 @@ void CascadeAppendWriteCacheKVQKV( auto num_tokens = meta_data.token_nums; auto num_heads = meta_data.q_num_heads; auto kv_num_heads = meta_data.kv_num_heads; - auto head_dim = meta_data.head_dims; + auto head_dim_qk = meta_data.head_dims; + auto head_dim_v = meta_data.head_dims_v; auto block_size = meta_data.block_size; const uint32_t elem_nums = - num_tokens * 2 * kv_num_heads * head_dim; + num_tokens * kv_num_heads * (head_dim_qk + head_dim_v); constexpr int PackSize = 16 / sizeof(T); const int pack_num = elem_nums / PackSize; const int blocksize = 128; @@ -1479,7 +1495,8 @@ void CascadeAppendWriteCacheKVQKV( max_seq_len, max_blocks_per_seq, num_heads, - head_dim, + head_dim_qk, + head_dim_v, block_size, elem_nums, kv_num_heads); @@ -1511,7 +1528,6 @@ void CascadeAppendWriteCacheKVC8QKV( auto num_tokens = meta_data.token_nums; auto num_heads = meta_data.q_num_heads; auto kv_num_heads = meta_data.kv_num_heads; - auto head_dim = meta_data.head_dims; const uint32_t pad_len = BLOCK_SIZE; @@ -1530,24 +1546,27 @@ void CascadeAppendWriteCacheKVC8QKV( HEAD_DIM, BLOCK_SIZE, num_warps>; - cudaFuncSetAttribute( - kernel_fn, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size); - kernel_fn<<>>(cache_k_out->data(), - cache_v_out->data(), - qkv.data(), - cache_k_scale.data(), - cache_v_scale.data(), - batch_ids.data(), - tile_ids_per_batch.data(), - seq_lens_this_time.data(), - seq_lens_decoder.data(), - padding_offsets.data(), - cum_offsets.data(), - block_table.data(), - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads); + if (smem_size >= 48 * 1024) { + cudaFuncSetAttribute( + kernel_fn, 
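// The write-cache launchers here only raise the dynamic shared-memory limit
// when the request exceeds the default 48 KB per block, then pass smem_size
// at launch. A minimal, self-contained sketch of that pattern (the kernel is
// a placeholder, not one from this patch):
#include <cuda_runtime.h>
#include <cstdint>

__global__ void large_smem_placeholder_kernel() {
  extern __shared__ uint8_t smem[];
  smem[threadIdx.x] = 0;
}

inline void launch_large_smem(dim3 grid, dim3 block, size_t smem_size,
                              cudaStream_t stream) {
  if (smem_size >= 48 * 1024) {
    cudaFuncSetAttribute(large_smem_placeholder_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_size));
  }
  large_smem_placeholder_kernel<<<grid, block, smem_size, stream>>>();
}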
cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size); + } + kernel_fn<<>>( + cache_k_out->data(), + cache_v_out->data(), + qkv.data(), + cache_k_scale.data(), + cache_v_scale.data(), + batch_ids.data(), + tile_ids_per_batch.data(), + seq_lens_this_time.data(), + seq_lens_decoder.data(), + padding_offsets.data(), + cum_offsets.data(), + block_table.data(), + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads); } template @@ -1578,7 +1597,6 @@ void CascadeAppendWriteCacheKVC4QKV( auto num_tokens = meta_data.token_nums; auto num_heads = meta_data.q_num_heads; auto kv_num_heads = meta_data.kv_num_heads; - auto head_dim = meta_data.head_dims; const uint32_t pad_len = BLOCK_SIZE; @@ -1598,24 +1616,27 @@ void CascadeAppendWriteCacheKVC4QKV( HEAD_DIM, BLOCK_SIZE, num_warps>; - cudaFuncSetAttribute( - kernel_fn, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size); - kernel_fn<<>>(cache_k_out->data(), - cache_v_out->data(), - qkv.data(), - cache_k_scale.data(), - cache_v_scale.data(), - cache_k_zp.data(), - cache_v_zp.data(), - batch_ids.data(), - tile_ids_per_batch.data(), - seq_lens_this_time.data(), - seq_lens_decoder.data(), - padding_offsets.data(), - cum_offsets.data(), - block_table.data(), - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads); + if (smem_size >= 48 * 1024) { + cudaFuncSetAttribute( + kernel_fn, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size); + } + kernel_fn<<>>( + cache_k_out->data(), + cache_v_out->data(), + qkv.data(), + cache_k_scale.data(), + cache_v_scale.data(), + cache_k_zp.data(), + cache_v_zp.data(), + batch_ids.data(), + tile_ids_per_batch.data(), + seq_lens_this_time.data(), + seq_lens_decoder.data(), + padding_offsets.data(), + cum_offsets.data(), + block_table.data(), + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads); } \ No newline at end of file diff --git a/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h b/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h index 6a14eaf3dde4..3c2f1100964a 100644 --- a/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h +++ b/csrc/gpu/append_attn/encoder_write_cache_with_rope_kernel.h @@ -48,43 +48,45 @@ void EncoderWriteCacheWithRopeKernel( auto num_heads = meta_data.q_num_heads; auto kv_num_heads = meta_data.kv_num_heads; auto head_dim = meta_data.head_dims; - - if (num_heads == kv_num_heads) { - rotary_qk_variable( - qkv_out->data(), - qkv.data(), - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? qkv_biases.get().data() : nullptr, - rotary_embs.get().data(), - padding_offsets.data(), - seq_lens_encoder.data(), - seq_lens_decoder.data(), - token_num, - num_heads, - max_seq_len, - rotary_embs.get().dims()[2], - head_dim, - stream, - use_neox_style); - } else { - gqa_rotary_qk_variable( - qkv_out->data(), - qkv.data(), - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? qkv_biases.get().data() : nullptr, - rotary_embs.get().data(), - padding_offsets.data(), - seq_lens_encoder.data(), - seq_lens_decoder.data(), - token_num, - num_heads, - kv_num_heads, - max_seq_len, - rotary_embs.get().dims()[2], - head_dim, - stream, - use_neox_style); + if (rotary_embs) { + if (num_heads == kv_num_heads) { + rotary_qk_variable( + qkv_out->data(), + qkv.data(), + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? 
qkv_biases.get().data() : nullptr, + rotary_embs.get().data(), + padding_offsets.data(), + seq_lens_encoder.data(), + seq_lens_decoder.data(), + token_num, + num_heads, + max_seq_len, + rotary_embs.get().dims()[2], + head_dim, + stream, + use_neox_style); + } else { + gqa_rotary_qk_variable( + qkv_out->data(), + qkv.data(), + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? qkv_biases.get().data() : nullptr, + rotary_embs.get().data(), + padding_offsets.data(), + seq_lens_encoder.data(), + seq_lens_decoder.data(), + token_num, + num_heads, + kv_num_heads, + max_seq_len, + rotary_embs.get().dims()[2], + head_dim, + stream, + use_neox_style); + } } + const uint32_t block_size = meta_data.block_size; if (cache_quant_type_str == "none") { CascadeAppendWriteCacheKVQKV(meta_data, diff --git a/csrc/gpu/append_attn/get_block_shape_and_split_kv_block.cu b/csrc/gpu/append_attn/get_block_shape_and_split_kv_block.cu index 7cf9ab9068eb..1dea30b8c5b3 100644 --- a/csrc/gpu/append_attn/get_block_shape_and_split_kv_block.cu +++ b/csrc/gpu/append_attn/get_block_shape_and_split_kv_block.cu @@ -102,40 +102,21 @@ std::vector GetBlockShapeAndSplitKVBlock( const paddle::Tensor& seq_lens_encoder, const paddle::Tensor& seq_lens_decoder, const paddle::Tensor& max_enc_len_this_time, + const paddle::Tensor& max_dec_len_this_time, const paddle::Tensor& seq_lens_this_time, const paddle::Tensor& cum_offsets, const int group_size, const int block_size, const int decoder_step_token_num) { - auto stream = seq_lens_encoder.stream(); + paddle::Tensor encoder_batch_ids, encoder_tile_ids_per_batch, encoder_num_blocks_x_cpu, + kv_batch_ids, kv_tile_ids_per_batch, kv_num_blocks_x_cpu, decoder_batch_ids, + decoder_tile_ids_per_batch, decoder_num_blocks_x_cpu; + auto stream = seq_lens_this_time.stream(); int bsz = cum_offsets.shape()[0]; const int encoder_block_shape_q = get_encoder_block_shape_q(); const int decoder_block_shape_q = get_decoder_block_shape_q(); - // decoder - const uint32_t decoder_max_tile_size_per_bs_q = - div_up((decoder_step_token_num * group_size), decoder_block_shape_q); - auto decoder_batch_ids = - GetEmptyTensor({bsz * decoder_max_tile_size_per_bs_q}, - paddle::DataType::INT32, - seq_lens_encoder.place()); - auto decoder_tile_ids_per_batch = - GetEmptyTensor({bsz * decoder_max_tile_size_per_bs_q}, - paddle::DataType::INT32, - seq_lens_encoder.place()); - auto decoder_num_blocks_x = - GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_encoder.place()); - split_q_block<<<1, 32, 0, stream>>>(seq_lens_this_time.data(), - seq_lens_encoder.data(), - decoder_batch_ids.data(), - decoder_tile_ids_per_batch.data(), - decoder_num_blocks_x.data(), - bsz, - decoder_block_shape_q, - group_size); - auto decoder_num_blocks_x_cpu = - decoder_num_blocks_x.copy_to(paddle::CPUPlace(), false); - + // max_len auto max_len_kv = GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_decoder.place()); get_max_len_kv_ernel<128><<<1, 128, 0, stream>>>( @@ -147,77 +128,100 @@ std::vector GetBlockShapeAndSplitKVBlock( auto max_len_kv_cpu = max_len_kv.copy_to(paddle::CPUPlace(), false); + // decoder + int max_dec_len_this_time_data = max_dec_len_this_time.data()[0]; + if (max_dec_len_this_time_data > 0) { + const uint32_t decoder_max_tile_size_per_bs_q = + div_up((decoder_step_token_num * group_size), decoder_block_shape_q); + decoder_batch_ids = + GetEmptyTensor({bsz * decoder_max_tile_size_per_bs_q}, + paddle::DataType::INT32, + seq_lens_encoder.place()); + decoder_tile_ids_per_batch = + GetEmptyTensor({bsz * 
decoder_max_tile_size_per_bs_q}, + paddle::DataType::INT32, + seq_lens_encoder.place()); + auto decoder_num_blocks_x = + GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_encoder.place()); + split_q_block<<<1, 32, 0, stream>>>(seq_lens_this_time.data(), + seq_lens_encoder.data(), + decoder_batch_ids.data(), + decoder_tile_ids_per_batch.data(), + decoder_num_blocks_x.data(), + bsz, + decoder_block_shape_q, + group_size); + decoder_num_blocks_x_cpu = + decoder_num_blocks_x.copy_to(paddle::CPUPlace(), false); + } else { + decoder_batch_ids = + paddle::full({1}, -1, paddle::DataType::INT32, paddle::GPUPlace()); + decoder_tile_ids_per_batch = + paddle::full({1}, -1, paddle::DataType::INT32, paddle::GPUPlace()); + decoder_num_blocks_x_cpu = + paddle::full({1}, -1, paddle::DataType::INT32, paddle::CPUPlace()); + } + + // encoder int max_enc_len_this_time_data = max_enc_len_this_time.data()[0]; - if (max_enc_len_this_time_data <= 0) { - auto encoder_batch_ids = + if (max_enc_len_this_time_data > 0) { + const uint32_t encoder_max_tile_size_per_bs_q = div_up( + (max_enc_len_this_time_data * group_size), encoder_block_shape_q); + encoder_batch_ids = + GetEmptyTensor({bsz * encoder_max_tile_size_per_bs_q}, + paddle::DataType::INT32, + seq_lens_encoder.place()); + encoder_tile_ids_per_batch = + GetEmptyTensor({bsz * encoder_max_tile_size_per_bs_q}, + paddle::DataType::INT32, + seq_lens_encoder.place()); + auto encoder_num_blocks_x = + GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_encoder.place()); + split_q_block<<<1, 32, 0, stream>>>(seq_lens_encoder.data(), + nullptr, + encoder_batch_ids.data(), + encoder_tile_ids_per_batch.data(), + encoder_num_blocks_x.data(), + bsz, + encoder_block_shape_q, + group_size); + encoder_num_blocks_x_cpu = + encoder_num_blocks_x.copy_to(paddle::CPUPlace(), false); + + // kv + const uint32_t max_tile_size_per_bs_kv = + div_up(max_enc_len_this_time_data, block_size); + kv_batch_ids = GetEmptyTensor({bsz * max_tile_size_per_bs_kv}, + paddle::DataType::INT32, + seq_lens_encoder.place()); + kv_tile_ids_per_batch = GetEmptyTensor({bsz * max_tile_size_per_bs_kv}, + paddle::DataType::INT32, + seq_lens_encoder.place()); + auto kv_num_blocks_x = + GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_encoder.place()); + split_kv_block<<<1, 32, 0, stream>>>(seq_lens_decoder.data(), + seq_lens_encoder.data(), + kv_batch_ids.data(), + kv_tile_ids_per_batch.data(), + kv_num_blocks_x.data(), + bsz, + block_size, + block_size); + kv_num_blocks_x_cpu = kv_num_blocks_x.copy_to(paddle::CPUPlace(), false); + } else { + encoder_batch_ids = paddle::full({1}, -1, paddle::DataType::INT32, paddle::GPUPlace()); - auto encoder_tile_ids_per_batch = + encoder_tile_ids_per_batch = paddle::full({1}, -1, paddle::DataType::INT32, paddle::GPUPlace()); - auto encoder_num_blocks_x_cpu = + encoder_num_blocks_x_cpu = paddle::full({1}, -1, paddle::DataType::INT32, paddle::CPUPlace()); - auto kv_batch_ids = + kv_batch_ids = paddle::full({1}, -1, paddle::DataType::INT32, paddle::GPUPlace()); - auto kv_tile_ids_per_batch = + kv_tile_ids_per_batch = paddle::full({1}, -1, paddle::DataType::INT32, paddle::GPUPlace()); - auto kv_num_blocks_x_cpu = + kv_num_blocks_x_cpu = paddle::full({1}, -1, paddle::DataType::INT32, paddle::CPUPlace()); - - return {encoder_batch_ids, - encoder_tile_ids_per_batch, - encoder_num_blocks_x_cpu, /*cpu*/ - kv_batch_ids, - kv_tile_ids_per_batch, - kv_num_blocks_x_cpu, /*cpu*/ - decoder_batch_ids, - decoder_tile_ids_per_batch, - decoder_num_blocks_x_cpu, /*cpu*/ - 
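// GetBlockShapeAndSplitKVBlock now also receives max_dec_len_this_time and
// only builds the decoder (and, symmetrically, encoder/kv) tile bookkeeping
// when that phase actually has work, otherwise returning -1-filled
// placeholders. A minimal sketch of the tile-count arithmetic behind those
// allocations; the concrete numbers are illustrative, not values from this
// patch.
#include <cstdint>

constexpr uint32_t div_up(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Per-batch upper bound on decoder Q tiles: e.g. 5 speculative draft tokens
// with GQA group size 8 and a decoder block_shape_q of 16 need
// div_up(5 * 8, 16) = 3 tiles, so batch_ids / tile_ids_per_batch are sized
// with bsz * 3 entries.
static_assert(div_up(5 * 8, 16) == 3, "decoder tile bound example");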
max_len_kv_cpu /*cpu*/}; } - - // encoder - const uint32_t encoder_max_tile_size_per_bs_q = div_up( - (max_enc_len_this_time_data * group_size), encoder_block_shape_q); - auto encoder_batch_ids = - GetEmptyTensor({bsz * encoder_max_tile_size_per_bs_q}, - paddle::DataType::INT32, - seq_lens_encoder.place()); - auto encoder_tile_ids_per_batch = - GetEmptyTensor({bsz * encoder_max_tile_size_per_bs_q}, - paddle::DataType::INT32, - seq_lens_encoder.place()); - auto encoder_num_blocks_x = - GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_encoder.place()); - split_q_block<<<1, 32, 0, stream>>>(seq_lens_encoder.data(), - nullptr, - encoder_batch_ids.data(), - encoder_tile_ids_per_batch.data(), - encoder_num_blocks_x.data(), - bsz, - encoder_block_shape_q, - group_size); - auto encoder_num_blocks_x_cpu = - encoder_num_blocks_x.copy_to(paddle::CPUPlace(), false); - - // kv - const uint32_t max_tile_size_per_bs_kv = - div_up(max_enc_len_this_time_data, block_size); - auto kv_batch_ids = GetEmptyTensor({bsz * max_tile_size_per_bs_kv}, - paddle::DataType::INT32, - seq_lens_encoder.place()); - auto kv_tile_ids_per_batch = GetEmptyTensor({bsz * max_tile_size_per_bs_kv}, - paddle::DataType::INT32, - seq_lens_encoder.place()); - auto kv_num_blocks_x = - GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_encoder.place()); - split_kv_block<<<1, 32, 0, stream>>>(seq_lens_decoder.data(), - seq_lens_encoder.data(), - kv_batch_ids.data(), - kv_tile_ids_per_batch.data(), - kv_num_blocks_x.data(), - bsz, - block_size, - block_size); - auto kv_num_blocks_x_cpu = kv_num_blocks_x.copy_to(paddle::CPUPlace(), false); return {encoder_batch_ids, encoder_tile_ids_per_batch, encoder_num_blocks_x_cpu, /*cpu*/ @@ -234,6 +238,7 @@ std::vector GetBlockShapeAndSplitKVBlockInferDtype( const paddle::DataType& seq_lens_encoder_dtype, const paddle::DataType& seq_lens_decoder_dtype, const paddle::DataType& max_enc_len_this_time_dtype, + const paddle::DataType& max_dec_len_this_time_dtype, const paddle::DataType& seq_lens_this_time_dtype, const paddle::DataType& cum_offsets_dtype) { return {paddle::DataType::INT32, @@ -252,6 +257,7 @@ std::vector> GetBlockShapeAndSplitKVBlockInferShape( const std::vector& seq_lens_encoder_shape, const std::vector& seq_lens_decoder_shape, const std::vector& max_enc_len_this_time_shape, + const std::vector& max_dec_len_this_time_shape, const std::vector& seq_lens_this_time_shape, const std::vector& cum_offsets_shape) { std::vector dynamic_shape = {-1}; @@ -272,6 +278,7 @@ PD_BUILD_OP(get_block_shape_and_split_kv_block) .Inputs({"seq_lens_encoder", "seq_lens_decoder", "max_enc_len_this_time", + "max_dec_len_this_time", "seq_lens_this_time", "cum_offsets"}) .Outputs({"encoder_batch_ids", diff --git a/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh b/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh index 50fa4e458e9a..7940dd3f94d3 100644 --- a/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh +++ b/csrc/gpu/append_attn/speculate_write_cache_with_rope_impl.cuh @@ -301,6 +301,96 @@ __global__ void append_speculate_cache_rope_kernel( } } +template +__global__ void append_speculate_cache_kernel( + const T* __restrict__ qkv, // [bsz, num_heads + 2 * kv_num_heads, + // head_size] + T* __restrict__ key_cache, // [num_blocks, kv_num_heads, block_size, + // head_size // 2] + T* __restrict__ value_cache, // [num_blocks, kv_num_heads, block_size, + // head_size // 2] + const int* __restrict__ block_tables, // [bsz, max_blocks_per_seq] + const int* __restrict__ 
padding_offsets, // [num_tokens] + const int* __restrict__ cum_offsets, + const int* __restrict__ seq_lens_decoder, // [bsz] + const int max_seq_len, + const int max_blocks_per_seq, + const int num_heads, + const int head_size_qk, + const int head_size_v, + const int block_size, + const uint32_t elem_cnt, + const int kv_num_heads) { + using LoadT = AlignedVector; + constexpr int HalfVecSize = VecSize / 2; + LoadT src_vec; + + int64_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x; + // const int64_t hidden_size = (num_heads + 2 * kv_num_heads) * head_size; + const uint32_t hidden_size_q = num_heads * head_size_qk; + const uint32_t hidden_size_k = kv_num_heads * head_size_qk; + const uint32_t hidden_size_v = kv_num_heads * head_size_v; + const int64_t hidden_size = hidden_size_q + hidden_size_k + hidden_size_v; + const uint32_t offset = kv_num_heads * (head_size_qk + head_size_v); + // const int64_t offset = 2 * hidden_size; + // const int half_head_size = head_size / 2; + for (int32_t linear_index = global_thread_idx * VecSize, + step = gridDim.x * blockDim.x * VecSize; + linear_index < elem_cnt; + linear_index += step) { + const int token_id = linear_index / offset; + const int ori_bi = (token_id + padding_offsets[token_id]) / max_seq_len; + if (seq_lens_decoder[ori_bi] == 0) continue; + const int bias = linear_index % offset; + const int start_token_idx = ori_bi * max_seq_len - cum_offsets[ori_bi]; + const int write_seq_id = + seq_lens_decoder[ori_bi] + token_id - start_token_idx;; + if (write_seq_id == 0) continue; + + const int* block_table_now = nullptr; + block_table_now = block_tables + ori_bi * max_blocks_per_seq; + const int block_idx = block_table_now[write_seq_id / block_size]; + if (block_idx < 0) { + printf( + "Fatal Error!!!, block idx %d when write_seq_id is %d\n some key var " + "%d %d %d %d\n", + block_idx, + write_seq_id, + ori_bi, + seq_lens_decoder[ori_bi], + token_id, + cum_offsets[ori_bi]); + } + const int block_offset = write_seq_id % block_size; + + if (bias < hidden_size_k) { + const uint32_t qkv_bias = bias; + const uint32_t hi = qkv_bias / head_size_qk; + const uint32_t h_bias = qkv_bias % head_size_qk; + const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_qk + + hi * block_size * head_size_qk + + block_offset * head_size_qk + h_bias; + const uint32_t ori_idx = + token_id * hidden_size + + hidden_size_q + qkv_bias; + Load(&qkv[ori_idx], &src_vec); + Store(src_vec, &key_cache[tgt_idx]); + } else { + const uint32_t qkv_bias = bias - hidden_size_k; + const uint32_t hi = qkv_bias / head_size_v; + const uint32_t h_bias = qkv_bias % head_size_v; + const uint32_t tgt_idx = block_idx * kv_num_heads * block_size * head_size_v + + hi * block_size * head_size_v + + block_offset * head_size_v + h_bias; + const uint32_t ori_idx = + token_id * hidden_size + + hidden_size_q + hidden_size_k + qkv_bias; + Load(&qkv[ori_idx], &src_vec); + Store(src_vec, &value_cache[tgt_idx]); + } + } +} + template __global__ void append_speculate_cache_neox_rope_kernel( const InT* __restrict__ qkv, // [token_num, num_heads + 2 * gqa_group_size, diff --git a/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu b/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu index 588442183d1d..9aab503972d5 100644 --- a/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu +++ b/csrc/gpu/append_attn/speculate_write_cache_with_rope_kernel.cu @@ -15,6 +15,52 @@ #include "speculate_write_cache_with_rope_kernel.h" #include "utils.cuh" +template +void 
SpeculateWriteCacheKV(const AppendAttnMetaData& meta_data, + const paddle::Tensor& qkv, + const paddle::Tensor& seq_lens, + const paddle::Tensor& padding_offsets, + const paddle::Tensor& cum_offsets, + const paddle::Tensor& block_tables, + const int max_seq_len, + cudaStream_t& stream, + paddle::Tensor* key_cache_out, + paddle::Tensor* value_cache_out) { + auto max_blocks_per_seq = meta_data.max_blocks_per_seq; + auto bsz = meta_data.batch_size; + auto block_size = meta_data.block_size; + auto head_dim_qk = meta_data.head_dims; + auto head_dim_v = meta_data.head_dims_v; + auto num_heads = meta_data.q_num_heads; + auto kv_num_heads = meta_data.kv_num_heads; + auto token_num = meta_data.token_nums; + const uint32_t elem_nums = token_num * kv_num_heads * (head_dim_qk + head_dim_v); + + constexpr int PackSize = 16 / sizeof(T); + const int pack_num = elem_nums / PackSize; + const int blocksize = 128; + int grid_size = 1; + GetNumBlocks<128>(pack_num, &grid_size); + + append_speculate_cache_kernel + <<>>( + reinterpret_cast(const_cast(qkv.data())), + reinterpret_cast(key_cache_out->data()), + reinterpret_cast(value_cache_out->data()), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + max_seq_len, + max_blocks_per_seq, + num_heads, + head_dim_qk, + head_dim_v, + block_size, + elem_nums, + kv_num_heads); +} + // rope + write template void append_speculate_cache_rope(const QKV_TYPE* qkv, @@ -332,119 +378,129 @@ void SpeculateWriteCacheWithRoPEKernel( auto num_heads = meta_data.q_num_heads; auto kv_num_heads = meta_data.kv_num_heads; - - const float* cos_emb = - rotary_embs ? rotary_embs.get().data() : nullptr; - const float* sin_emb; if (rotary_embs) { - sin_emb = - use_neox_rotary_style - ? rotary_embs.get().data() + max_seq_len * dim_head - : rotary_embs.get().data() + max_seq_len * dim_head / 2; - } - if (cache_quant_type_str == "none") { - append_speculate_cache_rope( - reinterpret_cast(qkv_ptr), - reinterpret_cast(key_cache_out->data()), - reinterpret_cast(value_cache_out->data()), - reinterpret_cast(qkv_out->data()), - block_tables.data(), - padding_offsets.data(), - cum_offsets.data(), - seq_lens.data(), - seq_lens_encoder.data(), - cos_emb, - sin_emb, - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? reinterpret_cast( - const_cast(qkv_biases.get().data())) - : nullptr, - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads, - dim_head, - block_size, - bsz, - token_nums, - stream, - use_neox_rotary_style); - } else if (cache_quant_type_str == "cache_int8") { - append_speculate_cache_int8_rope( - reinterpret_cast(qkv_ptr), - key_cache_out->data(), - value_cache_out->data(), - reinterpret_cast(qkv_out->data()), - block_tables.data(), - padding_offsets.data(), - cum_offsets.data(), - seq_lens.data(), - seq_lens_encoder.data(), - cos_emb, - sin_emb, - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? reinterpret_cast( - const_cast(qkv_biases.get().data())) - : nullptr, - cache_k_scale ? reinterpret_cast( - const_cast(cache_k_scale.get().data())) - : nullptr, - cache_v_scale ? 
reinterpret_cast( - const_cast(cache_v_scale.get().data())) - : nullptr, - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads, - dim_head, - block_size, - bsz, - token_nums, - stream, - use_neox_rotary_style); - } else if (cache_quant_type_str == "cache_int4_zp") { - append_speculate_cache_int4_rope( - reinterpret_cast(qkv_ptr), - key_cache_out->data(), - value_cache_out->data(), - reinterpret_cast(const_cast(qkv_out->data())), - block_tables.data(), - padding_offsets.data(), - cum_offsets.data(), - seq_lens.data(), - seq_lens_encoder.data(), - cos_emb, - sin_emb, - qkv_out_scales ? qkv_out_scales.get().data() : nullptr, - qkv_biases ? reinterpret_cast( - const_cast(qkv_biases.get().data())) - : nullptr, - cache_k_scale ? reinterpret_cast( - const_cast(cache_k_scale.get().data())) - : nullptr, - cache_v_scale ? reinterpret_cast( - const_cast(cache_v_scale.get().data())) - : nullptr, - cache_k_zp ? reinterpret_cast( - const_cast(cache_k_zp.get().data())) - : nullptr, - cache_v_zp ? reinterpret_cast( - const_cast(cache_v_zp.get().data())) - : nullptr, - max_seq_len, - max_blocks_per_seq, - num_heads, - kv_num_heads, - dim_head, - block_size, - bsz, - token_nums, - stream, - use_neox_rotary_style); + const float* cos_emb = + rotary_embs ? rotary_embs.get().data() : nullptr; + const float* sin_emb = + use_neox_rotary_style + ? rotary_embs.get().data() + max_seq_len * dim_head + : rotary_embs.get().data() + max_seq_len * dim_head / 2; + + if (cache_quant_type_str == "none") { + append_speculate_cache_rope( + reinterpret_cast(qkv_ptr), + reinterpret_cast(key_cache_out->data()), + reinterpret_cast(value_cache_out->data()), + reinterpret_cast(qkv_out->data()), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + seq_lens_encoder.data(), + cos_emb, + sin_emb, + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? reinterpret_cast( + const_cast(qkv_biases.get().data())) + : nullptr, + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads, + dim_head, + block_size, + bsz, + token_nums, + stream, + use_neox_rotary_style); + } else if (cache_quant_type_str == "cache_int8") { + append_speculate_cache_int8_rope( + reinterpret_cast(qkv_ptr), + key_cache_out->data(), + value_cache_out->data(), + reinterpret_cast(qkv_out->data()), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + seq_lens_encoder.data(), + cos_emb, + sin_emb, + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? reinterpret_cast( + const_cast(qkv_biases.get().data())) + : nullptr, + cache_k_scale ? reinterpret_cast( + const_cast(cache_k_scale.get().data())) + : nullptr, + cache_v_scale ? reinterpret_cast( + const_cast(cache_v_scale.get().data())) + : nullptr, + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads, + dim_head, + block_size, + bsz, + token_nums, + stream, + use_neox_rotary_style); + } else if (cache_quant_type_str == "cache_int4_zp") { + append_speculate_cache_int4_rope( + reinterpret_cast(qkv_ptr), + key_cache_out->data(), + value_cache_out->data(), + reinterpret_cast(const_cast(qkv_out->data())), + block_tables.data(), + padding_offsets.data(), + cum_offsets.data(), + seq_lens.data(), + seq_lens_encoder.data(), + cos_emb, + sin_emb, + qkv_out_scales ? qkv_out_scales.get().data() : nullptr, + qkv_biases ? reinterpret_cast( + const_cast(qkv_biases.get().data())) + : nullptr, + cache_k_scale ? 
reinterpret_cast( + const_cast(cache_k_scale.get().data())) + : nullptr, + cache_v_scale ? reinterpret_cast( + const_cast(cache_v_scale.get().data())) + : nullptr, + cache_k_zp ? reinterpret_cast( + const_cast(cache_k_zp.get().data())) + : nullptr, + cache_v_zp ? reinterpret_cast( + const_cast(cache_v_zp.get().data())) + : nullptr, + max_seq_len, + max_blocks_per_seq, + num_heads, + kv_num_heads, + dim_head, + block_size, + bsz, + token_nums, + stream, + use_neox_rotary_style); + } else { + PD_THROW( + "cache_quant_type_str should be one of [none, cache_int8, " + "cache_int4_zp]"); + } } else { - PD_THROW( - "cache_quant_type_str should be one of [none, cache_int8, " - "cache_int4_zp]"); + SpeculateWriteCacheKV(meta_data, + qkv, + seq_lens, + padding_offsets, + cum_offsets, + block_tables, + max_seq_len, + stream, + key_cache_out, + value_cache_out); } } diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu index 79ba5cd7bc85..78857845f61c 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_bfloat16_bfloat16_kernel.cu @@ -46,6 +46,7 @@ template void CascadeAppendAttentionC16Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu index 09e149c25233..e10ebb01f08c 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_float16_kernel.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC16Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu index 648d301880b8..f60b0b079f12 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c16_float16_fp8_kernel.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC16Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu index a3f0c95f02e2..818bab5d3aa1 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_bfloat16_kernel.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float 
in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu index 63b03741b0e7..5a483a5fff82 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_bfloat16_fp8_kernel.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu index aae73a837de4..5ab5eb449ad2 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_float16_kernel.cu @@ -46,6 +46,7 @@ template void CascadeAppendAttentionC4Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu index 57c5e36fca93..6404610407c3 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c4_float16_fp8_kernel.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC4Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu index e5d85cad2b5e..dc0388814692 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_bfloat16_kernel.cu @@ -47,6 +47,7 @@ CascadeAppendAttentionC8Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu index e115efacf907..5818d9b5a934 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_bfloat16_fp8_kernel.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu index cfa10da809da..530b75dab128 100644 
--- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_float16_kernel.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu index 842fb6415fca..bb92b986e603 100644 --- a/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu +++ b/csrc/gpu/append_attn/template_instantiation/append_attention_c8_float16_fp8_kerne.cu @@ -45,6 +45,7 @@ template void CascadeAppendAttentionC8Kernel( const int block_shape_q, const int max_seq_len, const int max_dec_len, + const float softmax_scale, const float quant_max_bound, const float quant_min_bound, const float in_scale, diff --git a/csrc/gpu/append_attn/utils.cuh b/csrc/gpu/append_attn/utils.cuh index d5545caf103c..5c871a238025 100644 --- a/csrc/gpu/append_attn/utils.cuh +++ b/csrc/gpu/append_attn/utils.cuh @@ -25,6 +25,7 @@ struct AppendAttnMetaData { int kv_num_heads; int token_nums; int head_dims; + int head_dims_v; int max_blocks_per_seq; }; @@ -277,6 +278,16 @@ __forceinline__ __host__ __device__ void vec_cast( __VA_ARGS__ \ break; \ } \ + case 192: { \ + constexpr size_t HEAD_DIM = 192; \ + __VA_ARGS__ \ + break; \ + } \ + case 256: { \ + constexpr size_t HEAD_DIM = 256; \ + __VA_ARGS__ \ + break; \ + } \ default: { \ PD_THROW("not support the head_dim: ", head_dim); \ } \ diff --git a/csrc/gpu/fused_rotary_position_encoding.cu b/csrc/gpu/fused_rotary_position_encoding.cu new file mode 100644 index 000000000000..c405045890cc --- /dev/null +++ b/csrc/gpu/fused_rotary_position_encoding.cu @@ -0,0 +1,141 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
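+// Overview: this new file adds a fused rotary position embedding (RoPE) op that
+// rotates query and key in place. The launch uses one thread block per token;
+// threads stride over num_heads * rot_dim / 2 rotation pairs for the query and
+// num_kv_heads * rot_dim / 2 pairs for the key. cos_sin_cache is laid out as
+// [max_position, rot_dim], with cos in the first rot_dim / 2 entries of each row
+// and sin in the second half. The is_neox attribute selects NeoX-style
+// rotate-half indexing versus interleaved adjacent-pair indexing.
+//
+// A minimal host-side call sketch (shapes and variable names here are
+// illustrative assumptions, not part of this patch): query and key of shape
+// [num_tokens, num_heads * head_size] and int32 position_ids of shape
+// [num_tokens], updated in place with no extra output allocation:
+//
+//   FusedRotaryPositionEncoding(query, key, position_ids, cos_sin_cache,
+//                               /*head_size=*/head_size, /*is_neox=*/false);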
+ +#include "helper.h" +#include "paddle/extension.h" + +template +inline __device__ void apply_token_rotary_embedding_kernel( + T* __restrict__ arr, + const T* __restrict__ cos_ptr, + const T* __restrict__ sin_ptr, + int rot_offset, + int embed_dim) { + int x_index, y_index; + T cos, sin; + if (IS_NEOX) { + x_index = rot_offset; + y_index = embed_dim + rot_offset; + cos = cos_ptr[x_index]; + sin = sin_ptr[x_index]; + } else { + x_index = 2 * rot_offset; + y_index = 2 * rot_offset + 1; + cos = cos_ptr[x_index / 2]; + sin = sin_ptr[x_index / 2]; + } + + const T x = arr[x_index]; + const T y = arr[y_index]; + arr[x_index] = x * cos - y * sin; + arr[y_index] = y * cos + x * sin; +} + + +template +__global__ void apply_rotary_embedding_kernel( + T* __restrict__ query, // [num_tokens, num_heads, head_size] + T* __restrict__ key, // [num_tokens, num_kv_heads, head_size] + const int* __restrict__ position_ids, // [num_tokens] + const T* __restrict__ cos_sin_cache, // [max_position, 2, rot_dim // 2] + const int rot_dim, + const int64_t query_stride, + const int64_t key_stride, + const int num_heads, + const int num_kv_heads, + const int head_size) { + // Each thread block is responsible for one token. + const int token_idx = blockIdx.x; + int pos = position_ids[token_idx]; + const T* cache_ptr = cos_sin_cache + pos * rot_dim; + + const int embed_dim = rot_dim / 2; + const T* cos_ptr = cache_ptr; + const T* sin_ptr = cache_ptr + embed_dim; + + const int nq = num_heads * embed_dim; + for (int i = threadIdx.x; i < nq; i += blockDim.x) { + const int head_idx = i / embed_dim; + const int64_t token_head = token_idx * query_stride + head_idx * head_size; + const int rot_offset = i % embed_dim; + apply_token_rotary_embedding_kernel( + query + token_head, cos_ptr, sin_ptr, rot_offset, embed_dim); + } + + const int nk = num_kv_heads * embed_dim; + for (int i = threadIdx.x; i < nk; i += blockDim.x) { + const int head_idx = i / embed_dim; + const int64_t token_head = token_idx * key_stride + head_idx * head_size; + const int rot_offset = i % embed_dim; + apply_token_rotary_embedding_kernel( + key + token_head, cos_ptr, sin_ptr, rot_offset, embed_dim); + } +} + + +void FusedRotaryPositionEncoding( + paddle::Tensor& query, // [num_tokens, num_heads, head_size] or + // [num_tokens, num_heads * head_size] + paddle::Tensor& key, + // [num_tokens, num_kv_heads, head_size] or [num_tokens, num_kv_heads * + // head_size] + const paddle::Tensor& position_ids, // [num_tokens] + const paddle::Tensor& cos_sin_cache, // [max_position, rot_dim] + int head_size, + bool is_neox) { + int64_t num_tokens = query.dims()[0]; + int num_heads = query.numel() / num_tokens / head_size; + int num_kv_heads = key.numel() / num_tokens / head_size; + int rot_dim = cos_sin_cache.dims()[1]; + int64_t query_stride = num_heads * head_size; + int64_t key_stride = num_kv_heads * head_size; + + dim3 grid(num_tokens); + dim3 block(std::min(num_heads * rot_dim / 2, 512)); + PD_DISPATCH_FLOATING_AND_HALF_TYPES( + query.dtype(), "apply_rotary_embedding_kernel", [&] { + if (is_neox) { + apply_rotary_embedding_kernel + <<>>(query.data(), + key.data(), + position_ids.data(), + cos_sin_cache.data(), + rot_dim, + query_stride, + key_stride, + num_heads, + num_kv_heads, + head_size); + } else { + apply_rotary_embedding_kernel + <<>>(query.data(), + key.data(), + position_ids.data(), + cos_sin_cache.data(), + rot_dim, + query_stride, + key_stride, + num_heads, + num_kv_heads, + head_size); + } + }); +} + +PD_BUILD_OP(fused_rotary_position_encoding) + 
.Inputs({"query", "key", "position_ids", "cos_sin_cache"}) + .Outputs({"query_out", "key_out"}) + .Attrs({"head_size: int", "is_neox: bool"}) + .SetInplaceMap({{"query", "query_out"}, {"key", "key_out"}}) + .SetKernelFn(PD_KERNEL(FusedRotaryPositionEncoding)); \ No newline at end of file diff --git a/csrc/gpu/get_position_ids.cu b/csrc/gpu/get_position_ids.cu new file mode 100644 index 000000000000..dbd25497a2fa --- /dev/null +++ b/csrc/gpu/get_position_ids.cu @@ -0,0 +1,75 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" +#include "paddle/extension.h" + +__global__ void GetPositionIdsKernel( + const int* seq_lens_encoder, // [bsz] 每个批次的 encoder 长度 + const int* seq_lens_decoder, // [bsz] 每个批次的 decoder 长度 + const int* seq_lens_this_time, + int* position_ids, // 输出的一维 position_ids + const int bsz) { // 批次大小 + // 当前线程索引(每个线程对应一个批次) + int tid = threadIdx.x; + if (tid >= bsz) return; + + // 动态计算当前批次的偏移量 + int offset = 0; + for (int i = 0; i < tid; i++) { + offset += seq_lens_encoder[i]; + if (seq_lens_decoder[i] > 0) { + offset += seq_lens_this_time[i]; + } + } + + // 当前批次的 encoder 和 decoder 长度 + int encoder_len = seq_lens_encoder[tid]; + int decoder_len = seq_lens_decoder[tid]; + int seq_len_this_time = seq_lens_this_time[tid]; + + // 写入 encoder 的 position_ids + for (int i = 0; i < encoder_len; i++) { + position_ids[offset + i] = i; + } + offset += encoder_len; + + // 写入 decoder 的 position_ids + if (decoder_len > 0) { + for (int i = 0; i < seq_len_this_time; i++) { + position_ids[offset + i] = decoder_len + i; // 使用 decoder 长度本身 + } + } +} + + +void GetPositionIds(const paddle::Tensor& seq_lens_encoder, + const paddle::Tensor& seq_lens_decoder, + const paddle::Tensor& seq_lens_this_time, + const paddle::Tensor& position_ids) { + const int bsz = seq_lens_encoder.shape()[0]; + + GetPositionIdsKernel<<<1, bsz, 0, position_ids.stream()>>>( + seq_lens_encoder.data(), + seq_lens_decoder.data(), + seq_lens_this_time.data(), + const_cast(position_ids.data()), + bsz); +} + +PD_BUILD_OP(get_position_ids) + .Inputs({"seq_lens_encoder", "seq_lens_decoder", "seq_lens_this_time", "position_ids"}) + .Outputs({"position_ids_out"}) + .SetInplaceMap({{"position_ids", "position_ids_out"}}) + .SetKernelFn(PD_KERNEL(GetPositionIds)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_postprocess.cu b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_postprocess.cu new file mode 100644 index 000000000000..b4e465e11699 --- /dev/null +++ b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_postprocess.cu @@ -0,0 +1,75 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/extension.h" + + +__global__ void draft_model_update_seq_lens_this_time_kernel( + const int64_t* base_model_draft_tokens, + int* base_model_seq_lens_this_time, + const int* base_model_seq_lens_encoder, + const bool* base_model_stop_flags, + int bsz, + int base_model_draft_token_len) { + int tid = threadIdx.x; + if (tid < bsz) { + if (!base_model_stop_flags[tid] && base_model_seq_lens_encoder[tid] == 0) { + const int64_t* base_model_draft_tokens_now = + base_model_draft_tokens + tid * base_model_draft_token_len; + int token_num = 0; + + for (int i = 0; i < base_model_draft_token_len; ++i) { + if (base_model_draft_tokens_now[i] != -1) { + token_num++; + } + } + base_model_seq_lens_this_time[tid] = token_num; + } else if (base_model_stop_flags[tid]) { + base_model_seq_lens_this_time[tid] = 0; + } + } +} + + +void DraftModelPostprocess(const paddle::Tensor& base_model_draft_tokens, + const paddle::Tensor& base_model_seq_lens_this_time, + const paddle::Tensor& base_model_seq_lens_encoder, + const paddle::Tensor& base_model_stop_flags) { + int real_bsz = base_model_seq_lens_this_time.shape()[0]; + auto cu_stream = base_model_seq_lens_this_time.stream(); + constexpr int BlockSize = 512; + int base_model_draft_token_len = base_model_draft_tokens.shape()[1]; + draft_model_update_seq_lens_this_time_kernel<<<1, BlockSize, 0, cu_stream>>>( + base_model_draft_tokens.data(), + const_cast(base_model_seq_lens_this_time.data()), + base_model_seq_lens_encoder.data(), + base_model_stop_flags.data(), + real_bsz, + base_model_draft_token_len); +} + + +PD_BUILD_OP(draft_model_postprocess) + .Inputs({"base_model_draft_tokens", + "base_model_seq_lens_this_time", + "base_model_seq_lens_encoder", + "base_model_stop_flags"}) + .Outputs({"base_model_draft_tokens_out", + "base_model_seq_lens_this_time_out", + "base_model_stop_flags_out"}) + .SetInplaceMap({{"base_model_draft_tokens", "base_model_draft_tokens_out"}, + {"base_model_seq_lens_this_time", + "base_model_seq_lens_this_time_out"}, + {"base_model_stop_flags", "base_model_stop_flags_out"}}) + .SetKernelFn(PD_KERNEL(DraftModelPostprocess)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_preprocess.cu b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_preprocess.cu new file mode 100644 index 000000000000..d878ef32cdb6 --- /dev/null +++ b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_preprocess.cu @@ -0,0 +1,239 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" +#include "paddle/extension.h" + +template +__global__ void draft_model_preprocess_kernel( + int64_t* draft_tokens, + int64_t* input_ids, + bool* stop_flags, + int* seq_lens_this_time, + int* seq_lens_encoder, + int* seq_lens_decoder, + int64_t* step_idx, + int* first_token_record, + bool* not_need_stop, + const int64_t* accept_tokens, + const int* accept_num, + const int* base_model_seq_lens_encoder, + const int* base_model_seq_lens_decoder, + const int64_t* base_model_step_idx, + const bool* base_model_stop_flags, + int64_t* base_model_draft_tokens, + const int bsz, + const int max_draft_token, + const int accept_tokens_len, + const int draft_tokens_len, + const int input_ids_len, + const int base_model_draft_tokens_len) { + typedef cub::BlockReduce BlockReduce; + __shared__ typename BlockReduce::TempStorage temp_storage; + int64_t not_stop_flag = 0; + + int tid = threadIdx.x; + + if (tid < bsz) { + auto base_model_step_idx_now = base_model_step_idx[tid]; + auto* accept_tokens_now = accept_tokens + tid * accept_tokens_len; + auto* draft_tokens_now = draft_tokens + tid * draft_tokens_len; + auto accept_num_now = accept_num[tid]; + auto* input_ids_now = input_ids + tid * input_ids_len; + auto* base_model_draft_tokens_now = + base_model_draft_tokens + tid * base_model_draft_tokens_len; +#pragma unroll + for (int i = 1; i < base_model_draft_tokens_len; i++) { + base_model_draft_tokens_now[i] = -1; + } + + if (!base_model_stop_flags[tid]) { + not_stop_flag = 1; + // 1. first token + if (base_model_step_idx_now == 0) { + seq_lens_this_time[tid] = 0; + not_stop_flag = 0; + } else if (base_model_step_idx_now == 1 && first_token_record[tid] > 0) { + // Can be extended to first few tokens + seq_lens_encoder[tid] = first_token_record[tid]; + first_token_record[tid] = -1; + stop_flags[tid] = false; + int64_t base_model_first_token = accept_tokens_now[0]; + int position = base_model_seq_lens_decoder[tid]; + if (EAGLE) { + input_ids_now[position - 1] = base_model_first_token; + seq_lens_this_time[tid] = base_model_seq_lens_decoder[tid]; + } else { + input_ids_now[position] = base_model_first_token; + seq_lens_this_time[tid] = base_model_seq_lens_decoder[tid] + 1; + } + } else if (accept_num_now <= + max_draft_token) /*Accept partial draft tokens*/ { + // Base Model reject stop + if (stop_flags[tid]) { + stop_flags[tid] = false; + seq_lens_decoder[tid] = base_model_seq_lens_decoder[tid]; + step_idx[tid] = base_model_step_idx[tid]; + } else { + seq_lens_decoder[tid] -= max_draft_token - accept_num_now; + step_idx[tid] -= max_draft_token - accept_num_now; + } + int64_t modified_token = accept_tokens_now[accept_num_now - 1]; + draft_tokens_now[0] = modified_token; + seq_lens_this_time[tid] = 1; + + } else /*Accept all draft tokens*/ { + draft_tokens_now[1] = accept_tokens_now[max_draft_token]; + seq_lens_this_time[tid] = 2; + } + } else { + stop_flags[tid] = true; + seq_lens_this_time[tid] = 0; + seq_lens_decoder[tid] = 0; + } + } + __syncthreads(); + int64_t not_stop_flag_sum = BlockReduce(temp_storage).Sum(not_stop_flag); + if (tid == 0) { + not_need_stop[0] = not_stop_flag_sum > 0; + } +} + + +void DraftModelPreprocess(const paddle::Tensor& draft_tokens, + const paddle::Tensor& input_ids, + const paddle::Tensor& stop_flags, + const paddle::Tensor& seq_lens_this_time, + const paddle::Tensor& seq_lens_encoder, + const paddle::Tensor& seq_lens_decoder, + const paddle::Tensor& 
step_idx, + const paddle::Tensor& first_token_record, + const paddle::Tensor& not_need_stop, + const paddle::Tensor& accept_tokens, + const paddle::Tensor& accept_num, + const paddle::Tensor& base_model_seq_lens_encoder, + const paddle::Tensor& base_model_seq_lens_decoder, + const paddle::Tensor& base_model_step_idx, + const paddle::Tensor& base_model_stop_flags, + const paddle::Tensor& base_model_draft_tokens, + const int max_draft_token, + const bool truncate_first_token) { + int real_bsz = seq_lens_this_time.shape()[0]; + int accept_tokens_len = accept_tokens.shape()[1]; + int input_ids_len = input_ids.shape()[1]; + int draft_tokens_len = draft_tokens.shape()[1]; + auto cu_stream = seq_lens_this_time.stream(); + constexpr int BlockSize = 256; + int base_model_draft_tokens_len = base_model_draft_tokens.shape()[1]; + auto not_need_stop_gpu = + not_need_stop.copy_to(seq_lens_this_time.place(), false); + + + if (truncate_first_token) { + draft_model_preprocess_kernel + <<<1, BlockSize, 0, cu_stream>>>( + const_cast(draft_tokens.data()), + const_cast(input_ids.data()), + const_cast(stop_flags.data()), + const_cast(seq_lens_this_time.data()), + const_cast(seq_lens_encoder.data()), + const_cast(seq_lens_decoder.data()), + const_cast(step_idx.data()), + const_cast(first_token_record.data()), + const_cast(not_need_stop_gpu.data()), + accept_tokens.data(), + accept_num.data(), + base_model_seq_lens_encoder.data(), + base_model_seq_lens_decoder.data(), + base_model_step_idx.data(), + base_model_stop_flags.data(), + const_cast(base_model_draft_tokens.data()), + real_bsz, + max_draft_token, + accept_tokens_len, + draft_tokens_len, + input_ids_len, + base_model_draft_tokens_len); + } else { + draft_model_preprocess_kernel + <<<1, BlockSize, 0, cu_stream>>>( + const_cast(draft_tokens.data()), + const_cast(input_ids.data()), + const_cast(stop_flags.data()), + const_cast(seq_lens_this_time.data()), + const_cast(seq_lens_encoder.data()), + const_cast(seq_lens_decoder.data()), + const_cast(step_idx.data()), + const_cast(first_token_record.data()), + const_cast(not_need_stop_gpu.data()), + accept_tokens.data(), + accept_num.data(), + base_model_seq_lens_encoder.data(), + base_model_seq_lens_decoder.data(), + base_model_step_idx.data(), + base_model_stop_flags.data(), + const_cast(base_model_draft_tokens.data()), + real_bsz, + max_draft_token, + accept_tokens_len, + draft_tokens_len, + input_ids_len, + base_model_draft_tokens_len); + } + + + auto not_need_stop_cpu = + not_need_stop_gpu.copy_to(not_need_stop.place(), false); + bool* not_need_stop_data = const_cast(not_need_stop.data()); + not_need_stop_data[0] = not_need_stop_cpu.data()[0]; +} + + +PD_BUILD_OP(draft_model_preprocess) + .Inputs({"draft_tokens", + "input_ids", + "stop_flags", + "seq_lens_this_time", + "seq_lens_encoder", + "seq_lens_decoder", + "step_idx", + "first_token_record", + "not_need_stop", + "accept_tokens", + "accept_num", + "base_model_seq_lens_encoder", + "base_model_seq_lens_decoder", + "base_model_step_idx", + "base_model_stop_flags", + "base_model_draft_tokens"}) + .Outputs({"draft_tokens_out", + "input_ids_out", + "stop_flags_out", + "seq_lens_this_time_out", + "seq_lens_encoder_out", + "seq_lens_decoder_out", + "step_idx_out", + "not_need_stop_out", + "first_token_record_out"}) + .Attrs({"max_draft_token: int", "truncate_first_token: bool"}) + .SetInplaceMap({{"draft_tokens", "draft_tokens_out"}, + {"input_ids", "input_ids_out"}, + {"stop_flags", "stop_flags_out"}, + {"seq_lens_this_time", "seq_lens_this_time_out"}, + 
{"seq_lens_encoder", "seq_lens_encoder_out"}, + {"seq_lens_decoder", "seq_lens_decoder_out"}, + {"step_idx", "step_idx_out"}, + {"not_need_stop", "not_need_stop_out"}, + {"first_token_record", "first_token_record_out"}}) + .SetKernelFn(PD_KERNEL(DraftModelPreprocess)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_set_value_by_flags.cu b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_set_value_by_flags.cu new file mode 100644 index 000000000000..8bad9562f86c --- /dev/null +++ b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_set_value_by_flags.cu @@ -0,0 +1,78 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + + +__global__ void update_pre_ids_kernel(const int64_t* draft_tokens, + int64_t* pre_ids_all, + const bool* stop_flags, + int* seq_lens_this_time, + const int64_t* step_idx, + int bs, + int pre_id_length, + int max_draft_token) { + int tid = threadIdx.x; + if (tid < bs && seq_lens_this_time[tid] != 0 && !stop_flags[tid]) { + int64_t* pre_ids_all_now = pre_ids_all + tid * pre_id_length; + const int64_t* draft_token_now = draft_tokens + tid * max_draft_token; + const int seq_len_this_time = seq_lens_this_time[tid]; + if (step_idx[tid] - 1 > 0 /*Decoder Step*/) { + for (int i = 0; i < seq_len_this_time; ++i) { + pre_ids_all_now[step_idx[tid] - i] = + draft_token_now[seq_len_this_time - 1 - i]; + } + } else if (step_idx[tid] == 1 /*Encoder Step*/) { + pre_ids_all_now[1] = draft_token_now[0]; + } + seq_lens_this_time[tid] = 1; + } +} + + +void SpeculateDraftModelUpdate(const paddle::Tensor& draft_tokens, + const paddle::Tensor& pre_ids_all, + const paddle::Tensor& stop_flags, + const paddle::Tensor& seq_lens_this_time, + const paddle::Tensor& seq_lens_encoder, + const paddle::Tensor& seq_lens_decoder, + const paddle::Tensor& step_idx) { + int64_t real_bs = seq_lens_this_time.shape()[0]; + int64_t pre_id_length = pre_ids_all.shape()[1]; + auto cu_stream = seq_lens_this_time.stream(); + int64_t max_draft_token = draft_tokens.shape()[1]; + + int block_size = (real_bs + 32 - 1) / 32 * 32; + update_pre_ids_kernel<<<1, block_size, 0, cu_stream>>>( + draft_tokens.data(), + const_cast(pre_ids_all.data()), + stop_flags.data(), + const_cast(seq_lens_this_time.data()), + step_idx.data(), + real_bs, + pre_id_length, + max_draft_token); +} + +PD_BUILD_OP(draft_model_set_value_by_flags) + .Inputs({"draft_tokens", + "pre_ids_all", + "stop_flags", + "seq_lens_this_time", + "seq_lens_encoder", + "seq_lens_decoder", + "step_idx"}) + .Outputs({"pre_ids_all_out"}) + .SetInplaceMap({{"pre_ids_all", "pre_ids_all_out"}}) + .SetKernelFn(PD_KERNEL(SpeculateDraftModelUpdate)); diff --git a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_update.cu b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_update.cu new file mode 100644 index 000000000000..0a64239be1da --- /dev/null +++ 
b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/draft_model_update.cu @@ -0,0 +1,201 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" +#include "paddle/extension.h" + +template +__global__ void draft_model_update_kernel(const int64_t* inter_next_tokens, + int64_t* draft_tokens, + int64_t* pre_ids, + int* seq_lens_this_time, + int* seq_lens_encoder, + int* seq_lens_decoder, + int64_t* step_idx, + const int* output_cum_offsets, + bool* stop_flags, + bool* not_need_stop, + const int64_t* max_dec_len, + const int64_t* end_ids, + int64_t* base_model_draft_tokens, + const int bsz, + const int max_draft_token, + const int pre_id_length, + const int max_base_model_draft_token, + const int end_ids_len, + const int max_seq_len, + const int substep) { + typedef cub::BlockReduce BlockReduce; + __shared__ typename BlockReduce::TempStorage temp_storage; + int64_t stop_flag_now_int = 0; + + int tid = threadIdx.x; + if (tid < bsz) { + auto* draft_token_now = draft_tokens + tid * max_draft_token; + auto* pre_ids_now = pre_ids + tid * pre_id_length; + auto* base_model_draft_tokens_now = + base_model_draft_tokens + tid * max_base_model_draft_token; + const int next_tokens_start_id = + tid * max_seq_len - output_cum_offsets[tid]; + auto* next_tokens_start = inter_next_tokens + next_tokens_start_id; + auto seq_len_this_time = seq_lens_this_time[tid]; + + // 1. update step_idx && seq_lens_dec + if (!stop_flags[tid] /* seq_lens_decoder > 0 or seq_lens_encoder > 0 */) { + int64_t token_this_time = -1; + // single and multi token + if (seq_lens_decoder[tid] > 0) { + seq_lens_decoder[tid] += seq_len_this_time; + token_this_time = next_tokens_start[seq_len_this_time - 1]; + draft_token_now[0] = next_tokens_start[seq_len_this_time - 1]; + base_model_draft_tokens_now[substep + 1] = token_this_time; + for (int i = 0; i < seq_len_this_time; ++i) { + pre_ids_now[step_idx[tid] + 1 + i] = next_tokens_start[i]; + } + step_idx[tid] += seq_len_this_time; + + } else { + token_this_time = next_tokens_start[0]; + + seq_lens_decoder[tid] = seq_lens_encoder[tid]; + seq_lens_encoder[tid] = 0; + pre_ids_now[1] = token_this_time; + step_idx[tid] += 1; + draft_token_now[0] = token_this_time; + base_model_draft_tokens_now[substep + 1] = token_this_time; + } + + // multi_end + if (is_in_end(token_this_time, end_ids, end_ids_len)) { + stop_flags[tid] = true; + stop_flag_now_int = 1; + // max_dec_len + } else if (step_idx[tid] >= max_dec_len[tid]) { + stop_flags[tid] = true; + draft_token_now[seq_len_this_time - 1] = end_ids[0]; + base_model_draft_tokens_now[substep + 1] = end_ids[0]; + stop_flag_now_int = 1; + } + + } else { + draft_token_now[0] = -1; + base_model_draft_tokens_now[substep + 1] = -1; + stop_flag_now_int = 1; + } + + // 2. 
set end + if (!stop_flags[tid]) { + seq_lens_this_time[tid] = 1; + } else { + seq_lens_this_time[tid] = 0; + } + } + __syncthreads(); + int64_t stop_sum = BlockReduce(temp_storage).Sum(stop_flag_now_int); + if (tid == 0) { + not_need_stop[0] = stop_sum < bsz; + } +} + + +void DraftModelUpdate(const paddle::Tensor& inter_next_tokens, + const paddle::Tensor& draft_tokens, + const paddle::Tensor& pre_ids, + const paddle::Tensor& seq_lens_this_time, + const paddle::Tensor& seq_lens_encoder, + const paddle::Tensor& seq_lens_decoder, + const paddle::Tensor& step_idx, + const paddle::Tensor& output_cum_offsets, + const paddle::Tensor& stop_flags, + const paddle::Tensor& not_need_stop, + const paddle::Tensor& max_dec_len, + const paddle::Tensor& end_ids, + const paddle::Tensor& base_model_draft_tokens, + const int max_seq_len, + const int substep) { + auto seq_lens_this_time_shape = seq_lens_this_time.shape(); + auto cu_stream = seq_lens_this_time.stream(); + const int real_bsz = seq_lens_this_time_shape[0]; + auto not_need_stop_gpu = + not_need_stop.copy_to(seq_lens_this_time.place(), false); + const int end_ids_len = end_ids.shape()[0]; + const int max_draft_token = draft_tokens.shape()[1]; + const int pre_id_length = pre_ids.shape()[1]; + const int max_base_model_draft_token = base_model_draft_tokens.shape()[1]; + constexpr int BlockSize = 512; + + draft_model_update_kernel<<<1, BlockSize, 0, cu_stream>>>( + inter_next_tokens.data(), + const_cast(draft_tokens.data()), + const_cast(pre_ids.data()), + const_cast(seq_lens_this_time.data()), + const_cast(seq_lens_encoder.data()), + const_cast(seq_lens_decoder.data()), + const_cast(step_idx.data()), + output_cum_offsets.data(), + const_cast(stop_flags.data()), + not_need_stop_gpu.data(), + max_dec_len.data(), + end_ids.data(), + const_cast(base_model_draft_tokens.data()), + real_bsz, + max_draft_token, + pre_id_length, + max_base_model_draft_token, + end_ids_len, + max_seq_len, + substep); + + + auto not_need_stop_cpu = + not_need_stop_gpu.copy_to(not_need_stop.place(), false); + bool* not_need_stop_data = const_cast(not_need_stop.data()); + not_need_stop_data[0] = not_need_stop_cpu.data()[0]; +} + + +PD_BUILD_OP(draft_model_update) + .Inputs({"inter_next_tokens", + "draft_tokens", + "pre_ids", + "seq_lens_this_time", + "seq_lens_encoder", + "seq_lens_decoder", + "step_idx", + "output_cum_offsets", + "stop_flags", + "not_need_stop", + "max_dec_len", + "end_ids", + "base_model_draft_tokens"}) + .Attrs({"max_seq_len: int", "substep: int"}) + .Outputs({"draft_tokens_out", + "pre_ids_out", + "seq_lens_this_time_out", + "seq_lens_encoder_out", + "seq_lens_decoder_out", + "step_idx_out", + "stop_flags_out", + "not_need_stop_out", + "base_model_draft_tokens_out"}) + .SetInplaceMap({{"draft_tokens", "draft_tokens_out"}, + {"pre_ids", "pre_ids_out"}, + {"seq_lens_this_time", "seq_lens_this_time_out"}, + {"seq_lens_encoder", "seq_lens_encoder_out"}, + {"seq_lens_decoder", "seq_lens_decoder_out"}, + {"step_idx", "step_idx_out"}, + {"stop_flags", "stop_flags_out"}, + {"not_need_stop", "not_need_stop_out"}, + {"base_model_draft_tokens", "base_model_draft_tokens_out"}}) + .SetKernelFn(PD_KERNEL(DraftModelUpdate)); diff --git a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/eagle_get_base_model_hidden_states.cu b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/eagle_get_base_model_hidden_states.cu new file mode 100644 index 000000000000..41b59c57b884 --- /dev/null +++ 
b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/eagle_get_base_model_hidden_states.cu @@ -0,0 +1,244 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" +#include "paddle/extension.h" + +// #define DEBUG_EAGLE_KERNEL + +__global__ void ComputeOrderKernel(const int* seq_lens_this_time, + const int* seq_lens_encoder, + const int* base_model_seq_lens_this_time, + const int* base_model_seq_lens_encoder, + const int* accept_nums, + int* positon_map, + int* output_token_num, + const int bsz, + const int actual_draft_token_num, + const int input_token_num) { + int in_offset = 0; // input_offset(long) + int out_offset = 0; // output_offset(short) + if (threadIdx.x == 0) { + for (int i = 0; i < bsz; ++i) { + int cur_base_model_seq_lens_this_time = base_model_seq_lens_this_time[i]; + int cur_base_model_seq_lens_encoder = base_model_seq_lens_encoder[i]; + int cur_seq_lens_this_time = seq_lens_this_time[i]; + int accept_num = accept_nums[i]; + int cur_seq_lens_encoder = seq_lens_encoder[i]; +#ifdef DEBUG_EAGLE_KERNEL + printf( + "batch %d: cur_base_model_seq_lens_this_time%d. " + "cur_seq_lens_this_time%d, accept_num %d\n", + i, + cur_base_model_seq_lens_this_time, + cur_seq_lens_this_time, + accept_num); +#endif + // 1. eagle encoder. Base step=1 + if (cur_seq_lens_encoder > 0) { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d: cur_seq_lens_encoder > 0 \n", i); +#endif + for (int j = 0; j < cur_seq_lens_encoder; j++) { + positon_map[in_offset++] = out_offset++; + } + // 2. base model encoder. Base step=0 + } else if (cur_base_model_seq_lens_encoder != 0) { + // 3. New end + } else if (cur_base_model_seq_lens_this_time != 0 && + cur_seq_lens_this_time == 0) { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d: base=0. draft !=0 \n", i); +#endif + + in_offset += cur_base_model_seq_lens_this_time; + // 4. 
stopped + } else if (cur_base_model_seq_lens_this_time == 0 && + cur_seq_lens_this_time == 0) /* end */ { + } else { + if (accept_num <= + actual_draft_token_num) /*Accept partial draft tokens*/ { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d: accept_num <= actual_draft_token_num \n", i); +#endif + positon_map[in_offset + accept_num - 1] = out_offset++; + in_offset += cur_base_model_seq_lens_this_time; + } else /*Accept all draft tokens*/ { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d: accept_num > actual_draft_token_num \n", i); +#endif + positon_map[in_offset + accept_num - 2] = out_offset++; + positon_map[in_offset + accept_num - 1] = out_offset++; + in_offset += cur_base_model_seq_lens_this_time; + } + } + } + output_token_num[0] = out_offset; +#ifdef DEBUG_EAGLE_KERNEL + printf("position map output_token_num%d:\n", output_token_num[0]); + for (int i = 0; i < output_token_num[0]; i++) { + printf("%d ", positon_map[i]); + } + printf("\n"); +#endif + } +} + +template +__global__ void rebuildHiddenStatesKernel(const T* input, + const int* position_map, + T* out, + const int dim_embed, + const int elem_cnt) { + using LoadT = AlignedVector; + LoadT src_vec; + + int global_thread_idx = blockIdx.x * blockDim.x + threadIdx.x; + for (int elem_idx = global_thread_idx * VecSize; elem_idx < elem_cnt; + elem_idx += blockDim.x * gridDim.x * VecSize) { + int ori_token_idx = elem_idx / dim_embed; + int token_idx = position_map[ori_token_idx]; + if (token_idx >= 0) { + int offset = elem_idx % dim_embed; + if (token_idx == 0) { + } + Load(input + ori_token_idx * dim_embed + offset, &src_vec); + Store(src_vec, out + token_idx * dim_embed + offset); + } + } +} + + +template +std::vector DispatchDtype( + const paddle::Tensor& input, + const paddle::Tensor& seq_lens_this_time, + const paddle::Tensor& seq_lens_encoder, + const paddle::Tensor& seq_lens_decoder, + const paddle::Tensor& stop_flags, + const paddle::Tensor& accept_nums, + const paddle::Tensor& base_model_seq_lens_this_time, + const paddle::Tensor& base_model_seq_lens_encoder, + const int actual_draft_token_num) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + auto input_token_num = input.shape()[0]; + + // auto output_token_num = padding_offset.shape()[0]; + auto dim_embed = input.shape()[1]; + + int bsz = seq_lens_this_time.shape()[0]; + + auto position_map = paddle::full( + {input_token_num}, -1, seq_lens_this_time.dtype(), input.place()); + auto output_token_num = paddle::full( + {1}, 0, seq_lens_this_time.dtype(), seq_lens_this_time.place()); + ComputeOrderKernel<<<1, 1>>>(seq_lens_this_time.data(), + seq_lens_encoder.data(), + base_model_seq_lens_this_time.data(), + base_model_seq_lens_encoder.data(), + accept_nums.data(), + position_map.data(), + output_token_num.data(), + bsz, + actual_draft_token_num, + input_token_num); + + int output_token_num_cpu = + output_token_num.copy_to(paddle::CPUPlace(), false).data()[0]; + + auto out = paddle::full( + {output_token_num_cpu, dim_embed}, -1, input.dtype(), input.place()); + + constexpr int packSize = VEC_16B / (sizeof(DataType_)); + int elem_cnt = input_token_num * dim_embed; + + assert(elem_cnt % packSize == 0); + + int pack_num = elem_cnt / packSize; + + int grid_size = 1; + + GetNumBlocks(pack_num, &grid_size); + + constexpr int thread_per_block = 128; + + rebuildHiddenStatesKernel + <<>>( + reinterpret_cast(input.data()), + position_map.data(), + reinterpret_cast(out.data()), + dim_embed, + elem_cnt); + + return 
{out}; +} + + +std::vector EagleGetHiddenStates( + const paddle::Tensor& input, + const paddle::Tensor& seq_lens_this_time, + const paddle::Tensor& seq_lens_encoder, + const paddle::Tensor& seq_lens_decoder, + const paddle::Tensor& stop_flags, + const paddle::Tensor& accept_nums, + const paddle::Tensor& base_model_seq_lens_this_time, + const paddle::Tensor& base_model_seq_lens_encoder, + const int actual_draft_token_num) { + switch (input.dtype()) { + case paddle::DataType::FLOAT16: { + return DispatchDtype( + input, + seq_lens_this_time, + seq_lens_encoder, + seq_lens_decoder, + stop_flags, + accept_nums, + base_model_seq_lens_this_time, + base_model_seq_lens_encoder, + actual_draft_token_num); + } + case paddle::DataType::BFLOAT16: { + return DispatchDtype( + input, + seq_lens_this_time, + seq_lens_encoder, + seq_lens_decoder, + stop_flags, + accept_nums, + base_model_seq_lens_this_time, + base_model_seq_lens_encoder, + actual_draft_token_num); + } + default: { + PD_THROW("Not support this data type"); + } + } +} + + +PD_BUILD_OP(eagle_get_base_model_hidden_states) + .Inputs({"input", + "seq_lens_this_time", + "seq_lens_encoder", + "seq_lens_decoder", + "stop_flags", + "accept_nums", + "base_model_seq_lens_this_time", + "base_model_seq_lens_encoder"}) + .Attrs({"actual_draft_token_num: int"}) + .Outputs({"out"}) + .SetKernelFn(PD_KERNEL(EagleGetHiddenStates)); diff --git a/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/eagle_get_self_hidden_states.cu b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/eagle_get_self_hidden_states.cu new file mode 100644 index 000000000000..da3a29ff39da --- /dev/null +++ b/csrc/gpu/speculate_decoding_kernels/draft_model_kernels/eagle_get_self_hidden_states.cu @@ -0,0 +1,190 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" +#include "paddle/extension.h" + + +// #define DEBUG_EAGLE_KERNEL + +__global__ void computeOrderKernel(const int* last_seq_lens_this_time, + const int* seq_lens_this_time, + const int64_t* step_idx, + int* src_map, + int* output_token_num, + int bsz) { + int in_offset = 0; + int out_offset = 0; + if (threadIdx.x == 0) { + for (int i = 0; i < bsz; ++i) { + int cur_seq_lens_this_time = seq_lens_this_time[i]; + int cur_last_seq_lens_this_time = last_seq_lens_this_time[i]; +#ifdef DEBUG_EAGLE_KERNEL + printf( + "batch %d: cur_seq_lens_this_time:%d. " + "cur_last_seq_lens_this_time:%d\n", + i, + cur_seq_lens_this_time, + cur_last_seq_lens_this_time); +#endif + // 1. encoder + if (step_idx[i] == 1 && cur_seq_lens_this_time > 0) { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d last_step is encoder \n", i); +#endif + in_offset += 1; + src_map[out_offset++] = in_offset - 1; +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d finish. src_map[%d]=%d \n", + i, + out_offset - 1, + in_offset - 1); +#endif + // 2. 
decoder + } else if (cur_seq_lens_this_time > 0) /* =1 */ { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d is decoder\n", i); +#endif + in_offset += cur_last_seq_lens_this_time; + src_map[out_offset++] = in_offset - 1; + // 3. stop + } else { + // first token end + if (step_idx[i] == 1) { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d finished in first token \n", i); +#endif + in_offset += cur_last_seq_lens_this_time > 0 ? 1 : 0; + // normal end + } else { +#ifdef DEBUG_EAGLE_KERNEL + printf("batch %d finished in non-first token \n", i); +#endif + in_offset += cur_last_seq_lens_this_time; + } + } + } + output_token_num[0] = out_offset; +#ifdef DEBUG_EAGLE_KERNEL + printf("position map output_token_num%d:\n", output_token_num[0]); + for (int i = 0; i < output_token_num[0]; i++) { + printf("%d ", src_map[i]); + } + printf("\n"); +#endif + } +} + +template +__global__ void rebuildSelfHiddenStatesKernel( + const T* input, int* src_map, T* output, int dim_embed, int elem_cnt) { + using LoadT = AlignedVector; + LoadT src_vec; + + int global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x; + for (int elem_id = global_thread_idx * PackSize; elem_id < elem_cnt; + elem_id += blockDim.x * gridDim.x * PackSize) { + int output_token_idx = elem_id / dim_embed; + int input_token_idx = src_map[output_token_idx]; + int offset = elem_id % dim_embed; + Load(input + input_token_idx * dim_embed + offset, &src_vec); + Store(src_vec, output + output_token_idx * dim_embed + offset); + } +} + + +template +std::vector DispatchDtype( + const paddle::Tensor input, + const paddle::Tensor last_seq_lens_this_time, + const paddle::Tensor seq_lens_this_time, + const paddle::Tensor step_idx) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + int input_token_num = input.shape()[0]; + int dim_embed = input.shape()[1]; + int bsz = seq_lens_this_time.shape()[0]; + auto src_map = paddle::full({input_token_num}, + -1, + seq_lens_this_time.dtype(), + seq_lens_this_time.place()); + auto output_token_num = paddle::full( + {1}, 0, seq_lens_this_time.dtype(), seq_lens_this_time.place()); + + computeOrderKernel<<<1, 1, 0, seq_lens_this_time.stream()>>>( + last_seq_lens_this_time.data(), + seq_lens_this_time.data(), + step_idx.data(), + src_map.data(), + output_token_num.data(), + bsz); + + int output_token_num_cpu = + output_token_num.copy_to(paddle::CPUPlace(), false).data()[0]; + + auto out = paddle::full( + {output_token_num_cpu, dim_embed}, -1, input.type(), input.place()); + + constexpr int packSize = VEC_16B / (sizeof(DataType_)); + int elem_cnt = output_token_num_cpu * dim_embed; + // printf("output_token_num: %d, dim_embed: %d, cnt: %d. 
packSize: %d\n", + // output_token_num_cpu, dim_embed,elem_cnt, packSize); + assert(elem_cnt % packSize == 0); + + int pack_num = elem_cnt / packSize; + + int grid_size = 1; + + GetNumBlocks(pack_num, &grid_size); + + constexpr int threadPerBlock = 128; + + rebuildSelfHiddenStatesKernel + <<>>( + reinterpret_cast(input.data()), + src_map.data(), + reinterpret_cast(out.data()), + dim_embed, + elem_cnt); + + + return {out}; +} + + +std::vector EagleGetSelfHiddenStates( + const paddle::Tensor& input, + const paddle::Tensor& last_seq_lens_this_time, + const paddle::Tensor& seq_lens_this_time, + const paddle::Tensor& step_idx) { + switch (input.dtype()) { + case paddle::DataType::BFLOAT16: + return DispatchDtype( + input, last_seq_lens_this_time, seq_lens_this_time, step_idx); + case paddle::DataType::FLOAT16: + return DispatchDtype( + input, last_seq_lens_this_time, seq_lens_this_time, step_idx); + default: + PD_THROW("Not support this data type"); + } +} + + +PD_BUILD_OP(eagle_get_self_hidden_states) + .Inputs( + {"input", "last_seq_lens_this_time", "seq_lens_this_time", "step_idx"}) + .Outputs({"out"}) + .SetKernelFn(PD_KERNEL(EagleGetSelfHiddenStates)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/ngram_match.cc b/csrc/gpu/speculate_decoding_kernels/ngram_match.cc index 3c19064b2f66..958b01ddece3 100644 --- a/csrc/gpu/speculate_decoding_kernels/ngram_match.cc +++ b/csrc/gpu/speculate_decoding_kernels/ngram_match.cc @@ -37,6 +37,7 @@ void find_candidate_pred_tokens(const int64_t *input_ids, int32_t *seq_lens_this_time, int32_t *seq_lens_encoder, int32_t *seq_lens_decoder, + int64_t *max_dec_len, int64_t input_ids_stride, int64_t pre_ids_stride, int64_t draft_tokens_stride, @@ -55,8 +56,8 @@ void find_candidate_pred_tokens(const int64_t *input_ids, } } for (int batch_idx = 0; batch_idx < real_batch_size; batch_idx++) { - max_draft_tokens = draft_token_num[batch_idx]; - // int local_draft_tokens = max_draft_tokens; + max_draft_tokens = std::min(static_cast( + draft_token_num[batch_idx]), max_dec_len[batch_idx] - step_idx[batch_idx] - 1); if (seq_lens_encoder[batch_idx] > 0) { continue; } else if (seq_lens_decoder[batch_idx] == 0) { @@ -90,14 +91,7 @@ void find_candidate_pred_tokens(const int64_t *input_ids, continue; } const int64_t *ngram = cur_pre_ids + (cur_step_idx + 1 - ngram_size); -#ifdef _DEBUG - if (batch_idx == 0) { - for (int mm = 0; mm < ngram_size; mm++) { - printf("idx %d: %lld\n", mm, ngram[mm]); - } - } - printf("cur_input_ids_len %d\n", cur_input_ids_len); -#endif + // Iterate through sliding windows of size ngram_size bool match_input = false; for (int64_t i = 0; i <= cur_input_ids_len - ngram_size; ++i) { @@ -114,13 +108,7 @@ void find_candidate_pred_tokens(const int64_t *input_ids, int64_t end_idx = std::min(start_idx + max_draft_tokens, cur_input_ids_len); if (start_idx >= end_idx) continue; -#ifdef _DEBUG - printf("batch_idx:%d. ngram_size:%d. idx:%lld. \n", batch_idx, ngram_size, i); - printf("start:%d. 
end:%d.\n", start_idx, end_idx); -#endif - // Ensure we don't go beyond the length of input_ids and avoid self-match - // if (end_idx <= cur_input_ids_len && start_idx < cur_input_ids_len - ngram_size) { - // Return a pointer to the next num_pred_tokens + int64_t cur_draft_token_num = end_idx - start_idx; seq_lens_this_time[batch_idx] = cur_draft_token_num + 1; @@ -133,15 +121,10 @@ void find_candidate_pred_tokens(const int64_t *input_ids, } } if (!match_input) { -#ifdef _DEBUG - printf("match_input is false so match output\n"); -#endif for (int64_t i = 0; i <= cur_step_idx - ngram_size; ++i) { // Check if the current window matches the ngram bool match = true; -#ifdef _DEBUG - printf("search %d token in pre_ids\n", i); -#endif + for (int j = 0; j < ngram_size; j++) { if (ngram[j] != cur_pre_ids[i + j]) { match = false; @@ -150,26 +133,14 @@ void find_candidate_pred_tokens(const int64_t *input_ids, } if (match) { -#ifdef _DEBUG - printf("%d token in pre_ids matched\n", i); -#endif int64_t start_idx = i + ngram_size; int64_t end_idx = std::min(start_idx + max_draft_tokens, cur_step_idx); int64_t cur_draft_token_num = end_idx - start_idx; if (start_idx >= end_idx) continue; -#ifdef _DEBUG - printf("cur_step_idx %d, start_idx %lld, end_idx %lld, cur_draft_token_num is %lld\n", - cur_step_idx, - start_idx, - end_idx, - cur_draft_token_num); -#endif - seq_lens_this_time[batch_idx] = cur_draft_token_num + 1; memcpy(cur_draft_tokens + 1, cur_pre_ids + start_idx, sizeof(int64_t) * cur_draft_token_num); - // To break the current batch_idx for-loop ngram_size = 0; break; } @@ -188,6 +159,7 @@ void NgramMatch(const paddle::Tensor &input_ids, const paddle::Tensor &seq_lens_this_time, const paddle::Tensor &seq_lens_encoder, const paddle::Tensor &seq_lens_decoder, + const paddle::Tensor &max_dec_len, const int real_batch_size, const int max_ngram_size, const int max_draft_tokens) { @@ -210,6 +182,7 @@ void NgramMatch(const paddle::Tensor &input_ids, const_cast(seq_lens_this_time.data()), const_cast(seq_lens_encoder.data()), const_cast(seq_lens_decoder.data()), + const_cast(max_dec_len.data()), input_ids_stride, pre_ids_stride, draft_tokens_stride, @@ -227,7 +200,8 @@ PD_BUILD_OP(ngram_match) "draft_tokens", "seq_lens_this_time", "seq_lens_encoder", - "seq_lens_decoder"}) + "seq_lens_decoder", + "max_dec_len"}) .Attrs({"real_batch_size: int", "max_ngram_size: int", "max_draft_tokens: int"}) .Outputs({"draft_tokens_out", "seq_lens_this_time_out"}) .SetKernelFn(PD_KERNEL(NgramMatch)) diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_clear_accept_nums.cu b/csrc/gpu/speculate_decoding_kernels/speculate_clear_accept_nums.cu new file mode 100644 index 000000000000..cbcd6c0b5a3f --- /dev/null +++ b/csrc/gpu/speculate_decoding_kernels/speculate_clear_accept_nums.cu @@ -0,0 +1,42 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include "helper.h" + +__global__ void speculate_clear_accept_nums_kernel( + int* accept_num, + const int* seq_lens_decoder, + const int max_bsz + ) { + const int bid = threadIdx.x; + if (bid >= max_bsz) return; + accept_num[bid] = seq_lens_decoder[bid] == 0 ? 0 : accept_num[bid]; + +} + +void SpeculateClearAcceptNums(const paddle::Tensor& accept_num, + const paddle::Tensor& seq_lens_decoder + ) { + // printf("enter clear \n"); + const int max_bsz = seq_lens_decoder.shape()[0]; + speculate_clear_accept_nums_kernel<<<1, 1024, 0, accept_num.stream()>>>(const_cast(accept_num.data()), + seq_lens_decoder.data(), max_bsz); +} + +PD_BUILD_OP(speculate_clear_accept_nums) + .Inputs({"accept_num", + "seq_lens_decoder"}) + .Outputs({"seq_lens_decoder_out"}) + .SetInplaceMap({{"seq_lens_decoder", "seq_lens_decoder_out"}}) + .SetKernelFn(PD_KERNEL(SpeculateClearAcceptNums)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_step.cu b/csrc/gpu/speculate_decoding_kernels/speculate_step.cu deleted file mode 100644 index 8ef2c477adcc..000000000000 --- a/csrc/gpu/speculate_decoding_kernels/speculate_step.cu +++ /dev/null @@ -1,396 +0,0 @@ -// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. 
- -#include "helper.h" - - -__global__ void speculate_free_and_dispatch_block(bool *stop_flags, - int *seq_lens_this_time, - int *seq_lens_decoder, - int *block_tables, - int *encoder_block_lens, - bool *is_block_step, - int *step_block_list, // [bsz] - int *step_len, - int *recover_block_list, - int *recover_len, - int *need_block_list, - int *need_block_len, - int *used_list_len, - int *free_list, - int *free_list_len, - const int bsz, - const int block_size, - const int block_num_per_seq, - const int max_decoder_block_num, - const int max_draft_tokens) { - typedef cub::BlockReduce, 256> BlockReduce; - __shared__ typename BlockReduce::TempStorage temp_storage; - const int tid = threadIdx.x; - if (tid < bsz) { - int *block_table_now = block_tables + tid * block_num_per_seq; - if (stop_flags[tid] && !is_block_step[tid]) { - // 回收block块 - const int encoder_block_len = encoder_block_lens[tid]; - const int decoder_used_len = used_list_len[tid]; - if (decoder_used_len > 0) { - const int ori_free_list_len = atomicAdd(free_list_len, decoder_used_len); -#ifdef DEBUG_STEP - printf("free block seq_id: %d, free block num: %d, encoder_block_len: %d, ori_free_list_len: %d\n", - tid, - decoder_used_len, - encoder_block_len, - ori_free_list_len); -#endif - for (int i = 0; i < decoder_used_len; i++) { - free_list[ori_free_list_len + i] = block_table_now[encoder_block_len + i]; - block_table_now[encoder_block_len + i] = -1; - } - encoder_block_lens[tid] = 0; - used_list_len[tid] = 0; - } - } else if (seq_lens_this_time[tid] != 0 && - block_table_now[(seq_lens_decoder[tid] + max_draft_tokens + 1) / block_size] == -1) { - // 统计需要分配block的位置和总数 - const int ori_need_block_len = atomicAdd(need_block_len, 1); - need_block_list[ori_need_block_len] = tid; -#ifdef DEBUG_STEP - printf("seq_id: %d need block\n", tid); -#endif - } - } - __syncthreads(); - - while (need_block_len[0] > free_list_len[0]) { -#ifdef DEBUG_STEP - if (tid == 0) { - printf("need_block_len: %d, free_list_len: %d\n", need_block_len[0], free_list_len[0]); - } -#endif - // 调度block,根据used_list_len从大到小回收block,直到满足need_block_len,已解码到最后一个block的query不参与调度(马上就结束) - const int used_block_num = - tid < bsz && !is_block_step[tid] - ? 
used_list_len[tid] - : 0; - cub::KeyValuePair kv_pair = {tid, used_block_num}; - kv_pair = BlockReduce(temp_storage).Reduce(kv_pair, cub::ArgMax()); - - if (tid == 0) { - const int encoder_block_len = encoder_block_lens[kv_pair.key]; -#ifdef DEBUG_STEP - printf("max_id: %d, max_num: %d, encoder_block_len: %d\n", - kv_pair.key, - kv_pair.value, - encoder_block_len); -#endif - int *block_table_now = block_tables + kv_pair.key * block_num_per_seq; - for (int i = 0; i < kv_pair.value; i++) { - free_list[free_list_len[0] + i] = block_table_now[encoder_block_len + i]; - block_table_now[encoder_block_len + i] = -1; - } - step_block_list[step_len[0]] = kv_pair.key; - step_len[0] += 1; - free_list_len[0] += kv_pair.value; - stop_flags[kv_pair.key] = true; - is_block_step[kv_pair.key] = true; - seq_lens_this_time[kv_pair.key] = 0; - seq_lens_decoder[kv_pair.key] = 0; - } - __syncthreads(); - } - - // 为需要block的位置分配block,每个位置分配一个block - if (tid < need_block_len[0]) { - const int need_block_id = need_block_list[tid]; - if (!stop_flags[need_block_id]) { - // 如果需要的位置正好是上一步中被释放的位置,不做处理 - used_list_len[need_block_id] += 1; - const int ori_free_list_len = atomicSub(free_list_len, 1); - int *block_table_now = block_tables + need_block_id * block_num_per_seq; -#ifdef DEBUG_STEP - printf("need_block_id %d\n", need_block_id); - printf("ori_free_list_len %d\n", ori_free_list_len); - printf("max_draft_tokens %d\n", max_draft_tokens); - printf("seq_lens_decoder[need_block_id] %d\n", seq_lens_decoder[need_block_id]); - printf("free_list[ori_free_list_len - 1] %d\n", free_list[ori_free_list_len - 1]); -#endif - block_table_now[(seq_lens_decoder[need_block_id] + max_draft_tokens + 1) / block_size] = - free_list[ori_free_list_len - 1]; - } - need_block_list[tid] = -1; - } - __syncthreads(); - - // 计算可以复原的query id - if (tid == 0) { - int ori_free_list_len = free_list_len[0]; - int ori_step_len = step_len[0]; - if (ori_step_len > 0) { - int ori_step_block_id = step_block_list[ori_step_len - 1]; - int tmp_used_len = used_list_len[ori_step_block_id]; - // 比之前调度时多分配一个block,防止马上恢复刚调度的query(比如回收的seq_id在need_block_list中) - int used_len = tmp_used_len < max_decoder_block_num ? tmp_used_len + 1 : tmp_used_len; - while (ori_step_len > 0 && ori_free_list_len >= used_len) { -#ifdef DEBUG_STEP - printf("recover seq_id: %d, free_list_len: %d, used_list_len: %d\n", - ori_step_block_id, ori_free_list_len, used_len); -#endif - recover_block_list[recover_len[0]] = ori_step_block_id; - is_block_step[ori_step_block_id] = false; - used_list_len[ori_step_block_id] = used_len; - ori_free_list_len -= used_len; - step_block_list[ori_step_len - 1] = -1; - step_len[0] -= 1; - recover_len[0] += 1; - ori_step_len = step_len[0]; - if (ori_step_len > 0) { - ori_step_block_id = step_block_list[ori_step_len - 1]; - tmp_used_len = used_list_len[ori_step_block_id]; - used_len = tmp_used_len < max_decoder_block_num ? 
tmp_used_len + 1 : tmp_used_len; - } - } - } - need_block_len[0] = 0; - } -} - -// 根据上一步计算出的可以复原的query_id进行状态恢复 -__global__ void speculate_recover_block(int *recover_block_list, // [bsz] - int *recover_len, - bool *stop_flags, - int *seq_lens_this_time, - int *ori_seq_lens_encoder, - int *seq_lens_encoder, - int *seq_lens_decoder, - int *block_tables, - int *free_list, - int *free_list_len, - int64_t *input_ids, - int64_t *pre_ids, - int64_t *step_idx, - int *encoder_block_lens, - int *used_list_len, - const int64_t *next_tokens, - const int bsz, - const int block_num_per_seq, - const int length, - const int pre_id_length, - const int64_t first_token_ids) { - const int bid = blockIdx.x; - const int tid = threadIdx.x; - __shared__ int ori_free_list_len; - if (bid < recover_len[0]) { - const int recover_id = recover_block_list[bid]; - const int ori_seq_len_encoder = ori_seq_lens_encoder[recover_id]; - const int step_idx_now = step_idx[recover_id]; - const int seq_len = ori_seq_len_encoder + step_idx_now; - const int encoder_block_len = encoder_block_lens[recover_id]; - const int decoder_used_len = used_list_len[recover_id]; - int *block_table_now = block_tables + recover_id * block_num_per_seq; - int64_t *input_ids_now = input_ids + recover_id * length; - int64_t *pre_ids_now = pre_ids + recover_id * pre_id_length; - if (tid == 0) { - seq_lens_this_time[recover_id] = seq_len; - seq_lens_encoder[recover_id] = seq_len; - stop_flags[recover_id] = false; - input_ids_now[ori_seq_len_encoder + step_idx_now - 1] = next_tokens[recover_id]; // next tokens - input_ids_now[0] = first_token_ids; // set first prompt token - const int ori_free_list_len_tid0 = atomicSub(free_list_len, decoder_used_len); - ori_free_list_len = ori_free_list_len_tid0; -#ifdef DEBUG_STEP - printf("seq_id: %d, ori_seq_len_encoder: %d, step_idx_now: %d, seq_len: %d, ori_free_list_len_tid0: %d, " - "ori_free_list_len: %d\n", - recover_id, - ori_seq_len_encoder, - step_idx_now, - seq_len, - ori_free_list_len_tid0, - ori_free_list_len); -#endif - } - __syncthreads(); - // 恢复block table - for (int i = tid; i < decoder_used_len; i += blockDim.x) { - block_table_now[encoder_block_len + i] = free_list[ori_free_list_len - i - 1]; - } - // 恢复input_ids - for (int i = tid; i < step_idx_now - 1; i += blockDim.x) { - input_ids_now[ori_seq_len_encoder + i] = pre_ids_now[i + 1]; - } - } - - if (bid == 0 && tid == 0) { - recover_len[0] = 0; - } -} - -void SpeculateStepPaddle(const paddle::Tensor &stop_flags, - const paddle::Tensor &seq_lens_this_time, - const paddle::Tensor &ori_seq_lens_encoder, - const paddle::Tensor &seq_lens_encoder, - const paddle::Tensor &seq_lens_decoder, - const paddle::Tensor &block_tables, // [bsz, block_num_per_seq] - const paddle::Tensor &encoder_block_lens, - const paddle::Tensor &is_block_step, - const paddle::Tensor &step_block_list, - const paddle::Tensor &step_lens, - const paddle::Tensor &recover_block_list, - const paddle::Tensor &recover_lens, - const paddle::Tensor &need_block_list, - const paddle::Tensor &need_block_len, - const paddle::Tensor &used_list_len, - const paddle::Tensor &free_list, - const paddle::Tensor &free_list_len, - const paddle::Tensor &input_ids, - const paddle::Tensor &pre_ids, - const paddle::Tensor &step_idx, - const paddle::Tensor &next_tokens, - const int block_size, - const int encoder_decoder_block_num, - const int64_t first_token_ids, - const int max_draft_tokens) { - auto cu_stream = seq_lens_this_time.stream(); - const int bsz = seq_lens_this_time.shape()[0]; - const int 
block_num_per_seq = block_tables.shape()[1]; - const int length = input_ids.shape()[1]; - const int pre_id_length = pre_ids.shape()[1]; - constexpr int BlockSize = 256; // bsz <= 256 - const int max_decoder_block_num = pre_id_length / block_size; - // const int max_decoder_block_num = 2048 / block_size - encoder_decoder_block_num; -#ifdef DEBUG_STEP - printf("bsz: %d, block_num_per_seq: %d, length: %d, max_decoder_block_num: %d\n", - bsz, - block_num_per_seq, - length, - max_decoder_block_num); -#endif - speculate_free_and_dispatch_block<<<1, BlockSize, 0, cu_stream>>>( - const_cast(stop_flags.data()), - const_cast(seq_lens_this_time.data()), - const_cast(seq_lens_decoder.data()), - const_cast(block_tables.data()), - const_cast(encoder_block_lens.data()), - const_cast(is_block_step.data()), - const_cast(step_block_list.data()), - const_cast(step_lens.data()), - const_cast(recover_block_list.data()), - const_cast(recover_lens.data()), - const_cast(need_block_list.data()), - const_cast(need_block_len.data()), - const_cast(used_list_len.data()), - const_cast(free_list.data()), - const_cast(free_list_len.data()), - bsz, - block_size, - block_num_per_seq, - max_decoder_block_num, - max_draft_tokens); -#ifdef DEBUG_STEP - cudaDeviceSynchronize(); -#endif - auto cpu_recover_lens = recover_lens.copy_to(paddle::CPUPlace(), false); - const int grid_size = cpu_recover_lens.data()[0]; -#ifdef DEBUG_STEP - printf("grid_size2 %d\n", grid_size); -#endif - if (grid_size > 0) { - speculate_recover_block<<>>( - const_cast(recover_block_list.data()), - const_cast(recover_lens.data()), - const_cast(stop_flags.data()), - const_cast(seq_lens_this_time.data()), - const_cast(ori_seq_lens_encoder.data()), - const_cast(seq_lens_encoder.data()), - const_cast(seq_lens_decoder.data()), - const_cast(block_tables.data()), - const_cast(free_list.data()), - const_cast(free_list_len.data()), - const_cast(input_ids.data()), - const_cast(pre_ids.data()), - const_cast(step_idx.data()), - const_cast(encoder_block_lens.data()), - const_cast(used_list_len.data()), - next_tokens.data(), - bsz, - block_num_per_seq, - length, - pre_id_length, - first_token_ids); -#ifdef DEBUG_STEP - cudaDeviceSynchronize(); -#endif - } -} - -PD_BUILD_OP(speculate_step_paddle) - .Inputs({"stop_flags", - "seq_lens_this_time", - "ori_seq_lens_encoder", - "seq_lens_encoder", - "seq_lens_decoder", - "block_tables", - "encoder_block_lens", - "is_block_step", - "step_block_list", - "step_lens", - "recover_block_list", - "recover_lens", - "need_block_list", - "need_block_len", - "used_list_len", - "free_list", - "free_list_len", - "input_ids", - "pre_ids", - "step_idx", - "next_tokens"}) - .Attrs({"block_size: int", - "encoder_decoder_block_num: int", - "first_token_id: int64_t", - "max_draft_tokens: int"}) - .Outputs({"stop_flags_out", - "seq_lens_this_time_out", - "seq_lens_encoder_out", - "seq_lens_decoder_out", - "block_tables_out", - "encoder_block_lens_out", - "is_block_step_out", - "step_block_list_out", - "step_lens_out", - "recover_block_list_out", - "recover_lens_out", - "need_block_list_out", - "need_block_len_out", - "used_list_len_out", - "free_list_out", - "free_list_len_out", - "input_ids_out"}) - .SetInplaceMap({{"stop_flags", "stop_flags_out"}, - {"seq_lens_this_time", "seq_lens_this_time_out"}, - {"seq_lens_encoder", "seq_lens_encoder_out"}, - {"seq_lens_decoder", "seq_lens_decoder_out"}, - {"block_tables", "block_tables_out"}, - {"encoder_block_lens", "encoder_block_lens_out"}, - {"is_block_step", "is_block_step_out"}, - 
{"step_block_list", "step_block_list_out"}, - {"step_lens", "step_lens_out"}, - {"recover_block_list", "recover_block_list_out"}, - {"recover_lens", "recover_lens_out"}, - {"need_block_list", "need_block_list_out"}, - {"need_block_len", "need_block_len_out"}, - {"used_list_len", "used_list_len_out"}, - {"free_list", "free_list_out"}, - {"free_list_len", "free_list_len_out"}, - {"input_ids", "input_ids_out"}}) - .SetKernelFn(PD_KERNEL(SpeculateStepPaddle)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_update.cu b/csrc/gpu/speculate_decoding_kernels/speculate_update.cu new file mode 100644 index 000000000000..596805d6b61d --- /dev/null +++ b/csrc/gpu/speculate_decoding_kernels/speculate_update.cu @@ -0,0 +1,140 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +template +__global__ void speculate_update(int *seq_lens_encoder, + int *seq_lens_decoder, + bool *not_need_stop, + int64_t *draft_tokens, + int *actual_draft_token_nums, + const int64_t *accept_tokens, + const int *accept_num, + const bool *stop_flags, + const int *seq_lens_this_time, + const bool *is_block_step, + const int real_bsz, + const int max_draft_tokens) { + const int bid = threadIdx.x; + const int accept_num_now = accept_num[bid]; + int stop_flag_now_int = 0; + if (!(is_block_step[bid] || bid >= real_bsz)) { + if (stop_flags[bid]) { + stop_flag_now_int = 1; + } + if (seq_lens_encoder[bid] == 0) { + seq_lens_decoder[bid] += accept_num_now; + } + + if (seq_lens_this_time[bid] > 1 && + seq_lens_encoder[bid] == + 0) { // 对于append模式,需要根据接收与否确定是否要降低下次draft + // token的数量 + auto current_actual_draft_token_num = actual_draft_token_nums[bid]; + if (accept_num_now - 1 == current_actual_draft_token_num) { + if (current_actual_draft_token_num + 2 <= + max_draft_tokens - 1) { + actual_draft_token_nums[bid] = + current_actual_draft_token_num + 2; + } else if (current_actual_draft_token_num + 1 <= + max_draft_tokens - 1) { + actual_draft_token_nums[bid] = + current_actual_draft_token_num + 1; + } else { + actual_draft_token_nums[bid] = max_draft_tokens - 1; + } + } else { + actual_draft_token_nums[bid] = + actual_draft_token_nums[bid] - 1 >= 1 + ? 
actual_draft_token_nums[bid] - 1 + : 1; + } + } + + if (seq_lens_encoder[bid] != 0) { + seq_lens_decoder[bid] += seq_lens_encoder[bid]; + seq_lens_encoder[bid] = 0; + } + if (!stop_flags[bid]) { + draft_tokens[bid * max_draft_tokens] = + accept_tokens[bid * max_draft_tokens + accept_num_now - 1]; + } + if (stop_flag_now_int) { + seq_lens_decoder[bid] = 0; + } + } + __syncthreads(); + typedef cub::BlockReduce BlockReduce; + __shared__ typename BlockReduce::TempStorage temp_storage; + + int64_t stop_sum = BlockReduce(temp_storage).Sum(stop_flag_now_int); + + if (threadIdx.x == 0) { + not_need_stop[0] = stop_sum < real_bsz; + } +} + +void SpeculateUpdate(const paddle::Tensor &seq_lens_encoder, + const paddle::Tensor &seq_lens_decoder, + const paddle::Tensor ¬_need_stop, + const paddle::Tensor &draft_tokens, + const paddle::Tensor &actual_draft_token_nums, + const paddle::Tensor &accept_tokens, + const paddle::Tensor &accept_num, + const paddle::Tensor &stop_flags, + const paddle::Tensor &seq_lens_this_time, + const paddle::Tensor &is_block_step) { + int real_bsz = seq_lens_this_time.shape()[0]; + auto max_draft_tokens = draft_tokens.shape()[1]; + + constexpr int BlockSize = 512; + + speculate_update<<<1, BlockSize, 0, accept_tokens.stream()>>>( + const_cast(seq_lens_encoder.data()), + const_cast(seq_lens_decoder.data()), + const_cast(not_need_stop.data()), + const_cast(draft_tokens.data()), + const_cast(actual_draft_token_nums.data()), + accept_tokens.data(), + accept_num.data(), + stop_flags.data(), + seq_lens_this_time.data(), + is_block_step.data(), + real_bsz, + max_draft_tokens); +} + +PD_BUILD_OP(speculate_update) + .Inputs({"seq_lens_encoder", + "seq_lens_decoder", + "not_need_stop", + "draft_tokens", + "actual_draft_token_nums", + "accept_tokens", + "accept_num", + "stop_flags", + "seq_lens_this_time", + "is_block_step"}) + .Outputs({"seq_lens_encoder_out", + "seq_lens_decoder_out", + "not_need_stop_out", + "draft_tokens_out", + "actual_draft_token_nums_out"}) + .SetInplaceMap({{"seq_lens_encoder", "seq_lens_encoder_out"}, + {"seq_lens_decoder", "seq_lens_decoder_out"}, + {"not_need_stop", "not_need_stop_out"}, + {"draft_tokens", "draft_tokens_out"}, + {"actual_draft_token_nums", "actual_draft_token_nums_out"}}) + .SetKernelFn(PD_KERNEL(SpeculateUpdate)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_verify.cu b/csrc/gpu/speculate_decoding_kernels/speculate_verify.cu new file mode 100644 index 000000000000..e09cf785bb7f --- /dev/null +++ b/csrc/gpu/speculate_decoding_kernels/speculate_verify.cu @@ -0,0 +1,260 @@ +// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
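Besides advancing `seq_lens_decoder` by the number of accepted tokens and recomputing `not_need_stop` with a block reduction, `speculate_update` adapts how many draft tokens the next step proposes. The rule is easier to read in scalar Python; the sketch below mirrors the branch over `actual_draft_token_nums` (the function and variable names are illustrative, not part of the op's interface):

```python
# Sketch of the adaptive draft-length rule in speculate_update (illustrative names).
def next_draft_len(accepted: int, current: int, max_draft_tokens: int) -> int:
    if accepted - 1 == current:                        # every proposed draft token was accepted
        return min(current + 2, max_draft_tokens - 1)  # grow, capped below max_draft_tokens
    return max(current - 1, 1)                         # otherwise back off, keeping at least one draft


for accepted, current in [(6, 5), (3, 5), (1, 1)]:
    print(next_draft_len(accepted, current, max_draft_tokens=8))
# 7 (grow after full acceptance), 4 (back off), 1 (floor)
```

In other words, a fully accepted draft earns up to two extra draft tokens on the next step, while any rejection shrinks the draft by one, never below a single token.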
+ +#include +#include +#include +#include "helper.h" + +__device__ bool is_in(const int64_t *candidates, + const int64_t draft, + const int candidate_len) { + for (int i = 0; i < candidate_len; i++) { + if (draft == candidates[i]) { + return true; + } + } + return false; +} + +static uint64_t seed = 0; +static uint64_t offset = 0; + +__device__ int64_t topp_sampling_kernel(const int64_t *candidate_ids, + const float *candidate_scores, + curandState_t *dev_curand_states, + const int candidate_len, + const float topp) { + const int tid = threadIdx.x; + + float sum_scores = 0.0f; + float rand_top_p = curand_uniform(dev_curand_states + tid) * topp; + for (int i = 0; i < candidate_len; i++) { + sum_scores += candidate_scores[i]; + if (rand_top_p <= sum_scores) { + return candidate_ids[i]; + } + } + return candidate_ids[0]; +} + +__global__ void setup_kernel(curandState_t *state, + const uint64_t seed, + const uint64_t offset, + const int bs, + const bool need_batch_random) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + for (int i = idx; i < bs; i += gridDim.x * blockDim.x) { + if (need_batch_random) { + curand_init(seed, i, offset, &state[i]); + } else { + curand_init(seed, 0, offset, &state[i]); + } + } +} + +__global__ void speculate_verify(int64_t *accept_tokens, + int *accept_num, + int64_t *step_idx, + bool *stop_flags, + const int *seq_lens_encoder, + const int *seq_lens_decoder, + const int64_t *draft_tokens, + const int *actual_draft_token_nums, + curandState_t *dev_curand_states, + const float *topp, + const int *seq_lens_this_time, + const int64_t *verify_tokens, + const float *verify_scores, + const int64_t *max_dec_len, + const int64_t *end_tokens, + const bool *is_block_step, + const int *output_cum_offsets, + const int *actual_candidate_len, + const int real_bsz, + const int max_draft_tokens, + const int end_length, + const int max_seq_len, + const int max_candidate_len, + const int verify_window) { + const int bid = threadIdx.x; + const int start_token_id = bid * max_seq_len - output_cum_offsets[bid]; + int accept_num_now = 1; + int stop_flag_now_int = 0; + + if (!(is_block_step[bid] || bid >= real_bsz)) { + if (stop_flags[bid]) { + stop_flag_now_int = 1; + // 这里 prefill 阶段也会进入,但是因为 draft tokens 会置零,因此会直接到最后的采样阶段 + } else { + auto *verify_tokens_now = + verify_tokens + start_token_id * max_candidate_len; + auto *draft_tokens_now = draft_tokens + bid * max_draft_tokens; + auto *actual_candidate_len_now = + actual_candidate_len + start_token_id; + + int i = 0; + if (seq_lens_encoder[bid] == 0) { + for (; i < seq_lens_this_time[bid] - 1; i++) { + if (verify_tokens_now[i * max_candidate_len] == draft_tokens_now[i + 1]) { + step_idx[bid]++; + auto accept_token = draft_tokens_now[i + 1]; + accept_tokens[bid * max_draft_tokens + i] = + accept_token; + if (is_in_end(accept_token, end_tokens, end_length) || + step_idx[bid] >= max_dec_len[bid]) { + stop_flags[bid] = true; + stop_flag_now_int = 1; + if (step_idx[bid] >= max_dec_len[bid]) + accept_tokens[bid * max_draft_tokens + i] = + end_tokens[0]; + break; + } else { + accept_num_now++; + } + } else { + break; + } + } + } + // sampling 阶段 + // 第一种,draft_token[i+1]被拒绝,需要从 verify_tokens_now[i] 中选一个 + // 第二种,i == seq_lens_this_time[bid]-1, + // 也是从verify_tokens_now[i]中选一个 但是停止的情况不算 + if (!stop_flag_now_int) { + int64_t accept_token; + const float *verify_scores_now = + verify_scores + start_token_id * max_candidate_len; + step_idx[bid]++; + // sampling + auto actual_candidate_len_value = + actual_candidate_len_now[i] > 
max_candidate_len + ? max_candidate_len + : actual_candidate_len_now[i]; + + accept_token = topp_sampling_kernel( + verify_tokens_now + i * max_candidate_len, + verify_scores_now + i * max_candidate_len, + dev_curand_states, + actual_candidate_len_value, + topp[bid]); + + accept_tokens[bid * max_draft_tokens + i] = accept_token; + if (is_in_end(accept_token, end_tokens, end_length) || + step_idx[bid] >= max_dec_len[bid]) { + stop_flags[bid] = true; + stop_flag_now_int = 1; + if (step_idx[bid] >= max_dec_len[bid]) + accept_tokens[bid * max_draft_tokens + i] = + end_tokens[0]; + } + } + accept_num[bid] = accept_num_now; + } + } +} + +void SpeculateVerify(const paddle::Tensor &accept_tokens, + const paddle::Tensor &accept_num, + const paddle::Tensor &step_idx, + const paddle::Tensor &stop_flags, + const paddle::Tensor &seq_lens_encoder, + const paddle::Tensor &seq_lens_decoder, + const paddle::Tensor &draft_tokens, + const paddle::Tensor &seq_lens_this_time, + const paddle::Tensor &verify_tokens, + const paddle::Tensor &verify_scores, + const paddle::Tensor &max_dec_len, + const paddle::Tensor &end_tokens, + const paddle::Tensor &is_block_step, + const paddle::Tensor &output_cum_offsets, + const paddle::Tensor &actual_candidate_len, + const paddle::Tensor &actual_draft_token_nums, + const paddle::Tensor &topp, + int max_seq_len, + int verify_window) { + // printf("Enter speculate update\n"); + auto bsz = accept_tokens.shape()[0]; + int real_bsz = seq_lens_this_time.shape()[0]; + auto max_draft_tokens = draft_tokens.shape()[1]; + auto end_length = end_tokens.shape()[0]; + auto max_candidate_len = verify_tokens.shape()[1]; + + constexpr int BlockSize = 512; + + curandState_t *dev_curand_states; + cudaMalloc(&dev_curand_states, sizeof(curandState_t) * bsz); + setup_kernel<<<1, BlockSize, 0, accept_tokens.stream()>>>( + dev_curand_states, seed, offset, bsz, true); + seed++; + offset++; + + speculate_verify<<<1, BlockSize, 0, accept_tokens.stream()>>>( + const_cast(accept_tokens.data()), + const_cast(accept_num.data()), + const_cast(step_idx.data()), + const_cast(stop_flags.data()), + seq_lens_encoder.data(), + seq_lens_decoder.data(), + draft_tokens.data(), + actual_draft_token_nums.data(), + dev_curand_states, + topp.data(), + seq_lens_this_time.data(), + verify_tokens.data(), + verify_scores.data(), + max_dec_len.data(), + end_tokens.data(), + is_block_step.data(), + output_cum_offsets.data(), + actual_candidate_len.data(), + real_bsz, + max_draft_tokens, + end_length, + max_seq_len, + max_candidate_len, + verify_window); + + + cudaFree(dev_curand_states); +} + +PD_BUILD_OP(speculate_verify) + .Inputs({"accept_tokens", + "accept_num", + "step_idx", + "seq_lens_encoder", + "seq_lens_decoder", + "stop_flags", + "draft_tokens", + "seq_lens_this_time", + "verify_tokens", + "verify_scores", + "max_dec_len", + "end_tokens", + "is_block_step", + "output_cum_offsets", + "actual_candidate_len", + "actual_draft_token_nums", + "topp"}) + .Outputs({"accept_tokens_out", + "accept_num_out", + "step_idx_out", + "stop_flags_out"}) + .Attrs({"max_seq_len: int", "verify_window: int", "enable_topp: bool"}) + .SetInplaceMap({{"accept_tokens", "accept_tokens_out"}, + {"accept_num", "accept_num_out"}, + {"step_idx", "step_idx_out"}, + {"stop_flags", "stop_flags_out"}}) + .SetKernelFn(PD_KERNEL(SpeculateVerify)); \ No newline at end of file diff --git a/csrc/gpu/speculate_decoding_kernels/speculate_verify_and_update.cu b/csrc/gpu/speculate_decoding_kernels/speculate_verify_and_update.cu deleted file mode 100644 index 
aac0a0c9fdac..000000000000 --- a/csrc/gpu/speculate_decoding_kernels/speculate_verify_and_update.cu +++ /dev/null @@ -1,451 +0,0 @@ -// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "helper.h" -#include -#include -#include - -__device__ bool is_in(const int64_t* candidates, const int64_t draft, const int candidate_len) { - for (int i = 0; i < candidate_len; i++) { - if (draft == candidates[i]) { - return true; - } - } - return false; -} - -static uint64_t seed = 0; -static uint64_t offset = 0; - -__device__ int64_t topp_sampling_kernel(const int64_t* candidate_ids, - const float* candidate_scores, - curandState_t* dev_curand_states, - const int candidate_len, - const float topp) { - - const int tid = threadIdx.x; - - float sum_scores = 0.0f; - float rand_top_p = curand_uniform(dev_curand_states + tid) * topp; - for (int i = 0; i < candidate_len; i++) { - sum_scores += candidate_scores[i]; - if (rand_top_p <= sum_scores) { - return candidate_ids[i]; - } - } - return candidate_ids[0]; -} - -__global__ void setup_kernel(curandState_t* state, - const uint64_t seed, - const uint64_t offset, - const int bs, - const bool need_batch_random) { - int idx = blockIdx.x * blockDim.x + threadIdx.x; - for (int i = idx; i < bs; i += gridDim.x * blockDim.x) { - if (need_batch_random) { - curand_init(seed, i, offset, &state[i]); - } else { - curand_init(seed, 0, offset, &state[i]); - } - } -} - -template -__global__ void speculate_verify_and_update_kernel(int64_t* accept_tokens, - int* accept_num, - int64_t* step_idx, - int* seq_lens_encoder, - int* seq_lens_decoder, - bool* stop_flags, - bool* not_need_stop, - int64_t* draft_tokens, - int* actual_draft_token_nums, - curandState_t* dev_curand_states, - const float* topp, - const int* seq_lens_this_time, - const int64_t* verify_tokens, - const float* verify_scores, - const int64_t* max_dec_len, - const int64_t* end_tokens, - const bool* is_block_step, - const int* output_cum_offsets, - const int* actual_candidate_len, - const int real_bsz, - const int max_draft_tokens, - const int end_length, - const int max_seq_len, - const int max_candidate_len, - const int verify_window) { - const int bid = threadIdx.x; - // start token's id of bid batch - const int start_token_id = bid * max_seq_len - output_cum_offsets[bid]; - // verify and set stop flags - int accept_num_now = 1; - int stop_flag_now_int = 0; - - if (!(is_block_step[bid] || bid >= real_bsz)) { - - if (stop_flags[bid]) { - stop_flag_now_int = 1; - } else { // Here the prefill stage also goes in, but since the draft tokens are zero in prefill stage, it goes straight to the final sampling stage. 
- auto* verify_tokens_now = verify_tokens + start_token_id * max_candidate_len; - auto* draft_tokens_now = draft_tokens + bid * max_draft_tokens; - auto* actual_candidate_len_now = actual_candidate_len + start_token_id; - - int i = 0; - for (; i < seq_lens_this_time[bid] - 1; i++) { - if (USE_TOPK) { - if (verify_tokens_now[i * max_candidate_len] == draft_tokens_now[i + 1]) { - accept_num_now++; - step_idx[bid]++; - auto accept_token = draft_tokens_now[i + 1]; - accept_tokens[bid * max_draft_tokens + i] = accept_token; - if (is_in_end(accept_token, end_tokens, end_length) || step_idx[bid] >= max_dec_len[bid]) { - stop_flags[bid] = true; - stop_flag_now_int = 1; - if (step_idx[bid] >= max_dec_len[bid]) - accept_tokens[bid * max_draft_tokens + i] = end_tokens[0]; - break; - } - } else { - break; - } - } else { - auto actual_candidate_len_value = actual_candidate_len_now[i] > max_candidate_len - ? max_candidate_len - : actual_candidate_len_now[i]; - if (is_in(verify_tokens_now + i * max_candidate_len, - draft_tokens_now[i + 1], - actual_candidate_len_value)) { - // Top P verify - accept_num_now++; - step_idx[bid]++; - auto accept_token = draft_tokens_now[i + 1]; - accept_tokens[bid * max_draft_tokens + i] = accept_token; - if (is_in_end(accept_token, end_tokens, end_length) || step_idx[bid] >= max_dec_len[bid]) { - stop_flags[bid] = true; - stop_flag_now_int = 1; - if (step_idx[bid] >= max_dec_len[bid]) - accept_tokens[bid * max_draft_tokens + i] = end_tokens[0]; - break; - } - } else { - // TopK verify - int ii = i; - if (max_candidate_len >= 2 && - verify_tokens_now[ii * max_candidate_len + 1] == draft_tokens_now[ii + 1]) { // top-2 - int j = 0; - ii += 1; - for (; j < verify_window && ii < seq_lens_this_time[bid] - 1; j++, ii++) { - if (verify_tokens_now[ii * max_candidate_len] != draft_tokens_now[ii + 1]) { - break; - } - } - if (j >= verify_window) { // accept all - accept_num_now += verify_window + 1; - step_idx[bid] += verify_window + 1; - for (; i < ii; i++) { - auto accept_token = draft_tokens_now[i + 1]; - accept_tokens[bid * max_draft_tokens + i] = accept_token; - if (is_in_end(accept_token, end_tokens, end_length) || - step_idx[bid] >= max_dec_len[bid]) { - stop_flags[bid] = true; - stop_flag_now_int = 1; - if (step_idx[bid] >= max_dec_len[bid]) - accept_tokens[bid * max_draft_tokens + i] = end_tokens[0]; - break; - } - } - } - } - break; - } - } - } - - if (!stop_flag_now_int) { - int64_t accept_token; - const float* verify_scores_now = verify_scores + start_token_id * max_candidate_len; - if (ENABLE_TOPP) { - auto actual_candidate_len_value = actual_candidate_len_now[i] > max_candidate_len - ? max_candidate_len - : actual_candidate_len_now[i]; - accept_token = topp_sampling_kernel(verify_tokens_now + i * max_candidate_len, - verify_scores_now + i * max_candidate_len, - dev_curand_states, - actual_candidate_len_value, - topp[bid]); - } else { - accept_token = verify_tokens_now[i * max_candidate_len]; - } - accept_tokens[bid * max_draft_tokens + i] = accept_token; - if (is_in_end(accept_token, end_tokens, end_length) || step_idx[bid] >= max_dec_len[bid]) { - stop_flags[bid] = true; - stop_flag_now_int = 1; - if (step_idx[bid] >= max_dec_len[bid]) - accept_tokens[bid * max_draft_tokens + i] = end_tokens[0]; - } - step_idx[bid]++; - } - - seq_lens_decoder[bid] += accept_num_now; - - // For append mode, determine whether to reduce the number of draft tokens depending on whether they are received or not. 
- if (seq_lens_this_time[bid] > 1 && seq_lens_encoder[bid] == 0) { - auto current_actual_draft_token_num = actual_draft_token_nums[bid]; - if (accept_num_now - 1 == current_actual_draft_token_num) { - if (current_actual_draft_token_num + 2 <= max_draft_tokens - 1) { - actual_draft_token_nums[bid] = current_actual_draft_token_num + 2; - } else if (current_actual_draft_token_num + 1 <= max_draft_tokens - 1) { - actual_draft_token_nums[bid] = current_actual_draft_token_num + 1; - } else { - actual_draft_token_nums[bid] = max_draft_tokens - 1; - } - } else { - actual_draft_token_nums[bid] = - actual_draft_token_nums[bid] - 1 >= 1 ? actual_draft_token_nums[bid] - 1 : 1; - } - } - - if (seq_lens_encoder[bid] != 0) { - seq_lens_decoder[bid] = seq_lens_encoder[bid]; - seq_lens_encoder[bid] = 0; - } - - accept_num[bid] = accept_num_now; - draft_tokens[bid * max_draft_tokens] = accept_tokens[bid * max_draft_tokens + accept_num_now - 1]; - } - } - if (stop_flag_now_int) { - seq_lens_decoder[bid] = 0; - } - - __syncthreads(); - typedef cub::BlockReduce BlockReduce; - __shared__ typename BlockReduce::TempStorage temp_storage; - - int64_t stop_sum = BlockReduce(temp_storage).Sum(stop_flag_now_int); - - if (threadIdx.x == 0) { - not_need_stop[0] = stop_sum < real_bsz; - } -} - -void SpeculateVerifyAndUpdate(const paddle::Tensor& accept_tokens, - const paddle::Tensor& accept_num, - const paddle::Tensor& step_idx, - const paddle::Tensor& seq_lens_encoder, - const paddle::Tensor& seq_lens_decoder, - const paddle::Tensor& stop_flags, - const paddle::Tensor& not_need_stop, - const paddle::Tensor& draft_tokens, - const paddle::Tensor& seq_lens_this_time, - const paddle::Tensor& verify_tokens, - const paddle::Tensor& verify_scores, - const paddle::Tensor& max_dec_len, - const paddle::Tensor& end_tokens, - const paddle::Tensor& is_block_step, - const paddle::Tensor& output_cum_offsets, - const paddle::Tensor& actual_candidate_len, - const paddle::Tensor& actual_draft_token_nums, - const paddle::Tensor& topp, - int max_seq_len, - int verify_window, - bool enable_topp) { - auto bsz = accept_tokens.shape()[0]; - int real_bsz = seq_lens_this_time.shape()[0]; - auto max_draft_tokens = draft_tokens.shape()[1]; - auto end_length = end_tokens.shape()[0]; - auto max_candidate_len = verify_tokens.shape()[1]; - - constexpr int BlockSize = 512; - - curandState_t* dev_curand_states; - cudaMalloc(&dev_curand_states, sizeof(curandState_t) * bsz); - setup_kernel<<<1, BlockSize, 0, accept_tokens.stream()>>>(dev_curand_states, seed, offset, bsz, true); - seed++; - offset++; - - auto err = cudaDeviceSynchronize(); - if (err != 0) { - printf("err %d\n", err); - } - - err = cudaGetLastError(); - - if (err != 0) { - printf("err %d\n", err); - } - - bool use_topk = false; - char* env_var = getenv("SPECULATE_VERIFY_USE_TOPK"); - if (env_var) { - use_topk = (bool)std::stoi(env_var); - } - if (use_topk) { - if (enable_topp) { - speculate_verify_and_update_kernel - <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast(accept_tokens.data()), - const_cast(accept_num.data()), - const_cast(step_idx.data()), - const_cast(seq_lens_encoder.data()), - const_cast(seq_lens_decoder.data()), - const_cast(stop_flags.data()), - const_cast(not_need_stop.data()), - const_cast(draft_tokens.data()), - const_cast(actual_draft_token_nums.data()), - dev_curand_states, - topp.data(), - seq_lens_this_time.data(), - verify_tokens.data(), - verify_scores.data(), - max_dec_len.data(), - end_tokens.data(), - is_block_step.data(), - output_cum_offsets.data(), - 
actual_candidate_len.data(), - real_bsz, - max_draft_tokens, - end_length, - max_seq_len, - max_candidate_len, - verify_window); - } else { - speculate_verify_and_update_kernel - <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast(accept_tokens.data()), - const_cast(accept_num.data()), - const_cast(step_idx.data()), - const_cast(seq_lens_encoder.data()), - const_cast(seq_lens_decoder.data()), - const_cast(stop_flags.data()), - const_cast(not_need_stop.data()), - const_cast(draft_tokens.data()), - const_cast(actual_draft_token_nums.data()), - dev_curand_states, - topp.data(), - seq_lens_this_time.data(), - verify_tokens.data(), - verify_scores.data(), - max_dec_len.data(), - end_tokens.data(), - is_block_step.data(), - output_cum_offsets.data(), - actual_candidate_len.data(), - real_bsz, - max_draft_tokens, - end_length, - max_seq_len, - max_candidate_len, - verify_window); - } - } else { - if (enable_topp) { - speculate_verify_and_update_kernel - <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast(accept_tokens.data()), - const_cast(accept_num.data()), - const_cast(step_idx.data()), - const_cast(seq_lens_encoder.data()), - const_cast(seq_lens_decoder.data()), - const_cast(stop_flags.data()), - const_cast(not_need_stop.data()), - const_cast(draft_tokens.data()), - const_cast(actual_draft_token_nums.data()), - dev_curand_states, - topp.data(), - seq_lens_this_time.data(), - verify_tokens.data(), - verify_scores.data(), - max_dec_len.data(), - end_tokens.data(), - is_block_step.data(), - output_cum_offsets.data(), - actual_candidate_len.data(), - real_bsz, - max_draft_tokens, - end_length, - max_seq_len, - max_candidate_len, - verify_window); - } else { - speculate_verify_and_update_kernel - <<<1, BlockSize, 0, accept_tokens.stream()>>>(const_cast(accept_tokens.data()), - const_cast(accept_num.data()), - const_cast(step_idx.data()), - const_cast(seq_lens_encoder.data()), - const_cast(seq_lens_decoder.data()), - const_cast(stop_flags.data()), - const_cast(not_need_stop.data()), - const_cast(draft_tokens.data()), - const_cast(actual_draft_token_nums.data()), - dev_curand_states, - topp.data(), - seq_lens_this_time.data(), - verify_tokens.data(), - verify_scores.data(), - max_dec_len.data(), - end_tokens.data(), - is_block_step.data(), - output_cum_offsets.data(), - actual_candidate_len.data(), - real_bsz, - max_draft_tokens, - end_length, - max_seq_len, - max_candidate_len, - verify_window); - } - } - - cudaFree(dev_curand_states); -} - -PD_BUILD_OP(speculate_verify_and_update) - .Inputs({"accept_tokens", - "accept_num", - "step_idx", - "seq_lens_encoder", - "seq_lens_decoder", - "stop_flags", - "not_need_stop", - "draft_tokens", - "seq_lens_this_time", - "verify_tokens", - "verify_scores", - "max_dec_len", - "end_tokens", - "is_block_step", - "output_cum_offsets", - "actual_candidate_len", - "actual_draft_token_nums", - "topp"}) - .Outputs({"accept_tokens_out", - "accept_num_out", - "step_idx_out", - "seq_lens_encoder_out", - "seq_lens_decoder_out", - "stop_flags_out", - "not_need_stop_out", - "draft_tokens_out"}) - .Attrs({"max_seq_len: int", "verify_window: int", "enable_topp: bool"}) - .SetInplaceMap({{"accept_tokens", "accept_tokens_out"}, - {"accept_num", "accept_num_out"}, - {"step_idx", "step_idx_out"}, - {"seq_lens_encoder", "seq_lens_encoder_out"}, - {"seq_lens_decoder", "seq_lens_decoder_out"}, - {"stop_flags", "stop_flags_out"}, - {"not_need_stop", "not_need_stop_out"}, - {"draft_tokens", "draft_tokens_out"}}) - .SetKernelFn(PD_KERNEL(SpeculateVerifyAndUpdate)); \ No 
newline at end of file diff --git a/csrc/gpu/step.cu b/csrc/gpu/step.cu index f17cd2e421df..b2091464181a 100644 --- a/csrc/gpu/step.cu +++ b/csrc/gpu/step.cu @@ -31,10 +31,12 @@ __global__ void free_and_dispatch_block(bool *stop_flags, int *used_list_len, int *free_list, int *free_list_len, + int64_t *first_token_ids, const int bsz, const int block_size, const int block_num_per_seq, - const int max_decoder_block_num) { + const int max_decoder_block_num, + const int speculate_step_token_num) { typedef cub::BlockReduce, 512> BlockReduce; __shared__ typename BlockReduce::TempStorage temp_storage; const int tid = threadIdx.x; @@ -42,6 +44,7 @@ __global__ void free_and_dispatch_block(bool *stop_flags, int *block_table_now = block_tables + tid * block_num_per_seq; if (stop_flags[tid] && !is_block_step[tid]) { // 回收block块 + first_token_ids[tid] = -1; const int encoder_block_len = encoder_block_lens[tid]; const int decoder_used_len = used_list_len[tid]; if (decoder_used_len > 0) { @@ -57,7 +60,7 @@ __global__ void free_and_dispatch_block(bool *stop_flags, encoder_block_lens[tid] = 0; used_list_len[tid] = 0; } - } else if (seq_lens_decoder[tid] != 0 && block_table_now[seq_lens_decoder[tid] / block_size] == -1) { + } else if (seq_lens_decoder[tid] != 0 && block_table_now[(seq_lens_decoder[tid] + speculate_step_token_num) / block_size] == -1) { // 统计需要分配block的位置和总数 const int ori_need_block_len = atomicAdd(need_block_len, 1); need_block_list[ori_need_block_len] = tid; @@ -110,7 +113,7 @@ __global__ void free_and_dispatch_block(bool *stop_flags, used_list_len[need_block_id] += 1; const int ori_free_list_len = atomicSub(free_list_len, 1); int *block_table_now = block_tables + need_block_id * block_num_per_seq; - block_table_now[seq_lens_decoder[need_block_id] / block_size] = free_list[ori_free_list_len - 1]; + block_table_now[(seq_lens_decoder[need_block_id] + speculate_step_token_num) / block_size] = free_list[ori_free_list_len - 1]; } need_block_list[tid] = -1; } @@ -165,11 +168,11 @@ __global__ void recover_block(int *recover_block_list, // [bsz] int *encoder_block_lens, int *used_list_len, const int64_t *next_tokens, + const int64_t *first_token_ids, const int bsz, const int block_num_per_seq, const int length, - const int pre_id_length, - const int first_token_id) { + const int pre_id_length) { const int bid = blockIdx.x; const int tid = threadIdx.x; __shared__ int ori_free_list_len; @@ -188,7 +191,8 @@ __global__ void recover_block(int *recover_block_list, // [bsz] seq_lens_encoder[recover_id] = seq_len; stop_flags[recover_id] = false; input_ids_now[ori_seq_len_encoder + step_idx_now - 1] = next_tokens[recover_id]; // next tokens - input_ids_now[0] = first_token_id; // set first prompt token + input_ids_now[0] = + first_token_ids[recover_id]; // set first prompt token const int ori_free_list_len_tid0 = atomicSub(free_list_len, decoder_used_len); ori_free_list_len = ori_free_list_len_tid0; #ifdef DEBUG_STEP @@ -233,9 +237,10 @@ void StepPaddle(const paddle::Tensor& stop_flags, const paddle::Tensor& pre_ids, const paddle::Tensor& step_idx, const paddle::Tensor& next_tokens, + const paddle::Tensor &first_token_ids, const int block_size, const int encoder_decoder_block_num, - const int64_t first_token_id) { + const int speculate_step_token_num) { auto cu_stream = seq_lens_this_time.stream(); const int bsz = seq_lens_this_time.shape()[0]; const int block_num_per_seq = block_tables.shape()[1]; @@ -262,10 +267,12 @@ void StepPaddle(const paddle::Tensor& stop_flags, const_cast(used_list_len.data()), 
const_cast(free_list.data()), const_cast(free_list_len.data()), + const_cast(first_token_ids.data()), bsz, block_size, block_num_per_seq, - max_decoder_block_num + max_decoder_block_num, + speculate_step_token_num ); #ifdef DEBUG_STEP #ifdef PADDLE_WITH_HIP @@ -297,11 +304,11 @@ void StepPaddle(const paddle::Tensor& stop_flags, const_cast(encoder_block_lens.data()), const_cast(used_list_len.data()), next_tokens.data(), + first_token_ids.data(), bsz, block_num_per_seq, length, - pre_id_length, - first_token_id + pre_id_length ); #ifdef DEBUG_STEP #ifdef PADDLE_WITH_HIP @@ -334,10 +341,11 @@ PD_BUILD_OP(step_paddle) "input_ids", "pre_ids", "step_idx", - "next_tokens"}) + "next_tokens", + "first_token_ids",}) .Attrs({"block_size: int", "encoder_decoder_block_num: int", - "first_token_id: int64_t"}) + "speculate_step_token_num: int"}) .Outputs({"stop_flags_out", "seq_lens_this_time_out", "seq_lens_encoder_out", @@ -354,7 +362,8 @@ PD_BUILD_OP(step_paddle) "used_list_len_out", "free_list_out", "free_list_len_out", - "input_ids_out"}) + "input_ids_out", + "first_token_ids_out",}) .SetInplaceMap({{"stop_flags", "stop_flags_out"}, {"seq_lens_this_time", "seq_lens_this_time_out"}, {"seq_lens_encoder", "seq_lens_encoder_out"}, @@ -371,5 +380,6 @@ PD_BUILD_OP(step_paddle) {"used_list_len", "used_list_len_out"}, {"free_list", "free_list_out"}, {"free_list_len", "free_list_len_out"}, - {"input_ids", "input_ids_out"}}) + {"input_ids", "input_ids_out"}, + {"first_token_ids", "first_token_ids_out"}}) .SetKernelFn(PD_KERNEL(StepPaddle)); \ No newline at end of file diff --git a/csrc/paddlenlp_ops/__init__.py b/csrc/paddlenlp_ops/__init__.py new file mode 100644 index 000000000000..afed279ac87c --- /dev/null +++ b/csrc/paddlenlp_ops/__init__.py @@ -0,0 +1,40 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import importlib + +import paddle + +from paddlenlp.utils.log import logger + +cuda_version = float(paddle.version.cuda()) +SUPPORTED_SM_VERSIONS = {70, 75, 80, 86, 89, 90} if cuda_version >= 12.4 else {70, 75, 80, 86, 89} + + +def get_sm_version(): + prop = paddle.device.cuda.get_device_properties() + cc = prop.major * 10 + prop.minor + return cc + + +sm_version = get_sm_version() +if sm_version not in SUPPORTED_SM_VERSIONS: + raise RuntimeError("Unsupported SM version") +module_name = f"paddlenlp_ops.sm{sm_version}" + +try: + module = importlib.import_module(module_name) + globals().update(vars(module)) +except ImportError: + logger.WARNING(f"No {module_name} ") diff --git a/csrc/paddlenlp_ops/sm70/__init__.py b/csrc/paddlenlp_ops/sm70/__init__.py new file mode 100644 index 000000000000..b3507acbba74 --- /dev/null +++ b/csrc/paddlenlp_ops/sm70/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.utils.log import logger + +try: + from .paddlenlp_ops_70 import * +except ImportError: + logger.WARNING("No paddlenlp_ops_70 ops") diff --git a/csrc/paddlenlp_ops/sm75/__init__.py b/csrc/paddlenlp_ops/sm75/__init__.py new file mode 100644 index 000000000000..667f5061f0e9 --- /dev/null +++ b/csrc/paddlenlp_ops/sm75/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.utils.log import logger + +try: + from .paddlenlp_ops_75 import * +except ImportError: + logger.WARNING("No paddlenlp_ops_75 ops") diff --git a/csrc/paddlenlp_ops/sm80/__init__.py b/csrc/paddlenlp_ops/sm80/__init__.py new file mode 100644 index 000000000000..6bfec0821b27 --- /dev/null +++ b/csrc/paddlenlp_ops/sm80/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.utils.log import logger + +try: + from .paddlenlp_ops_80 import * +except ImportError: + logger.WARNING("No paddlenlp_ops_80 ops") diff --git a/csrc/paddlenlp_ops/sm86/__init__.py b/csrc/paddlenlp_ops/sm86/__init__.py new file mode 100644 index 000000000000..47a614e6c81f --- /dev/null +++ b/csrc/paddlenlp_ops/sm86/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
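These per-SM `__init__.py` stubs all follow the same pattern: try to import the architecture-specific extension (`paddlenlp_ops_70`, `paddlenlp_ops_75`, …) and log a warning if it is missing, while the new top-level `paddlenlp_ops/__init__.py` selects the sub-package from the device's compute capability and raises for unsupported SM versions. The hypothetical helper below (not part of the package) just mirrors that lookup so a caller can see which sub-module would be loaded on the current GPU:

```python
# Hypothetical helper (illustrative only) mirroring the SM-version lookup that
# paddlenlp_ops/__init__.py performs when choosing which prebuilt ops to import.
import paddle


def expected_ops_module() -> str:
    prop = paddle.device.cuda.get_device_properties()
    sm = prop.major * 10 + prop.minor  # e.g. compute capability 8.0 -> 80
    return f"paddlenlp_ops.sm{sm}"


print(expected_ops_module())  # e.g. "paddlenlp_ops.sm80" on an A100
```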
+ +from paddlenlp.utils.log import logger + +try: + from .paddlenlp_ops_86 import * +except ImportError: + logger.WARNING("No paddlenlp_ops_86 ops") diff --git a/csrc/paddlenlp_ops/sm89/__init__.py b/csrc/paddlenlp_ops/sm89/__init__.py new file mode 100644 index 000000000000..32f36383e056 --- /dev/null +++ b/csrc/paddlenlp_ops/sm89/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.utils.log import logger + +try: + from .paddlenlp_ops_89 import * +except ImportError: + logger.WARNING("No paddlenlp_ops_89 ops") diff --git a/csrc/paddlenlp_ops/sm90/__init__.py b/csrc/paddlenlp_ops/sm90/__init__.py new file mode 100644 index 000000000000..5a5ba3a1da85 --- /dev/null +++ b/csrc/paddlenlp_ops/sm90/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.utils.log import logger + +try: + from .paddlenlp_ops_90 import * +except ImportError: + logger.WARNING("No paddlenlp_ops_90 ops") diff --git a/csrc/setup.py b/csrc/setup.py new file mode 100644 index 000000000000..bc5e8dde8834 --- /dev/null +++ b/csrc/setup.py @@ -0,0 +1,73 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
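Because each `smXX` stub only logs a warning when its compiled module cannot be imported, `import paddlenlp_ops` can succeed yet expose no ops on a build that lacks the matching extension. A defensive availability check, shown here only as a sketch with an assumed helper name, lets callers verify this up front:

```python
# Defensive check (illustrative only): confirm the architecture-specific extension
# actually resolved, since the smXX stubs merely warn when it is missing.
import importlib.util


def ops_built_for(sm_version: int) -> bool:
    name = f"paddlenlp_ops.sm{sm_version}.paddlenlp_ops_{sm_version}"
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, RuntimeError):
        # paddlenlp_ops is not installed, or its __init__ rejected the local SM version.
        return False


print(ops_built_for(80))
```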
+ +""" setup for EfficentLLM """ + +import os + +from setuptools import find_packages, setup + +description = "Paddlenlp_ops : inference framework implemented based on PaddlePaddle" +VERSION = "0.0.0" + + +def read(file: str): + """ + read file and return content + """ + current_dir = os.path.dirname(__file__) + path = os.path.join(current_dir, file) + with open(path, "r", encoding="utf-8") as f: + content = f.read().strip() + return content + + +def read_version(): + """ + read version and return content + """ + return VERSION + + +def read_readme(): + """ + read README.md and return content + """ + return read("README.md") + + +setup( + name="paddlenlp_ops", + packages=find_packages(), + version="0.0.0", + author="Paddle Infernce Team", + author_email="paddle-inference@baidu.com", + description=description, + long_description=read_readme(), + long_description_content_type="text/markdown", + url="", + python_requires=">=3.8", + package_dir={"paddlenlp_ops": "paddlenlp_ops/"}, + package_data={"paddlenlp_ops": ["sm70/*", "sm75/*", "sm80/*", "sm86/*", "sm89/*", "sm90/*"]}, + include_package_data=True, + classifiers=[ + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "License :: OSI Approved :: Apache Software License", + "Operating System :: OS Independent", + ], + license="Apache 2.0", +) diff --git a/csrc/setup_cuda.py b/csrc/setup_cuda.py index 6e0ce8e20658..7f240cdcf575 100644 --- a/csrc/setup_cuda.py +++ b/csrc/setup_cuda.py @@ -19,6 +19,8 @@ import paddle from paddle.utils.cpp_extension import CUDAExtension, setup +sm_version = int(os.getenv("CUDA_SM_VERSION", "0")) + def update_git_submodule(): try: @@ -38,9 +40,12 @@ def find_end_files(directory, end_str): def get_sm_version(): - prop = paddle.device.cuda.get_device_properties() - cc = prop.major * 10 + prop.minor - return cc + if sm_version > 0: + return sm_version + else: + prop = paddle.device.cuda.get_device_properties() + cc = prop.major * 10 + prop.minor + return cc def strtobool(v): @@ -77,8 +82,6 @@ def get_gencode_flags(): gencode_flags = get_gencode_flags() library_path = os.environ.get("LD_LIBRARY_PATH", "/usr/local/cuda/lib64") -sm_version = get_sm_version() - sources = [ "./gpu/save_with_output.cc", "./gpu/set_value_by_flags.cu", @@ -103,6 +106,8 @@ def get_gencode_flags(): "./gpu/step.cu", "./gpu/quant_int8.cu", "./gpu/dequant_int8.cu", + "./gpu/get_position_ids.cu", + "./gpu/fused_rotary_position_encoding.cu", "./gpu/flash_attn_bwd.cc", "./gpu/tune_cublaslt_gemm.cu", "./gpu/sample_kernels/top_p_sampling_reject.cu", @@ -174,8 +179,9 @@ def get_gencode_flags(): "gpu/fp8_gemm_with_cutlass/fp8_fp8_fp8_dual_gemm.cu", ] +ops_name = f"paddlenlp_ops_{sm_version}" if sm_version != 0 else "paddlenlp_ops" setup( - name="paddlenlp_ops", + name=ops_name, ext_modules=CUDAExtension( sources=sources, extra_compile_args={"cxx": ["-O3"], "nvcc": nvcc_compile_args}, diff --git a/csrc/tools/build_wheel.sh b/csrc/tools/build_wheel.sh new file mode 100644 index 000000000000..ab93c7120c5b --- /dev/null +++ b/csrc/tools/build_wheel.sh @@ -0,0 +1,190 @@ +#!/usr/bin/env bash + +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +PYTHON_VERSION=python +PYTHON_VERSION=${1:-$PYTHON_VERSION} +export python=$PYTHON_VERSION + +# directory config +DIST_DIR="dist" +BUILD_DIR="build" +EGG_DIR="paddlenlp_ops.egg-info" + +# custom_ops directory config +OPS_SRC_DIR="./" +OPS_BUILD_DIR="build" +OPS_EGG_DIR="paddlenlp_ops_*.egg-info" +# OPS_TMP_DIR_BASE="tmp_base" +OPS_TMP_DIR="tmp_*" + +# TEST_DIR="tests" + +# command line log config +RED='\033[0;31m' +BLUE='\033[0;34m' +GREEN='\033[1;32m' +BOLD='\033[1m' +NONE='\033[0m' + + +function python_version_check() { + PY_MAIN_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $1}'` + PY_SUB_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $2}'` + echo -e "find python version ${PY_MAIN_VERSION}.${PY_SUB_VERSION}" + if [ $PY_MAIN_VERSION -ne "3" -o $PY_SUB_VERSION -lt "8" ]; then + echo -e "${RED}FAIL:${NONE} please use Python >= 3.8 !" + exit 1 + fi +} + +function init() { + echo -e "${BLUE}[init]${NONE} removing building directory..." + rm -rf $DIST_DIR $BUILD_DIR $EGG_DIR + if [ `${python} -m pip list | grep paddlenlp_ops | wc -l` -gt 0 ]; then + echo -e "${BLUE}[init]${NONE} uninstalling paddlenlp_ops..." + ${python} -m pip uninstall -y paddlenlp_ops + fi + + ${python} -m pip install setuptools_scm + echo -e "${BLUE}[init]${NONE} ${GREEN}init success\n" +} + +function generate_sm_versions_and_build_ops() { + cuda_version=`${python} -c "import paddle; print(float(paddle.version.cuda()))"` + echo "CUDA version is: $cuda_version" + if echo "$cuda_version >= 12.4" | awk '{if ($0) exit 0; exit 1}'; then + sm_versions=(70 80 80 86 89 90 ) + else + sm_versions=(70 75 80 86 89 ) + fi + + for sm_version in "${sm_versions[@]}"; do + echo "Building and installing for sm_version: $sm_version" + build_and_install_ops $sm_version + done + return +} + +function copy_ops(){ + local sm_version="$1" + OPS_VERSION="0.0.0" + PY_MAIN_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $1}'` + PY_SUB_VERSION=`${python} -V 2>&1 | awk '{print $2}' | awk -F '.' '{print $2}'` + PY_VERSION="py${PY_MAIN_VERSION}.${PY_SUB_VERSION}" + SYSTEM_VERSION=`${python} -c "import platform; print(platform.system().lower())"` + PROCESSER_VERSION=`${python} -c "import platform; print(platform.processor())"` + WHEEL_NAME="paddlenlp_ops_${sm_version}-${OPS_VERSION}-${PY_VERSION}-${SYSTEM_VERSION}-${PROCESSER_VERSION}.egg" + echo -e "gpu ops -- paddlenlp_ops_${sm_version} ..." + cp -r ./tmp_${sm_version}/${WHEEL_NAME}/* ./paddlenlp_ops/sm${sm_version} + return +} + +function build_and_install_ops() { + local sm_version="$1" + cd $OPS_SRC_DIR + export no_proxy=bcebos.com,paddlepaddle.org.cn,${no_proxy} + echo -e "${BLUE}[build]${NONE} build and install paddlenlp_ops_sm${sm_version} ops..." + CUDA_SM_VERSION=${sm_version} ${python} setup_cuda.py install --install-lib tmp_${sm_version} + echo -e "${BLUE}[build]${NONE} build and install paddlenlp_ops_${sm_version}..." + if [ $? -ne 0 ]; then + echo -e "${RED}[FAIL]${NONE} build paddlenlp_ops_${sm_version} wheel failed !" 
+ exit 1 + fi + echo -e "${BLUE}[build]${NONE} ${GREEN}build paddlenlp_ops_sm${sm_version} wheel success\n" + + copy_ops "${sm_version}" +} + +function build_and_install() { + echo -e "${BLUE}[build]${NONE} building paddlenlp_ops wheel..." + ${python} setup.py bdist_wheel + if [ $? -ne 0 ]; then + echo -e "${RED}[FAIL]${NONE} build paddlenlp_ops wheel failed !" + exit 1 + fi + echo -e "${BLUE}[build]${NONE} ${GREEN}build paddlenlp_ops wheel success\n" + + echo -e "${BLUE}[install]${NONE} installing paddlenlp_ops..." + cd $DIST_DIR + find . -name "paddlenlp_ops*.whl" | xargs ${python} -m pip install + if [ $? -ne 0 ]; then + cd .. + echo -e "${RED}[FAIL]${NONE} install paddlenlp_ops wheel failed !" + exit 1 + fi + echo -e "${BLUE}[install]${NONE} ${GREEN}paddlenlp_ops install success\n" + cd .. +} + + +function unittest() { + # run UT + echo -e "${BLUE}[unittest]${NONE} ${GREEN}unittests success\n${NONE}" +} + +function cleanup() { + rm -rf $BUILD_DIR $EGG_DIR + ${python} -m pip uninstall -y paddlenlp_ops + + rm -rf $OPS_SRC_DIR/$BUILD_DIR $OPS_SRC_DIR/$EGG_DIR $OPS_SRC_DIR/$OPS_TMP_DIR +} + +function abort() { + echo -e "${RED}[FAIL]${NONE} build wheel and unittest failed ! + please check your code" 1>&2 + + cur_dir=`basename "$pwd"` + + rm -rf $BUILD_DIR $EGG_DIR $DIST_DIR + ${python} -m pip uninstall -y paddlenlp_ops + + rm -rf $OPS_SRC_DIR/$OPS_BUILD_DIR $OPS_SRC_DIR/$OPS_EGG_DIR $OPS_SRC_DIR/$OPS_TMP_DIR +} + +python_version_check + +trap 'abort' 0 +set -e + +init +generate_sm_versions_and_build_ops +build_and_install +unittest +cleanup + +# get Paddle version +PADDLE_VERSION=`${python} -c "import paddle; print(paddle.version.full_version)"` +PADDLE_COMMIT=`${python} -c "import paddle; print(paddle.version.commit)"` + +# get paddlenlp_ops version +EFFLLM_BRANCH=`git rev-parse --abbrev-ref HEAD` +EFFLLM_COMMIT=`git rev-parse --short HEAD` + +# get Python version +PYTHON_VERSION=`${python} -c "import platform; print(platform.python_version())"` + +echo -e "\n${GREEN}paddlenlp_ops wheel compiled and checked success !${NONE} + ${BLUE}Python version:${NONE} $PYTHON_VERSION + ${BLUE}Paddle version:${NONE} $PADDLE_VERSION ($PADDLE_COMMIT) + ${BLUE}paddlenlp_ops branch:${NONE} $EFFLLM_BRANCH ($EFFLLM_COMMIT)\n" + +echo -e "${GREEN}wheel saved under${NONE} ${RED}${BOLD}./dist${NONE}" + +# install wheel +${python} -m pip install ./dist/paddlenlp_ops*.whl +echo -e "${GREEN}wheel install success!${NONE}\n" + +trap 0 \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index 7234c729bcfc..2e5299367819 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -49,18 +49,15 @@ :maxdepth: 1 :caption: 飞桨大模型 - 大模型预训练文档 - 大模型精调文档 - 大模型FlashMask算法 - 大模型常用算法文档 - 大模型RLHF文档 - 大模型量化教程 - 大模型推理教程 - 大模型统一存储文档 - 混合并行训练教程 - 模型权重转换教程 - 大模型DPO文档 - + 飞桨大模型主文档 + 大模型-预训练文档 + 大模型-精调文档 + 大模型-DPO文档 + 大模型-RLHF文档 + 大模型-推理部署教程 + 大模型-量化教程 + 大模型-高级技术文档 + .. toctree:: :maxdepth: 1 :caption: 模型库 diff --git a/docs/llm/docs/advanced.rst b/docs/llm/docs/advanced.rst new file mode 100644 index 000000000000..fd237eaed1f0 --- /dev/null +++ b/docs/llm/docs/advanced.rst @@ -0,0 +1,15 @@ +============ +大模型技术文档 +============ + +.. 
toctree:: + :maxdepth: 1 + + unified_checkpoint.md + llm_trainer.rst + flashmask.md + mergekit.md + chat_template.md + torch2paddle.md + + diff --git a/docs/llm/llm_trainer.rst b/docs/llm/docs/llm_trainer.rst similarity index 100% rename from docs/llm/llm_trainer.rst rename to docs/llm/docs/llm_trainer.rst diff --git a/docs/llm/docs/predict/devices.rst b/docs/llm/docs/predict/devices.rst new file mode 100644 index 000000000000..e0d8e5bbb92b --- /dev/null +++ b/docs/llm/docs/predict/devices.rst @@ -0,0 +1,13 @@ +============ +大模型异构设备推理 +============ + +.. toctree:: + :maxdepth: 1 + + 昆仑 XPU <../../devices/xpu/llama/README.md> + 昇腾 NPU <../../devices/npu/llama/README.md> + 海光 K100 <../dcu_install.md> + 燧原 GCU <../../devices/gcu/llama/README.md> + 太初 SDAA <../../devices/sdaa/llama/README.md> + X86 CPU <../cpu_install.md> \ No newline at end of file diff --git a/docs/llm/docs/predict/index.rst b/docs/llm/docs/predict/index.rst new file mode 100644 index 000000000000..6f4a986b2239 --- /dev/null +++ b/docs/llm/docs/predict/index.rst @@ -0,0 +1,14 @@ +============ +大模型推理 +============ + +.. toctree:: + :maxdepth: 1 + + installation.md + inference.md + ../../server/docs/deploy_usage_tutorial.md + best_practices.md + speculative_decoding.md + 各个模型推理量化教程 + 大模型异构设备推理 \ No newline at end of file diff --git a/docs/llm/docs/predict/models.rst b/docs/llm/docs/predict/models.rst new file mode 100644 index 000000000000..907ef5948b99 --- /dev/null +++ b/docs/llm/docs/predict/models.rst @@ -0,0 +1,10 @@ +============ +各个模型推理、量化教程 +============ + +.. toctree:: + :maxdepth: 1 + + llama.md + qwen.md + mixtral.md \ No newline at end of file diff --git a/docs/llm/server/index.rst b/docs/llm/server/index.rst new file mode 100644 index 000000000000..2731969abfd4 --- /dev/null +++ b/docs/llm/server/index.rst @@ -0,0 +1,9 @@ +============ +大模型服务化模型部署 +============ + +.. toctree:: + :maxdepth: 1 + + README.md + docs/deploy_usage_tutorial.md \ No newline at end of file diff --git a/fix_time b/fix_time new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/llm/README.md b/llm/README.md index c5cb84602e4a..a216d6f9fb2a 100644 --- a/llm/README.md +++ b/llm/README.md @@ -37,6 +37,11 @@ ## 🚀 快速开始 🚀 +开始之前,您可以安装先 PaddleNLP 最新 develop 版本: +```shell +pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html +``` + ### 1. 
预训练 PaddleNLP 将飞桨4D 并行策略加入到 Trainer API 中, 用户只需修改 Trainer 配置即可使用不同的分布式策略。目前大模型套件提供[LLaMA/LLaMA2/LLaMA3](./config/llama)、[GPT-3](./config/gpt-3)、[Qwen](./config/qwen)、[Baichuan/Baichuan2](./config/baichuan)、[Mixtral](./config/mixtral) 等模型预训练功能,更多模型支持持续更新中。 @@ -73,19 +78,30 @@ mkdir data mv llama_openwebtext_100k.bin ./data mv llama_openwebtext_100k.idx ./data ``` +单卡训练: +```shell +# 16G 显存可训练 +python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json +``` +- 该配置16G 显存可训练,可以开启 use_flash_attention,use_fused_rms_norm,recompute 进一步省显存 +- 如果上述配置无法开启,或显存依然不够,可以开启`offload_optim`,此时显存约为11G `python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json --offload_optim 1` +高性能、多卡、多机训练: ```shell # 编译自定义算子,可选 cd ../slm/model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd - -# 模型预训练参考 -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json +# 多卡模型预训练参考: +python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json +# 多机训练参考: 占用45G显存左右 +python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" --master=192.168.1.1:8090 --nnodes=2 run_pretrain.py ./config/llama/pretrain_argument.json ``` +- 更详细的分布式启动命令请参考[这里](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.6/api/paddle/distributed/launch_cn.html#launch)。 注意: 1. 建议使用 paddle develop 版本训练,需要安装`pip install fast_dataindex visualdl==2.5.3`等相关缺失 whl 包 -2. `use_flash_attention` 需要在 A100机器开启,建议使用 cuda11.8环境。 +2. `use_flash_attention` 需要在 A100 以上机器开启,建议使用 cuda11.8以上环境。 3. `use_fused_rms_norm` 需要安装自定义算子。如果安装后仍然找不到算子,需要额外设置 PYTHONPATH 4. `continue_training` 表示从现有的预训练模型加载训练。7b 模型初始 loss 大概为2.xx, 随机初始化模型 loss 从11.x 左右下降。 5. 多机训练时,若各机器使用的训练数据文件位置相同(例如挂载共享硬盘情况),请指定`--share_folder true`使全局0号卡制作缓存数据。否则默认各台机器的0号卡独立制作缓存数据, @@ -125,29 +141,45 @@ PaddleNLP 支持多个主流大模型的 SFT、PEFT 等精调策略,提供统 为了方便测试,我们也提供了[tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)demo 数据集可以直接使用: ```shell +# 在 PaddleNLP/llm 目录执行 wget https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz tar -xvf alpaca_demo.gz ``` #### 2.2 全参精调:SFT +单卡 +```bash +# 需要12G显存左右 +python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json +# 单卡性能最佳实践,16G显存,可以参考打开开关。 +# ./config/qwen/sft_argument_0p5b_best.json +``` + +多卡 ```bash -# SFT 启动命令参考 -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json +# SFT 启动命令参考,需要45G显存左右 +python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_finetune.py ./config/qwen/sft_argument.json ``` #### 2.3 LoRA +LoRA 启动命令参考 ```bash -# LoRA 启动命令参考 -python run_finetune.py ./config/llama/lora_argument.json +# 需要9G左右显存 +python run_finetune.py ./config/qwen/lora_argument_0p5b.json +# 需要29G左右显存 +python run_finetune.py ./config/qwen/lora_argument.json ``` #### 2.4 Prefix Tuning +Prefix Tuning 启动命令参考 ```bash -# Prefix Tuning 启动命令参考 -python run_finetune.py ./config/llama/pt_argument.json +# 需要10G左右显存 +python run_finetune.py ./config/qwen/pt_argument_0p5b.json +# 需要30G左右显存 +python run_finetune.py ./config/qwen/pt_argument.json ``` 除了 LoRA、Prefix Tuning 外,还支持 LoKr、VeRA、MoRA、ReFT、rsLoRA、LoRA+、PiSSA、MoSLoRA 等多种精调算法,更多大模型精调使用文档、训练细节和效果请参见[大模型精调教程](./docs/finetune.md)。 @@ -192,18 +224,26 @@ tar -zxvf ultrafeedback_binarized.tar.gz ##### 全参 DPO + ```bash -# DPO 启动命令参考 -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json +# DPO 启动命令参考, 8卡训练, 需要大概40G显存 +python -u -m 
paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json + +# 单卡训练,大概需要26G显存左右 +python -u ./alignment/dpo/run_dpo.py ./config/qwen/dpo_argument_0p5b.json ``` ##### LoRA DPO ```bash # DPO 启动命令参考 -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json +python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json ``` 更多 DPO 技术细节和使用说明详见[DPO 文档](./docs/dpo.md)。 +```bash +# 需要52G左右显存 +python -u ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json +``` #### 3.2 KTO @@ -240,13 +280,13 @@ tar -zxvf ultrafeedback_binarized.tar.gz ```bash # KTO 启动命令参考 -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json +python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json ``` ##### LoRA KTO ```bash # KTO 启动命令参考 -python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json +python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json ``` #### 3.3 RLHF @@ -322,16 +362,30 @@ PaddleNLP 提供高性能推理,内置动态插入和全环节算子融合策

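+完成下文的自定义算子安装后,可以用下面的 Python 小片段粗略自检 `paddlenlp_ops` 是否已可正常导入(仅为示意代码,包名以实际安装为准):
+
+```python
+# 示意:检查高性能自定义算子包是否安装成功
+try:
+    import paddlenlp_ops  # noqa: F401
+    print("paddlenlp_ops 已安装,可在推理命令中开启 --inference_model")
+except ImportError:
+    print("未检测到 paddlenlp_ops,请先按下文教程在 csrc/ 目录编译安装")
+```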
+ + +paddlenlp_ops 安装高性能推理算子教程(可选) +```shell +cd ../csrc/ +python setup_cuda.py install +cd - +``` + ```shell # 动态图模型推理命令参考 -python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 +python ./predict/predictor.py --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct --inference_model --dtype float16 # 静态图模型推理命令参考 # step1 : 静态图导出 -python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 +python ./predict/export_model.py --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct --inference_model --output_path ./inference --dtype float16 # step2: 静态图推理 python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" ``` +参数说明: +1. **`--inference_model`** 参数表示使用高性能自定义算子推理,否则使用普通动态图推理(如果可以安装算子,建议打开此开关)。打开时,请前往[此处安装](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/csrc)高性能推理自定义算子, +2. **`--mode`** 有两种模式可选 `dynamic`, `static`。分别表示动态图和静态图模式。静态图模型需要进行参数导出步骤,动态图不需要。具体可以参考上述命令执行。静态图情况下,导出和推理的参数`--inference_model`需要一致。 +3. 推理速度简要比较。`static+inference_model` > `dynamic+inference_model` >> `static w/o inference_model` > `dynamic w/o inference_mode`。推荐安装高性能算子,使用 `动态图+inference_model` 模式,方便快捷。 + 更多模型推理使用方法详见[大模型推理文档](./docs/predict/inference.md)。 @@ -341,35 +395,80 @@ python ./predict/predictor.py --model_name_or_path ./inference --inference_model 我们提供了一套基于动态图推理的简单易用 UI 服务化部署方法,用户可以快速部署服务化推理。 +请确保,在部署前请确保已正确安装 NLP,clone 本 repo 下位置代码。以及自定义算子库。本部署的服务是兼容 OpenAI API 接口 + + + 环境准备 - python >= 3.8 - gradio - flask +- paddlenlp_ops (可选,高性能自定义加速算子, 安装参考[这里](#paddlenlpops)) 服务化部署脚本 ```shell -python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./predict/flask_server.py \ - --model_name_or_path meta-llama/Llama-2-7b-chat \ +# 单卡,可以使用 paddle.distributed.launch 启动多卡推理 +python ./predict/flask_server.py \ + --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \ --port 8010 \ --flask_port 8011 \ --dtype "float16" ``` -- `port`: Gradio UI 服务端口号,默认8011。 -- `flask_port`: Flask 服务端口号,默认8010。 +- `port`: Gradio UI 服务端口号,默认8010。 +- `flask_port`: Flask 服务端口号,默认8011。 - 其他参数请参见[推理文档](./docs/predict/inference.md)中推理参数配置。 -此外,如果想通过 API 脚本的方式跑推理,可参考:`./predict/request_flask_server.py` 文件。 +图形化界面: 打开 `http://127.0.0.1:8010` 即可使用 gradio 图形化界面,即可开启对话。 +API 访问: 您也可用通过 flask 服务化 API 的形式 + +1. 可参考:`./predict/request_flask_server.py` 文件访问。 +```shell +python predict/request_flask_server.py +``` + +2. 或者直接使用 curl,调用开始对话 +```shell +curl 127.0.0.1:8011/v1/chat/completions \ +-H 'Content-Type: application/json' \ +-d '{"message": [{"role": "user", "content": "你好"}]}' +``` +3. 
使用 OpenAI 客户端调用: +```python +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:8011/v1/", +) + +# Completion API +stream = True +completion = client.chat.completions.create( + model="paddlenlp", + messages=[ + {"role": "user", "content": "PaddleNLP好厉害!这句话的感情色彩是?"} + ], + max_tokens=1024, + stream=stream, +) + +if stream: + for c in completion: + print(c.choices[0].delta.content, end="") +else: + print(completion.choices[0].message.content) +``` #### 7.2 大模型服务化部署工具 -该部署工具是基于英伟达Triton框架专为服务器场景的大模型服务化部署而设计。它提供了支持gRPC、HTTP协议的服务接口,以及流式Token输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化(PTQ)等加速优化策略,为用户带来易用且高性能的部署体验。 +该部署工具是基于英伟达 Triton 框架专为服务器场景的大模型服务化部署而设计。它提供了支持 gRPC、HTTP 协议的服务接口,以及流式 Token 输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化(PTQ)等加速优化策略,为用户带来易用且高性能的部署体验。 -基于预编译镜像部署,本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例,更多模型请参考[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md), 更细致的模型推理、量化教程可以参考[大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md): +基于预编译镜像部署,本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例,更细致的模型推理、量化教程可以参考[大模型推理教程](./docs/predict/inference.md): ```shell # 下载模型 @@ -396,7 +495,8 @@ curl 127.0.0.1:9965/v1/chat/completions \ Note: 1. 请保证 shm-size >= 5,不然可能会导致服务启动失败 -更多关于该部署工具的使用方法,请查看[服务化部署流程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/server/docs/deploy_usage_tutorial.md) +更多模型请参考[LLaMA](./docs/predict/llama.md)、[Qwen](./docs/predict/qwen.md)、[Mixtral](./docs/predict/mixtral.md)。 +更多关于该部署工具的使用方法,请查看[服务化部署流程](./server/docs/deploy_usage_tutorial.md) ### 8. 
PyTorch 模型权重转换 diff --git a/llm/alignment/dpo/run_dpo.py b/llm/alignment/dpo/run_dpo.py index bc58fb51f612..8588a9a1bad6 100644 --- a/llm/alignment/dpo/run_dpo.py +++ b/llm/alignment/dpo/run_dpo.py @@ -43,7 +43,6 @@ LlamaForCausalLMPipe, Qwen2ForCausalLM, Qwen2ForCausalLMPipe, - register_sequence_parallel_allreduce_hooks, ) from paddlenlp.transformers.configuration_utils import LlmMetaConfig from paddlenlp.trl import ( @@ -128,6 +127,8 @@ def main(): if training_args.pipeline_parallel_degree > 1: model_class = AutoModelForCausalLMPipe + if not dpo_config.reference_free and not dpo_config.lora: + ref_model_config.dpo_config = dpo_config model_config.dpo_config = dpo_config else: model_class = AutoModelForCausalLM @@ -154,10 +155,6 @@ def main(): if model_args.flash_mask and not any(isinstance(model, cls) for cls in flash_mask_support_list): raise NotImplementedError(f"{model.__class__} not support flash mask.") - if training_args.sequence_parallel: - register_sequence_parallel_allreduce_hooks( - model, training_args.gradient_accumulation_steps, training_args.fuse_sequence_parallel_allreduce - ) if model_args.tokenizer_name_or_path is not None: tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) else: diff --git a/llm/alignment/kto/run_kto.py b/llm/alignment/kto/run_kto.py index 94fe99dd8b75..01f4d17debab 100644 --- a/llm/alignment/kto/run_kto.py +++ b/llm/alignment/kto/run_kto.py @@ -38,7 +38,6 @@ LlamaForCausalLM, LlamaForCausalLMPipe, Qwen2ForCausalLM, - register_sequence_parallel_allreduce_hooks, ) from paddlenlp.transformers.configuration_utils import LlmMetaConfig from paddlenlp.trl import ( @@ -140,10 +139,6 @@ def main(): if model_args.flash_mask and not any(isinstance(model, cls) for cls in flash_mask_support_list): raise NotImplementedError(f"{model.__class__} not support flash mask.") - if training_args.sequence_parallel: - register_sequence_parallel_allreduce_hooks( - model, training_args.gradient_accumulation_steps, training_args.fuse_sequence_parallel_allreduce - ) if model_args.tokenizer_name_or_path is not None: tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) else: diff --git a/llm/alignment/ppo/ppo_trainer.py b/llm/alignment/ppo/ppo_trainer.py index c2c72d6c5cd1..bdec462411e0 100644 --- a/llm/alignment/ppo/ppo_trainer.py +++ b/llm/alignment/ppo/ppo_trainer.py @@ -66,6 +66,7 @@ speed_metrics, ) from paddlenlp.transformers import PretrainedModel, PretrainedTokenizer +from paddlenlp.utils import empty_device_cache class StepTrainer(Trainer): @@ -1032,7 +1033,7 @@ def gen_epoch_data(): ptx_batches = [None for _ in range(len(rl_batches))] self.timers and self.timers("ptx-batch").stop() - paddle.device.cuda.empty_cache() + empty_device_cache() self.set_train() for _ in range(self.args.update_iters): @@ -1152,7 +1153,7 @@ def train( # ##### model and optimizer related setting ##### policy_model, value_model = self.init_train_model_opt(max_steps, resume_from_checkpoint) - paddle.device.cuda.empty_cache() + empty_device_cache() # ##### traing statistic logging ##### # Number of trainable parameters only account for policy_model @@ -1208,7 +1209,7 @@ def train( # with self.enable(self.value_trainer.optimizer): with self.enable(): # put value optimizer guard in rl_step rl_info = self.rl_step(rl_batch) - paddle.device.cuda.empty_cache() + empty_device_cache() self.timers and self.timers("rl_step").stop() if self.use_ptx: @@ -1224,7 +1225,7 @@ def train( ptx_info = self.ptx_step(ptx_batch) rl_info.update(ptx_info) self.timers and 
self.timers("ptx_step").stop() - paddle.device.cuda.empty_cache() + empty_device_cache() self.state.global_step += 1 self.state.epoch = epoch + (step + 1) / steps_in_epoch diff --git a/llm/alignment/rm/flashmask/reward_argument.py b/llm/alignment/rm/flashmask/reward_argument.py index 06097a6d7cdc..0151a1995895 100644 --- a/llm/alignment/rm/flashmask/reward_argument.py +++ b/llm/alignment/rm/flashmask/reward_argument.py @@ -86,7 +86,3 @@ class ModelArgument: default=1, metadata={"help": "virtual_pp_degree"}, ) - sequence_parallel: bool = field( - default=False, - metadata={"help": "whether to use sequence parallel"}, - ) diff --git a/llm/alignment/rm/flashmask/run_reward.py b/llm/alignment/rm/flashmask/run_reward.py index e03cf4d9db43..c620b5429116 100644 --- a/llm/alignment/rm/flashmask/run_reward.py +++ b/llm/alignment/rm/flashmask/run_reward.py @@ -35,11 +35,7 @@ get_last_checkpoint, set_seed, ) -from paddlenlp.transformers import ( - AutoConfig, - AutoTokenizer, - register_sequence_parallel_allreduce_hooks, -) +from paddlenlp.transformers import AutoConfig, AutoTokenizer from paddlenlp.utils.log import logger @@ -126,10 +122,6 @@ def main(): logger.warning("`flash_mask` must use with zero padding and flash attention.") model.config.use_flash_attention = True - if model_args.sequence_parallel: - register_sequence_parallel_allreduce_hooks( - model, training_args.gradient_accumulation_steps, training_args.fuse_sequence_parallel_allreduce - ) if model_args.tokenizer_name_or_path is not None: tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) else: diff --git a/llm/application/information_extraction/README.md b/llm/application/information_extraction/README.md new file mode 100644 index 000000000000..29b542f1beaa --- /dev/null +++ b/llm/application/information_extraction/README.md @@ -0,0 +1,419 @@ +# 通用信息抽取大模型 PP-UIE + + **目录** + +- [1. 模型简介](#模型简介) +- [2. 开箱即用](#开箱即用) + - [2.1 实体抽取](#实体抽取) + - [2.2 关系抽取](#关系抽取) + - [2.3 模型选择](#模型选择) + - [2.4 更多配置](#更多配置) +- [3. 训练定制](#训练定制) + - [3.1 代码结构](#代码结构) + - [3.2 数据标注](#数据标注) + - [3.3 模型微调](#模型微调) + - [3.4 定制模型一键预测](#定制模型一键预测) + - [3.5 实验指标](#实验指标) + + + +## 1. 模型简介 + +通用信息抽取大模型(PP-UIE)是 PaddleNLP 团队基于开源模型和高质量数据集构建的通用信息抽取大模型, PaddleNLP 基于百度 UIE 的建模思路,通过大模型的能力来训练并开源了一款面向中、英文通用信息抽取的大模型。 支持统一训练信息抽取任务包括命名实体识别(NER),关系抽取(RE)和事件抽取(EE)。模型共包含0.5B、1.5B、7B 和14B 共4个版本,以适配不同场景下信息抽取任务使用。在多个数据集(包含 Boson、CLUENER、CCIR2021等常见数据)相比其他通用信息抽取大模型在 ACC 和 F1 指标上有大幅度提升。 + + + + + +## 2. 
开箱即用 + +```paddlenlp.Taskflow```提供通用信息抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本中的对应信息。**实现开箱即用,并满足各类信息抽取需求** + + + +#### 2.1 实体抽取 + + 命名实体识别(Named Entity Recognition,简称 NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。 + + - 例如抽取的目标实体类型是"时间"、"选手"和"赛事名称", schema 构造如下: + + ```text + ['时间', '选手', '赛事名称'] + ``` + + 调用示例: + + ```python + from pprint import pprint + from paddlenlp import Taskflow + + schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction + ie = Taskflow('information_extraction', + schema= ['时间', '选手', '赛事名称'], + schema_lang="zh", + batch_size=1, + model='paddlenlp/PP-UIE-0.5B') + pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint + # 输出 + [{'时间': [{'text': '2月8日上午'}], + '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + '选手': [{'text': '谷爱凌'}]}] + ``` + + + + +#### 2.2 关系抽取 + + 关系抽取(Relation Extraction,简称 RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。 + + - 例如以"竞赛名称"作为抽取主体,抽取关系类型为"主办方"、"承办方"和"时间", schema 构造如下: + + ```text + { + '竞赛名称': [ + '主办方', + '承办方', + '时间' + ] + } + ``` + + 调用示例: + + ```python + schema = {'竞赛名称': ['主办方', '承办方', '时间']} # Define the schema for relation extraction + ie.set_schema(schema) # Reset schema + pprint(ie('2022年语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。')) + # 输出 + [{'竞赛名称': [{'relations': {'主办方': [{'text': '中国中文信息学会,中国计算机学会'}], + '时间': [{'text': '2022年'}], + '承办方': [{'text': '百度公司,中国中文信息学会评测工作委员会,中国计算机学会自然语言处理专委会'}]}, + 'text': '语言与智能技术竞赛'}]}] + ``` + + + +#### 2.3 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 结构 | 语言 | + | :---: | :--------: | :--------: | + | `paddlenlp/PP-UIE-0.5B` | 24-layers, 896-hidden, 14-heads | 中、英文 | + | `paddlenlp/PP-UIE-1.5B` | 28-layers, 1536-hidden, 12-heads | 中、英文 | + | `paddlenlp/PP-UIE-7B` | 28-layers, 3584-hidden, 28-heads | 中、英文 | + | `paddlenlp/PP-UIE-14B` | 48-layers, 5120-hidden, 40-heads | 中、英文 | + + + +#### 2.4 更多配置 + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema = {'竞赛名称': ['主办方', '承办方', '时间']}, + schema_lang="zh", + batch_size=1, + model='paddlenlp/PP-UIE-0.5B', + precision='float16') +``` + +* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。 +* `schema_lang`:设置 schema 的语言,默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同,因此需要指定 schema 的语言。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `model`:选择任务使用的模型,默认为`paddlenlp/PP-UIE-0.5B`,可选有`paddlenlp/PP-UIE-0.5B`, `paddlenlp/PP-UIE-1.5B`, `paddlenlp/PP-UIE-7B`, `paddlenlp/PP-UIE-14B`。 +* `precision`:选择模型精度,默认为`float16`,可选有`float16`、`bfloat16`和`float32`和。如果选择`float16`,在 GPU 硬件环境下,请先确保机器正确安装 NVIDIA 相关驱动和基础软件,**确保 CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保 GPU 设备的 CUDA 计算能力(CUDA Compute Capability)大于7.0,典型的设备包括 V100、T4、A10、A100、GTX 20系列和30系列显卡等。如果选择`bfloat16`,能有效加速处理大模型和批量数据,尤其与混合精度结合使用时性能表现更优。但需确保硬件和软件环境支持该精度。支持 `bfloat16`的硬件包括 NVIDIA A100 和 H100 GPU,同时需要确保使用 CUDA>=11.2、cuDNN>=8.1.1 等软件环境。更多关于 CUDA Compute Capability 和精度支持情况请参考 NVIDIA 文档:[GPU 硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 + + +除此之外,也可通过以下代码快速调用模型并进行推理 + +```python +from paddlenlp.transformers import AutoModelForCausalLM +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.generation import GenerationConfig +from paddlenlp.trl import llm_utils + +model_id = "paddlenlp/PP-UIE-0.5B" + +model = AutoModelForCausalLM.from_pretrained(model_id, use_flash_attention=False) 
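+# 以下切换到推理模式,并加载分词器(padding_side="left" 便于批量生成)与生成配置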
+model.eval() +tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left") +generation_config = GenerationConfig.from_pretrained(model_id) + + +template = """ +你是一个阅读理解专家,请提取所给句子与问题,提取实体。请注意,如果存在实体,则一定在原句中逐字出现,请输出对应实体的原文,不要进行额外修改;如果无法提取,请输出“无相应实体”。 + **句子开始** + {sentence} + **句子结束** + **问题开始** + {prompt} + **问题结束** + **回答开始** + """ + +sentences = [ + "如有单位或个人对公示人员申请廉租住房保障资格有异议的,可以信件和电话的形式向市住建局举报,监督电话:5641079", + "姓名:张三,年龄:30岁,手机:13854488452,性别:男,家庭住址:北京市海淀区西北旺", + "张三,30岁,13854488452,男,北京市海淀区西北旺", +] + +prompts = [ + "电话号码", + "姓名,年龄,手机号码,性别,地址", + "姓名", +] + +inputs = [template.format(sentence=sentence, prompt=prompt) for sentence, prompt in zip(sentences, prompts)] +inputs = [tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in inputs] +input_features = tokenizer( + inputs, + max_length=512, + return_position_ids=False, + truncation=True, + truncation_side="left", + padding=True, + return_tensors="pd", + add_special_tokens=False, +) + +outputs = model.generate( + **input_features, + max_new_tokens=200, + bos_token_id=tokenizer.bos_token_id, + eos_token_id=llm_utils.get_eos_token_id(tokenizer, generation_config), + pad_token_id=tokenizer.pad_token_id, + decode_strategy="greedy_search", + temperature=1.0, + top_k=1, + top_p=1.0, + repetition_penalty=1.0, +) + + +def get_clean_entity(text): + ind1 = text.find("\n**回答结束**\n\n") + if ind1 != -1: + pred = text[:ind1] + else: + pred = text + return pred + + +results = tokenizer.batch_decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False) +results = [get_clean_entity(result) for result in results] + +for sentence, prompt, result in zip(sentences, prompts, results): + print("-" * 50) + print(f"Sentence: {sentence}") + print(f"Prompt: {prompt}") + print(f"Result: {result}") +``` + + + +## 3. 训练定制 + +对于简单的抽取目标可以直接使用 ```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用轻定制功能(标注少量数据进行模型微调)以进一步提升效果。下面通过`报销工单信息抽取`的例子展示如何通过几十条训练数据进行 PP-UIE 模型微调。 + + + +#### 3.1 代码结构 + +```shell +. +├── utils.py # 数据处理工具 +├── doccano.py # 数据标注脚本 +├── doccano.md # 数据标注文档 +└── README.md +``` + + + +#### 3.2 数据标注 + +我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano) 进行数据标注,本示例也打通了从标注到训练的通道,即 doccano 导出数据后可通过[doccano.py](./doccano.py)脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考[doccano 数据标注指南](doccano.md)。 + +原始数据示例: + +```text +深大到双龙28块钱4月24号交通费 +``` + +抽取的目标(schema)为: + +```python +schema = ['出发地', '目的地', '费用', '时间'] +``` + +标注步骤如下: + +- 在 doccano 平台上,创建一个类型为``序列标注``的标注项目。 +- 定义实体标签类别,上例中需要定义的实体标签有``出发地``、``目的地``、``费用``和``时间``。 +- 使用以上定义的标签开始标注数据,下面展示了一个 doccano 标注示例: + +
+ +
+ +- 标注完成后,在 doccano 平台上导出文件,并将其重命名为``doccano_ext.json``后,放入``./data``目录下。 + +- 这里我们提供预先标注好的文件[doccano_ext.json](https://bj.bcebos.com/paddlenlp/datasets/uie/doccano_ext.json),可直接下载并放入`./data`目录。执行以下脚本进行数据转换,执行后会在`./data`目录下生成训练/验证/测试集文件。 + +```shell +python doccano.py \ + --doccano_file ./data/doccano_ext.json \ + --save_dir ./data \ + --splits 0.8 0.2 0 \ + --schema_lang ch +``` + + +可配置参数说明: + +- ``doccano_file``: 从 doccano 导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``negative_ratio``: 最大负例比例,该参数只对抽取类型任务有效,适当构造负例可提升模型效果。负例数量和实际的标签数量有关,最大负例数量 = negative_ratio * 正例数量。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``task_type``: 选择任务类型,目前只有信息抽取这一种任务。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为 False。 +- ``seed``: 随机种子,默认为1000. +- ``schema_lang``: 选择 schema 的语言,可选有`ch`和`en`。默认为`ch`,英文数据集请选择`en`。 + +备注: +- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [doccano.py](./doccano.py) 脚本,将会覆盖已有的同名数据文件 +- 在模型训练阶段我们推荐构造一些负例以提升模型效果,在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例;负样本数量 = negative_ratio * 正样本数量。 +- 对于从 doccano 导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + + + + +#### 3.3 模型微调 + +推荐使用 [大模型精调](../../docs/finetune.md) 对模型进行微调。只需输入模型、数据集等就可以高效快速地进行微调和模型压缩等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,并且针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +使用下面的命令,使用 `paddlenlp/PP-UIE-0.5B` 作为预训练模型进行模型微调,将微调后的模型保存至指定路径中。 + +如果在 GPU 环境中使用,可以指定 gpus 参数进行多卡训练: + +```shell +cd ../../ +# 返回llm目录 +python -u -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/qwen/sft_argument.json +``` + +`sft_argument.json` 的参考配置如下: +```shell +{ + "model_name_or_path": "paddlenlp/PP-UIE-0.5B", + "dataset_name_or_path": "./application/information_extraction/data", + "output_dir": "./checkpoints/ie_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 1, + "per_device_eval_batch_size": 1, + "eval_accumulation_steps":8, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": false, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "sharding": "stage2", + "zero_padding": false, + "unified_checkpoint": true, + "use_flash_attention": false + } +``` +更多 sft_argument.json 配置文件说明,请参考[大模型精调](../../docs/finetune.md) + + + + +#### 3.4 定制模型一键预测 + +1. 使用 PaddleNLP的高性能 predictor进行快速推理 +- 内置全环节融合算子策略 +- 支持 Weight Only INT8及 INT4推理,支持权重、激活、Cache KV 进行 INT8、FP8量化的推理 +- 支持动态图推理和静态图推理两种方式 + +```shell +# llm目录下 +python predict/predictor.py \ + --model_name_or_path ./checkpoints/ie_ckpts \ + --dtype float16 \ + --data_file ./application/information_extraction/data/test.json \ + --output_file ./output.json \ + --src_length 512 \ + --max_length 20 \ + --batch_size 4 \ +``` +更多关于 `predictor.py` 的配置参数说明,请参考[大模型推理教程](../../docs/predict/inference.md) + +2. 
使用 taskflow进行快速推理 +`paddlenlp.Taskflow`支持装载定制模型,通过`task_path`指定模型权重文件的路径,路径下需要包含训练好的模型权重文件 + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = ['出发地', '目的地', '费用', '时间'] +# 设定抽取目标和定制化模型权重路径 +>>> my_ie = Taskflow("information_extraction", schema=schema, model='paddlenlp/PP-UIE-0.5B',precision = "float16", task_path='./checkpoints/ie_ckpts') +>>> pprint(my_ie("城市内交通费7月5日金额114广州至佛山")) +[{'出发地': [{'text': '广州'}], + '时间': [{'text': '7月5日'}], + '目的地': [{'text': '佛山'}], + '费用': [{'text': '114'}]}] +``` + + + + + +#### 3.5 实验指标 + +我们在通用测试集和医疗、新闻、对话与金融等垂类测试集上进行了实验: + + + + + + + + + + + + + +
+| 模型名称 | 数据集名称 | CMeEE-V2 | Boson | CLUENER | CCIR2021-NER | 任务对话2018-NER | 银行借贷2021-NER | SKE2019 | Avg |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+|  | 数据集领域 | 医疗领域 | 通用领域 | 通用领域 | 新闻领域 | 对话领域 | 金融领域 | 金融领域 |  |
+| PP-UIE-0.5B | F1(0-shot) | 0.479 | 0.638 | 0.593 | 0.773 | 0.723 | 0.361 | 0.782 | 0.621 |
+| PP-UIE-1.5B | F1(0-shot) | 0.485 | 0.688 | 0.61 | 0.799 | 0.768 | 0.444 | 0.803 | 0.657 |
+|  | F1(5-shot) | 0.52 | 0.694 | 0.625 | 0.812 | 0.812 | 0.466 | 0.801 | 0.676 |
+| PP-UIE-7B | F1(0-shot) | 0.521 | 0.696 | 0.615 | 0.826 | 0.807 | 0.434 | 0.812 | 0.673 |
+|  | F1(5-shot) | 0.527 | 0.705 | 0.626 | 0.826 | 0.861 | 0.483 | 0.801 | 0.69 |
+| PP-UIE-14B | F1(0-shot) | 0.556 | 0.712 | 0.637 | 0.841 | 0.843 | 0.488 | 0.832 | 0.701 |
+|  | F1(5-shot) | 0.588 | 0.729 | 0.67 | 0.837 | 0.865 | 0.576 | 0.832 | 0.728 |
+ + +0-shot 表示无训练数据直接通过模型进行预测,5-shot 表示预测时使用五个数据样例作为提示。**实验表明 PP-UIE 在垂类场景可以通过少量数据(few-shot)进一步提升效果**。 \ No newline at end of file diff --git a/llm/application/information_extraction/doccano.md b/llm/application/information_extraction/doccano.md new file mode 100644 index 000000000000..eaa3f0a086ff --- /dev/null +++ b/llm/application/information_extraction/doccano.md @@ -0,0 +1,260 @@ +# doccano + + **目录** + +* [1. 安装](#安装) +* [2. 项目创建](#项目创建) +* [3. 数据上传](#数据上传) +* [4. 标签构建](#标签构建) +* [5. 任务标注](#任务标注) +* [6. 数据导出](#数据导出) +* [7. 数据转换](#数据转换) + + + +## 1. 安装 + +参考[doccano 官方文档](https://github.com/doccano/doccano) 完成 doccano 的安装与初始配置。 + +**以下标注示例用到的环境配置:** + +- doccano 1.6.2 + + + +## 2. 项目创建 + +PP-UIE 支持抽取类型的任务,根据实际需要创建一个新的项目: + +#### 2.1 抽取式任务项目创建 + +创建项目时选择**序列标注**任务,并勾选**Allow overlapping entity**及**Use relation Labeling**。适配**命名实体识别、关系抽取、事件抽取、评价观点抽取**等任务。 + +
+ +
+ + + +## 3. 数据上传 + +上传的文件为 txt 格式,每一行为一条待标注文本,示例: + +```text +2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌 +第十四届全运会在西安举办 +``` + +上传数据类型**选择 TextLine**: + +
+ +
+ +**NOTE**:doccano 支持`TextFile`、`TextLine`、`JSONL`和`CoNLL`四种数据上传格式,PP-UIE 定制训练中**统一使用 TextLine**这一文件格式,即上传的文件需要为 txt 格式,且在数据标注时,该文件的每一行待标注文本显示为一页内容。 + + + +## 4. 标签构建 + +#### 4.1 构建抽取式任务标签 + +抽取式任务包含**Span**与**Relation**两种标签类型,Span 指**原文本中的目标信息片段**,如实体识别中某个类型的实体,事件抽取中的触发词和论元;Relation 指**原文本中 Span 之间的关系**,如关系抽取中两个实体(Subject&Object)之间的关系,事件抽取中论元和触发词之间的关系。 + +Span 类型标签构建示例: + +
+ +
+ +Relation 类型标签构建示例: + +
+ +
+ + +## 5. 任务标注 + +#### 5.1 命名实体识别 + +命名实体识别(Named Entity Recognition,简称 NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,**抽取的类别没有限制,用户可以自己定义**。 + +标注示例: + +
+ +
+ +示例中定义了`时间`、`选手`、`赛事名称`和`得分`四种 Span 类型标签。 + +```text +schema = [ + '时间', + '选手', + '赛事名称', + '得分' +] +``` + +#### 5.2 关系抽取 + +关系抽取(Relation Extraction,简称 RE),是指从文本中识别实体并抽取实体之间的语义关系,即抽取三元组(实体一,关系类型,实体二)。 + +标注示例: + +
+ +
+ +示例中定义了`作品名`、`人物名`和`时间`三种 Span 类型标签,以及`歌手`、`发行时间`和`所属专辑`三种 Relation 标签。Relation 标签**由 Subject 对应实体指向 Object 对应实体**。 + +该标注示例对应的 schema 为: + +```text +schema = { + '作品名': [ + '歌手', + '发行时间', + '所属专辑' + ] +} +``` + +#### 5.3 事件抽取 + +事件抽取 (Event Extraction, 简称 EE),是指从自然语言文本中抽取事件并识别事件类型和事件论元的技术。UIE 所包含的事件抽取任务,是指根据已知事件类型,抽取该事件所包含的事件论元。 + +标注示例: + +
+ +
+ +示例中定义了`地震触发词`(触发词)、`等级`(事件论元)和`时间`(事件论元)三种 Span 标签,以及`时间`和`震级`两种 Relation 标签。触发词标签**统一格式为`XX 触发词`**,`XX`表示具体事件类型,上例中的事件类型是`地震`,则对应触发词为`地震触发词`。Relation 标签**由触发词指向对应的事件论元**。 + +该标注示例对应的 schema 为: + +```text +schema = { + '地震触发词': [ + '时间', + '震级' + ] +} +``` + + + + +## 6. 数据导出 + +#### 6.1 导出抽取式任务数据 + +选择导出的文件类型为``JSONL(relation)``,导出数据示例: + +```text +{ + "id": 38, + "text": "百科名片你知道我要什么,是歌手高明骏演唱的一首歌曲,1989年发行,收录于个人专辑《丛林男孩》中", + "relations": [ + { + "id": 20, + "from_id": 51, + "to_id": 53, + "type": "歌手" + }, + { + "id": 21, + "from_id": 51, + "to_id": 55, + "type": "发行时间" + }, + { + "id": 22, + "from_id": 51, + "to_id": 54, + "type": "所属专辑" + } + ], + "entities": [ + { + "id": 51, + "start_offset": 4, + "end_offset": 11, + "label": "作品名" + }, + { + "id": 53, + "start_offset": 15, + "end_offset": 18, + "label": "人物名" + }, + { + "id": 54, + "start_offset": 42, + "end_offset": 46, + "label": "作品名" + }, + { + "id": 55, + "start_offset": 26, + "end_offset": 31, + "label": "时间" + } + ] +} +``` + +标注数据保存在同一个文本文件中,每条样例占一行且存储为``json``格式,其包含以下字段 +- ``id``: 样本在数据集中的唯一标识 ID。 +- ``text``: 原始文本数据。 +- ``entities``: 数据中包含的 Span 标签,每个 Span 标签包含四个字段: + - ``id``: Span 在数据集中的唯一标识 ID。 + - ``start_offset``: Span 的起始 token 在文本中的下标。 + - ``end_offset``: Span 的结束 token 在文本中下标的下一个位置。 + - ``label``: Span 类型。 +- ``relations``: 数据中包含的 Relation 标签,每个 Relation 标签包含四个字段: + - ``id``: (Span1, Relation, Span2)三元组在数据集中的唯一标识 ID,不同样本中的相同三元组对应同一个 ID。 + - ``from_id``: Span1对应的标识 ID。 + - ``to_id``: Span2对应的标识 ID。 + - ``type``: Relation 类型。 + + + + +## 7.数据转换 + +该章节详细说明如何通过`doccano.py`脚本对 doccano 平台导出的标注数据进行转换,一键生成训练/验证/测试集。 + +#### 7.1 抽取式任务数据转换 + +- 当标注完成后,在 doccano 平台上导出 `JSONL(relation)` 形式的文件,并将其重命名为 `doccano_ext.json` 后,放入 `./data` 目录下。 +- 通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以开始进行相应模型训练。 + +```shell +python doccano.py \ + --doccano_file ./data/doccano_ext.json \ + --save_dir ./data \ + --negative_ratio 5 +``` + +可配置参数说明: + +- ``doccano_file``: 从 doccano 导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``negative_ratio``: 最大负例比例,该参数只对抽取类型任务有效,适当构造负例可提升模型效果。负例数量和实际的标签数量有关,最大负例数量 = negative_ratio * 正例数量。该参数只对训练集有效,默认为5。为了保证评估指标的准确性,验证集和测试集默认构造全正例。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``task_type``: 选择任务类型,目前只有信息抽取这一种任务。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为 True。 +- ``seed``: 随机种子,默认为1000. +- ``schema_lang``: 选择 schema 的语言,可选有`ch`和`en`。默认为`ch`,英文数据集请选择`en`。 + +备注: +- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [doccano.py](./doccano.py) 脚本,将会覆盖已有的同名数据文件 +- 在模型训练阶段我们推荐构造一些负例以提升模型效果,在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例;负样本数量 = negative_ratio * 正样本数量。 +- 对于从 doccano 导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + +## References +- **[doccano](https://github.com/doccano/doccano)** diff --git a/llm/application/information_extraction/doccano.py b/llm/application/information_extraction/doccano.py new file mode 100644 index 000000000000..8f0ff50988b6 --- /dev/null +++ b/llm/application/information_extraction/doccano.py @@ -0,0 +1,146 @@ +# coding=utf-8 +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import time +from decimal import Decimal + +import numpy as np +from utils import convert_llm_examples, set_seed + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.doccano_file): + raise ValueError("Please input the correct path of doccano file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + def _check_sum(splits): + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if len(args.splits) == 3 and not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.doccano_file, "r", encoding="utf-8") as f: + raw_examples = f.readlines() + + def _create_llm_examples( + examples, + negative_ratio, + shuffle=False, + is_train=True, + schema_lang="ch", + ): + entities, relations = convert_llm_examples(examples, negative_ratio, is_train, schema_lang) + examples = entities + relations + if shuffle: + indexes = np.random.permutation(len(examples)) + examples = [examples[i] for i in indexes] + return examples + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + if len(args.splits) == 0: + examples = _create_llm_examples( + raw_examples, + args.negative_ratio, + args.is_shuffle, + schema_lang=args.schema_lang, + ) + + _save_examples(args.save_dir, "train.json", examples) + + else: + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + index_list = indexes.tolist() + raw_examples = [raw_examples[i] for i in indexes] + else: + index_list = list(range(len(raw_examples))) + + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + train_ids = index_list[:p1] + dev_ids = index_list[p1:p2] + test_ids = index_list[p2:] + + with open(os.path.join(args.save_dir, "sample_index.json"), "w") as fp: + maps = {"train_ids": train_ids, "dev_ids": dev_ids, "test_ids": test_ids} + fp.write(json.dumps(maps)) + + train_examples = _create_llm_examples( + raw_examples[:p1], + args.negative_ratio, + args.is_shuffle, + schema_lang=args.schema_lang, + ) + dev_examples = _create_llm_examples( + raw_examples[p1:p2], + -1, + is_train=False, + schema_lang=args.schema_lang, + ) + test_examples = _create_llm_examples( + raw_examples[p2:], + -1, + is_train=False, + schema_lang=args.schema_lang, + ) + + _save_examples(args.save_dir, "train.json", train_examples) + _save_examples(args.save_dir, "dev.json", dev_examples) + _save_examples(args.save_dir, "test.json", test_examples) + + logger.info("Finished! 
It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--doccano_file", default="./data/doccano_ext.json", type=str, help="The doccano file exported from doccano platform.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") + parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task, the ratio of positive and negative samples, number of negtive samples = negative_ratio * number of positive samples") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--task_type", choices="ie", default="ie", type=str, help="Select task type, ie for the information extraction task used qwen2, defaults to ie.") + parser.add_argument("--is_shuffle", default="False", type=strtobool, help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.") + + args = parser.parse_args() + # yapf: enable + + do_convert() diff --git a/llm/application/information_extraction/utils.py b/llm/application/information_extraction/utils.py new file mode 100644 index 000000000000..be4cde905a41 --- /dev/null +++ b/llm/application/information_extraction/utils.py @@ -0,0 +1,348 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import random + +import numpy as np +import paddle +from tqdm import tqdm + +from paddlenlp.utils.log import logger + +prompt_format = """你是一个阅读理解专家,请提取所给句子与问题,提取实体。请注意,如果存在实体,则一定在原句中逐字出现,请输出对应实体的原文,不要进行额外修改;如果无法提取,请输出“无相应实体”。 +**句子开始** +{sentence} +**句子结束** +**问题开始** +{prompt} +**问题结束** +**回答开始** +""" + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None): + """ + Create dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. 
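+    Example:
+        # Illustrative usage (assumes `ds` is an existing paddle.io.Dataset instance)
+        loader = create_data_loader(ds, mode="train", batch_size=8)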
+ """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True) + return dataloader + + +def add_entity_negative_example(examples, texts, prompts, label_set, negative_ratio): + negative_examples = [] + positive_examples = [] + with tqdm(total=len(prompts)) as pbar: + for i, prompt in enumerate(prompts): + redundants = list(set(label_set) ^ set(prompt)) + redundants.sort() + + num_positive = len(examples[i]) + if num_positive != 0: + actual_ratio = math.ceil(len(redundants) / num_positive) + else: + # Set num_positive to 1 for text without positive example + num_positive, actual_ratio = 1, 0 + + if actual_ratio <= negative_ratio or negative_ratio == -1: + idxs = [k for k in range(len(redundants))] + else: + idxs = random.sample(range(0, len(redundants)), negative_ratio * num_positive) + + for idx in idxs: + src = prompt_format.format_map({"sentence": texts[i], "prompt": redundants[idx]}) + negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"} + # negative_result = {"content": texts[i], "result_list": [], "prompt": redundants[idx]} + negative_examples.append(negative_result) + positive_examples.extend(examples[i]) + pbar.update(1) + return positive_examples, negative_examples + + +def add_relation_negative_example(redundants, text, num_positive, ratio): + added_example = [] + rest_example = [] + + if num_positive != 0: + actual_ratio = math.ceil(len(redundants) / num_positive) + else: + # Set num_positive to 1 for text without positive example + num_positive, actual_ratio = 1, 0 + + all_idxs = [k for k in range(len(redundants))] + if actual_ratio <= ratio or ratio == -1: + idxs = all_idxs + rest_idxs = [] + else: + idxs = random.sample(range(0, len(redundants)), ratio * num_positive) + rest_idxs = list(set(all_idxs) ^ set(idxs)) + + for idx in idxs: + src = prompt_format.format_map({"sentence": text, "prompt": redundants[idx]}) + negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"} + added_example.append(negative_result) + + for rest_idx in rest_idxs: + src = prompt_format.format_map({"sentence": text, "prompt": redundants[idx]}) + negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"} + rest_example.append(negative_result) + + return added_example, rest_example + + +def add_full_negative_example(examples, texts, relation_prompts, predicate_set, subject_goldens, schema_lang="ch"): + with tqdm(total=len(relation_prompts)) as pbar: + for i, relation_prompt in enumerate(relation_prompts): + negative_sample = [] + for subject in subject_goldens[i]: + for predicate in predicate_set: + # The relation prompt is constructed as follows: + # subject + "的" + predicate -> Chinese + # predicate + " of " + subject -> English + if schema_lang == "ch": + prompt = subject + "的" + predicate + else: + prompt = predicate + " of " + subject + if prompt not in relation_prompt: + src = prompt_format.format_map({"sentence": texts[i], "prompt": prompt}) + negative_result = {"src": src, "tgt": "无相应实体\n**回答结束**\n\n"} + negative_sample.append(negative_result) + examples[i].extend(negative_sample) + pbar.update(1) + return examples + + +def convert_llm_examples( + raw_examples, + negative_ratio, + is_train=True, + schema_lang="ch", +): + """ + Convert 
labeled data export from doccano for extraction and aspect-level classification task. + """ + + texts = [] + entity_examples = [] + relation_examples = [] + entity_prompts = [] + relation_prompts = [] + entity_label_set = [] + entity_name_set = [] + predicate_set = [] + subject_goldens = [] + inverse_relation_list = [] + predicate_list = [] + + logger.info("Converting doccano data...") + with tqdm(total=len(raw_examples)) as pbar: + for line in raw_examples: + items = json.loads(line) + # Export file in JSONL format which doccano >= 1.7.0 + # Export file in JSONL (relation) format + # e.g. {"text": "", "relations": [ {"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}, ... ], "entities": [ {"id": 0, "from_id": 0, "to_id": 1, "type": "foundedAt"}, ... ]} + text, relations, entities = items["text"], items["relations"], items["entities"] + texts.append(text) + entity_example = [] + entity_prompt = [] + entity_example_map = {} + entity_map = {} # id to entity name + for entity in entities: + entity_name = text[entity["start_offset"] : entity["end_offset"]] + entity_label = entity["label"] + entity_map[entity["id"]] = { + "name": entity_name, + "start": entity["start_offset"], + "end": entity["end_offset"], + } + + src = prompt_format.format_map({"sentence": text, "prompt": entity_label}) + + if entity_label not in entity_example_map.keys(): + entity_example_map[entity_label] = {"src": src, "tgt": [entity_name]} + else: + entity_example_map[entity_label]["tgt"].append(entity_name) + + if entity_label not in entity_label_set: + entity_label_set.append(entity_label) + if entity_name not in entity_name_set: + entity_name_set.append(entity_name) + entity_prompt.append(entity_label) + + for label, v in entity_example_map.items(): + v["tgt"] = ",".join(v["tgt"]) + "\n**回答结束**\n\n" + entity_example.append(v) + entity_examples.append(entity_example) + entity_prompts.append(entity_prompt) + + subject_golden = [] # Golden entity inputs + relation_example = [] + relation_prompt = [] + relation_example_map = {} + inverse_relation = [] + predicates = [] + for relation in relations: + predicate = relation["type"] + subject_id = relation["from_id"] + object_id = relation["to_id"] + # The relation prompt is constructed as follows: + # subject + "的" + predicate -> Chinese + # predicate + " of " + subject -> English + if schema_lang == "ch": + prompt = entity_map[subject_id]["name"] + "的" + predicate + inverse_negative = entity_map[object_id]["name"] + "的" + predicate + else: + prompt = predicate + " of " + entity_map[subject_id]["name"] + inverse_negative = predicate + " of " + entity_map[object_id]["name"] + + if entity_map[subject_id]["name"] not in subject_golden: + subject_golden.append(entity_map[subject_id]["name"]) + + src = prompt_format.format_map({"sentence": text, "prompt": prompt}) + + inverse_relation.append(inverse_negative) + predicates.append(predicate) + + if prompt not in relation_example_map.keys(): + relation_example_map[prompt] = {"src": src, "tgt": [entity_map[object_id]["name"]]} + else: + relation_example_map[prompt]["tgt"].append(entity_map[object_id]["name"]) + + if predicate not in predicate_set: + predicate_set.append(predicate) + relation_prompt.append(prompt) + + for v in relation_example_map.values(): + v["tgt"] = ",".join(v["tgt"]) + "\n**回答结束**\n\n" + relation_example.append(v) + + relation_examples.append(relation_example) + relation_prompts.append(relation_prompt) + subject_goldens.append(subject_golden) + inverse_relation_list.append(inverse_relation) + 
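+            # inverse_relation and predicates are reused later when constructing negative relation samples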
predicate_list.append(predicates) + pbar.update(1) + + logger.info("Adding negative samples for first stage prompt...") + positive_examples, negative_examples = add_entity_negative_example( + entity_examples, texts, entity_prompts, entity_label_set, negative_ratio + ) + if len(positive_examples) == 0: + all_entity_examples = [] + else: + all_entity_examples = positive_examples + negative_examples + + all_relation_examples = [] + if len(predicate_set) != 0: + logger.info("Adding negative samples for second stage prompt...") + if is_train: + + positive_examples = [] + negative_examples = [] + per_n_ratio = negative_ratio // 3 + + with tqdm(total=len(texts)) as pbar: + for i, text in enumerate(texts): + negative_example = [] + collects = [] + num_positive = len(relation_examples[i]) + + # 1. inverse_relation_list + redundants1 = inverse_relation_list[i] + # 2. entity_name_set ^ subject_goldens[i] + redundants2 = [] + if len(predicate_list[i]) != 0: + nonentity_list = list(set(entity_name_set) ^ set(subject_goldens[i])) + nonentity_list.sort() + + if schema_lang == "ch": + redundants2 = [ + nonentity + "的" + predicate_list[i][random.randrange(len(predicate_list[i]))] + for nonentity in nonentity_list + ] + else: + redundants2 = [ + predicate_list[i][random.randrange(len(predicate_list[i]))] + " of " + nonentity + for nonentity in nonentity_list + ] + # 3. entity_label_set ^ entity_prompts[i] + redundants3 = [] + if len(subject_goldens[i]) != 0: + non_ent_label_list = list(set(entity_label_set) ^ set(entity_prompts[i])) + non_ent_label_list.sort() + + if schema_lang == "ch": + redundants3 = [ + subject_goldens[i][random.randrange(len(subject_goldens[i]))] + "的" + non_ent_label + for non_ent_label in non_ent_label_list + ] + else: + redundants3 = [ + non_ent_label + " of " + subject_goldens[i][random.randrange(len(subject_goldens[i]))] + for non_ent_label in non_ent_label_list + ] + redundants_list = [redundants1, redundants2, redundants3] + + for redundants in redundants_list: + added, rest = add_relation_negative_example( + redundants, + texts[i], + num_positive, + per_n_ratio, + ) + negative_example.extend(added) + collects.extend(rest) + + num_sup = num_positive * negative_ratio - len(negative_example) + if num_sup > 0 and collects: + if num_sup > len(collects): + idxs = [k for k in range(len(collects))] + else: + idxs = random.sample(range(0, len(collects)), num_sup) + for idx in idxs: + negative_example.append(collects[idx]) + + positive_examples.extend(relation_examples[i]) + negative_examples.extend(negative_example) + pbar.update(1) + all_relation_examples = positive_examples + negative_examples + else: + relation_examples = add_full_negative_example( + relation_examples, texts, relation_prompts, predicate_set, subject_goldens, schema_lang=schema_lang + ) + all_relation_examples = [r for relation_example in relation_examples for r in relation_example] + + return all_entity_examples, all_relation_examples diff --git a/llm/auto_parallel/deepseek-v3/run_pretrain_auto.py b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.py new file mode 100644 index 000000000000..91381cb1e05a --- /dev/null +++ b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.py @@ -0,0 +1,725 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +deepseek-v3 auto parallel pretraining scripts. +""" +import os +import random +import sys +import types +from collections import OrderedDict +from dataclasses import dataclass, field +from typing import List, Optional + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.distributed import fleet + +from paddlenlp.ops import Topology +from paddlenlp.trainer import ( + AutoTrainingArguments, + PdArgumentParser, + get_last_checkpoint, +) +from paddlenlp.trainer.auto_trainer import AutoTrainer +from paddlenlp.trainer.trainer_utils import IntervalStrategy, _get_distributed_seeds +from paddlenlp.transformers import ( + AutoTokenizer, + CosineAnnealingWithWarmupDecay, + DeepseekV2Config, + DeepseekV2PretrainingCriterion, + DeepseekV3ForCausalLMAuto, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "deepseekv3_auto": (DeepseekV2Config, DeepseekV3ForCausalLMAuto, DeepseekV2PretrainingCriterion), +} + + +from paddlenlp.data.causal_dataset import ( + build_train_valid_test_datasets, + check_data_split, + print_rank_0, +) +from paddlenlp.trainer.utils.doc import add_start_docstrings + + +@dataclass +@add_start_docstrings(AutoTrainingArguments.__doc__) +class PreTrainingArguments(AutoTrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + enable_linear_fused_grad_add: bool = field( + default=False, + metadata={ + "help": "Enable fused linear grad add strategy, which will reduce elementwise add for grad accumulation in the backward of nn.Linear ." + }, + ) + job_schedule_profiler_start: int = field( + default=-1, + metadata={"help": "The step to start job_schedule_profiler."}, + ) + job_schedule_profiler_end: int = field( + default=-1, + metadata={"help": "The step to end job_schedule_profiler."}, + ) + pipeline_schedule_mode: str = field( + default="1F1B", metadata={"help": "The pipeline schedule mode, support FThenB, 1F1B, VPP and Eager-1F1B."} + ) + sr: Optional[int] = field(default=0, metadata={"help": "The count of chunks without recompute."}) + virtual_pipeline_seg_method: str = field( + default="DeepseekV2DecoderLayerAuto", + metadata={"help": "The seg method of spliting pp layer for virtual pipeline."}, + ) + # NOTE(gongenlei): new add autotuner_benchmark + autotuner_benchmark: bool = field( + default=False, + metadata={"help": "Weather to run benchmark by autotuner. 
True for from_scratch and pad_max_length."}, + ) + + def __post_init__(self): + super().__post_init__() + assert self.enable_auto_parallel + + # NOTE(gongenlei): new add autotuner_benchmark + if self.autotuner_benchmark: + self.max_steps = 5 + self.do_train = True + self.do_export = False + self.do_predict = False + self.do_eval = False + self.overwrite_output_dir = True + self.load_best_model_at_end = False + self.report_to = [] + self.save_strategy = IntervalStrategy.NO + self.evaluation_strategy = IntervalStrategy.NO + + logger.info(self.strategy) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=1024, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + + data_impl: str = field(default="mmap", metadata={"help": "The format of the preprocessed data."}) + skip_warmup: bool = field( + default=True, + metadata={"help": "Whether to skip the warmup process of mmap files."}, + ) + data_cache: str = field(default=None, metadata={"help": "The path of the cached dataset."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. + """ + + model_type: Optional[str] = field( + default="deepseekv3", metadata={"help": "Only support for llama pre-training for now."} + ) + model_name_or_path: str = field( + default="deepseek-ai/DeepSeek-V3", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + vocab_size: Optional[int] = field( + default=None, + metadata={ + "help": ".Vocabulary size of the deeepseekv2 model. 
Defines the number of different tokens that can be represented by the `inputs_ids`" + }, + ) + hidden_size: Optional[int] = field(default=None, metadata={"help": "Dimension of the hidden representations."}) + intermediate_size: Optional[int] = field(default=None, metadata={"help": "Dimension of the MLP representations."}) + num_hidden_layers: Optional[int] = field( + default=None, metadata={"help": "Number of hidden layers in the Transformer encoder."} + ) + num_attention_heads: Optional[int] = field( + default=None, + metadata={"help": "Number of attention heads for each attention layer in the Transformer encoder."}, + ) + use_flash_attention: bool = field( + default=False, + metadata={"help": "use_flash_attention"}, + ) + use_fused_rms_norm: bool = field( + default=False, + metadata={"help": "deepseekv3, use_fused_rms_norm"}, + ) + fuse_attention_qkv: bool = field( + default=False, + metadata={"help": "whether to fuse attention qkv"}, + ) + fuse_attention_ffn: bool = field( + default=False, + metadata={"help": "whether to fuse first up and gate proj in mlp block"}, + ) + recompute_granularity: str = field( + default="full", + metadata={"help": "Choose among ['full', 'core_attn', 'full_attn']"}, + ) + virtual_pp_degree: int = field( + default=1, + metadata={"help": "virtual_pp_degree"}, + ) + continue_training: bool = field( + default=False, + metadata={ + "help": "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models." + }, + ) + use_fused_rope: Optional[bool] = field( + default=False, + metadata={"help": "Enable rope fusion or not."}, + ) + no_recompute_layers: Optional[List[int]] = field( + default=None, + metadata={"help": "Specify the full transformer layers that should not be recomputed."}, + ) + pp_recompute_interval: int = field( + default=1, + metadata={ + "help": "The interval for the number of layers at which recomputation occurs. A value of 0 indicates no recomputation. Default is 0." + }, + ) + recompute_use_reentrant: bool = field( + default=False, + metadata={"help": "recompute_use_reentrant"}, + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + need_data=True, +): + + check_data_split(data_args.split, training_args.do_train, training_args.do_eval, training_args.do_predict) + + train_val_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.dataset_world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters, + ] + + print_rank_0(" > datasets target sizes (minimum size):") + if training_args.do_train: + print_rank_0(" train: {}".format(train_val_test_num_samples[0])) + if training_args.do_eval: + print_rank_0(" validation: {}".format(train_val_test_num_samples[1])) + if training_args.do_predict: + print_rank_0(" test: {}".format(train_val_test_num_samples[2])) + + # Build the datasets. 
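To make the `train_val_test_num_samples` computation above concrete, here is a worked example. Only the formulas are taken from the code; the hyperparameter values are made up for illustration (`eval_iters` is set to 10 and `test_iters` to `eval_iters * 10` later in `main()`).

```python
# Worked example (made-up hyperparameters) of the minimum dataset sizes computed above.
per_device_train_batch_size = 1
per_device_eval_batch_size = 2
dataset_world_size = 4            # number of data-parallel replicas
gradient_accumulation_steps = 2
max_steps = 10_000
eval_steps = 1_000
eval_iters = 10                   # main() sets training_args.eval_iters = 10
test_iters = eval_iters * 10

train_samples = per_device_train_batch_size * dataset_world_size * max_steps * gradient_accumulation_steps
valid_samples = per_device_eval_batch_size * dataset_world_size * eval_iters * (max_steps // eval_steps + 1)
test_samples = per_device_eval_batch_size * dataset_world_size * test_iters

print(train_samples, valid_samples, test_samples)  # 80000 880 800
```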
+ train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets( + data_prefix=data_file, + data_impl=data_args.data_impl, + splits_string=data_args.split, + train_val_test_num_samples=train_val_test_num_samples, + seq_length=data_args.max_seq_length, + seed=training_args.seed, + skip_warmup=data_args.skip_warmup, + share_folder=data_args.share_folder, + data_cache_path=data_args.data_cache, + need_data=need_data, + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode.") + # input_ids, loss_mask, attention_mask, position_ids, labels = data + input_ids = data["text"] + + logger.info(tokenizer._decode(input_ids)) + + from paddlenlp.data import Stack + + def _collate_data(data, stack_fn=Stack()): + tokens_ = stack_fn([x["text"] for x in data]) + + labels = tokens_[:, 1:] + tokens = tokens_[:, :-1] + + return { + "input_ids": tokens, + "labels": labels, + } + + if need_data: + if training_args.do_train: + print_dataset(train_dataset[0], "train") + if training_args.do_eval: + print_dataset(valid_dataset[0], "valid") + if training_args.do_predict: + print_dataset(test_dataset[0], "test") + + return train_dataset, valid_dataset, test_dataset, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] # add + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +class PretrainingTrainer(AutoTrainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.is_pretraining = True + + def _wrap_for_dist_loader(self, train_dataloader): + dist_loader = super()._wrap_for_dist_loader(train_dataloader) + dist_loader._input_keys = ["input_ids", "labels"] + return dist_loader + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + if self.train_dataset is None: + return None + + total_batch_size_per_acc_step = self.args.per_device_train_batch_size * self.args.dataset_world_size + total_batch_size = total_batch_size_per_acc_step + + # In llm/llama/run_pretrain.py, it uses paddlenlp.utils.batch_sampler.DistributedBatchSampler, + # which does no shuffle when shuffle is set True. 
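The `_collate_data` helper defined earlier in this file builds next-token prediction targets by shifting the stacked token matrix by one position. Below is a minimal NumPy sketch of that transformation; the real collate function stacks Paddle-compatible samples with `paddlenlp.data.Stack`, and the toy token values here are arbitrary.

```python
import numpy as np

# Two fake samples shaped like the dataset items consumed by _collate_data.
samples = [
    {"text": np.array([11, 12, 13, 14, 15])},
    {"text": np.array([21, 22, 23, 24, 25])},
]

tokens_ = np.stack([x["text"] for x in samples])  # shape: [batch, seq_len]
batch = {
    "input_ids": tokens_[:, :-1],  # the tokens the model sees
    "labels": tokens_[:, 1:],      # the token each position should predict next
}
print(batch["input_ids"][0], "->", batch["labels"][0])  # [11 12 13 14] -> [12 13 14 15]
```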
+ sampler = paddle.io.BatchSampler( + dataset=self.train_dataset, + shuffle=False, + batch_size=total_batch_size, + drop_last=self.args.dataloader_drop_last, + ) + sampler._acc_steps = self.args.gradient_accumulation_steps + return sampler + + +def print_config(args, key=""): + """ + print config values + """ + logger.info("=" * 60) + if args is None: + args = args + key = "Training" + import paddlenlp + + logger.info("{:^40}".format("{} Configuration Arguments".format(key))) + logger.info("{:30}: {}".format("paddle commit id", paddle.version.commit)) + logger.info("{:30}: {}".format("paddlenlp commit id", paddlenlp.version.commit)) + + for a in dir(args): + if a[:2] != "__": # don't print double underscore methods + v = getattr(args, a) + if not isinstance(v, types.MethodType): + logger.info("{:30}: {}".format(a, v)) + + logger.info("") + + +def init_seed(seed: int = 1234, args=None): + if args is None: + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + else: + assert not args.use_hybrid_parallel and args.enable_auto_parallel + if dist.get_world_size() > 1: + if args.hybrid_parallel_topo_order is None or args.hybrid_parallel_topo_order == "pp_first": + order = ["pp", "dp", "sharding", "mp", "sep"] + elif args.hybrid_parallel_topo_order == "sharding_first": + order = ["dp", "sharding", "pp", "mp", "sep"] + topo = Topology( + dist.get_rank(), + dist.get_world_size(), + dp_degree=args.dataset_world_size, + pp_degree=args.pipeline_parallel_degree, + mp_degree=args.tensor_parallel_degree, + sharding_degree=1, # auto_parallel's sharding is not orthogonal with dp, mp and pp + order=order, + ) + + global_seed, local_seed, random_seed = _get_distributed_seeds(args.seed, topo) + + paddle.seed(local_seed) + random.seed(random_seed) + np.random.seed(random_seed) + + logger.info( + "The global seed is set to {}, local seed is set to {} and " + "random seed is set to {}.".format(global_seed, local_seed, random_seed) + ) + else: + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_mesh(pp_idx=0): + mesh = fleet.auto.get_mesh() + if "pp" in mesh.dim_names: + mesh = mesh.get_mesh_with_dim("pp")[pp_idx] + return mesh + + +def shard_fn(layer, mesh_idx, placements): + paran_name = layer.weight.name + layer.weight = dist.shard_tensor(layer.weight, get_mesh(mesh_idx), placements) + layer.weight.name = paran_name + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if training_args.enable_linear_fused_grad_add: + from fused_layers import mock_layers + + mock_layers() + + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + if data_args.data_cache is not None: + os.makedirs(data_args.data_cache, exist_ok=True) + + init_seed(args=training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: 
{training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class, criterion_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) + + config = config_class.from_pretrained(model_args.model_name_or_path) + + config.seq_length = data_args.max_seq_length + # There are some technique extend RotaryEmbedding context. so don't change max_position_embeddings + if not model_args.continue_training: + config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length) + + if not model_args.continue_training: + config.vocab_size = max(config.vocab_size, ((tokenizer.vocab_size - 1) // 128 + 1) * 128) + logger.info(f"Reset vocab size to {config.vocab_size} for batter amp peformance.") + + if model_args.no_recompute_layers is not None: + model_args.no_recompute_layers.sort() + + config.vocab_size = model_args.vocab_size if model_args.vocab_size is not None else config.vocab_size + config.hidden_size = model_args.hidden_size if model_args.hidden_size is not None else config.hidden_size + config.intermediate_size = ( + model_args.intermediate_size if model_args.intermediate_size is not None else config.intermediate_size + ) + config.num_hidden_layers = ( + model_args.num_hidden_layers if model_args.num_hidden_layers is not None else config.num_hidden_layers + ) + config.num_attention_heads = ( + model_args.num_attention_heads if model_args.num_attention_heads is not None else config.num_attention_heads + ) + + config.use_flash_attention = model_args.use_flash_attention + config.use_fused_rms_norm = model_args.use_fused_rms_norm + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.fuse_attention_ffn = model_args.fuse_attention_ffn + config.recompute_granularity = model_args.recompute_granularity + config.virtual_pp_degree = model_args.virtual_pp_degree + config.sequence_parallel = training_args.sequence_parallel + + config.fuse_sequence_parallel_allreduce = training_args.fuse_sequence_parallel_allreduce + + config.use_fused_rope = model_args.use_fused_rope + config.no_recompute_layers = model_args.no_recompute_layers + config.pp_recompute_interval = model_args.pp_recompute_interval + config.recompute_use_reentrant = model_args.recompute_use_reentrant + + config.use_recompute = training_args.recompute + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + config.sharding_parallel_degree = training_args.sharding_parallel_degree + + if training_args.strategy.pipeline.enable and config.virtual_pp_degree > 1: + pipeline = training_args.strategy.pipeline + pipeline.vpp_degree = config.virtual_pp_degree + pipeline.vpp_seg_method = training_args.virtual_pipeline_seg_method + + print("Final pre-training config:", config) + + # # Set the dtype for 
loading model + # dtype = "float32" + # if training_args.fp16_opt_level == "O2": + # if training_args.fp16: + # dtype = "float16" + # if training_args.bf16: + # dtype = "bfloat16" + + with paddle.LazyGuard(): + model = model_class.from_config(config, dtype="float32") + criterion = criterion_class(config) + + if training_args.recompute: + + def fn(layer): + if hasattr(layer, "enable_recompute") and (layer.enable_recompute is False or layer.enable_recompute == 0): + layer.enable_recompute = True + + model.apply(fn) + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + + if training_args.warmup_steps > 0: + warmup_steps = training_args.warmup_steps + else: + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = None + if training_args.lr_scheduler_type.value == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + elif training_args.lr_scheduler_type.value == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + + data_file = get_train_data_file(data_args) + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + need_data=training_args.should_load_dataset, + ) + trainer = PretrainingTrainer( + model=model, + criterion=criterion, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + + # NOTE(gongenlei): new add + if not training_args.autotuner_benchmark: + metrics = train_result.metrics + if not int(os.getenv("test_ci_no_save_model", 0)): + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + # if training_args.should_load_dataset: + # effective_tokens_per_second = total_effective_tokens / train_result.metrics["train_runtime"] + # print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}") + # print(f"ips: {effective_tokens_per_second:.2f} tokens/s") + + +def shard_model(model): + pp_stage = 0 + for name, layer in model.named_sublayers(include_self=False): + if hasattr(layer, "ipp"): + pp_stage = layer.ipp + # print(f"name {name},pp_stage {pp_stage}==>", type(layer)) + if "embed_tokens" in name: + # embedding only support column split now. 
it will update in the future + shard_fn(layer, 0, [dist.Replicate(), dist.Shard(1)]) + for n in [ + "self_attn.q_proj", + "self_attn.k_proj", + "self_attn.v_proj", + "self_attn.qkv_proj", + "gate_proj", + "up_proj", + "gate_up_fused_proj", + ]: + if n in name: + shard_fn(layer, pp_stage, [dist.Replicate(), dist.Shard(1)]) + break + for n in ["self_attn.o_proj", "down_proj"]: + if n in name: + shard_fn(layer, pp_stage, [dist.Replicate(), dist.Shard(0)]) + break + if "lm_head" in name: + shard_fn(layer, -1, [dist.Replicate(), dist.Shard(1)]) + + +def load_model(model): + model_state_dict = model.state_dict() + state_dict = paddle.load("hand/all.pdparams") + tmp = OrderedDict() + (tmp, state_dict) = (state_dict, tmp) + for (k, v) in tmp.items(): + k = map_structure_name(k) + state_dict[k] = v + model.set_state_dict(state_dict) + assert len(model_state_dict) == len(state_dict), f"{len(model_state_dict)} vs {len(state_dict)}" + """ + print("=======model_state_dict=======") + for (k,v) in model_state_dict.items(): + print(f"{k}=>{v.shape}") + """ + print("=======state_dict=======") + for (k, v) in state_dict.items(): + assert k in model_state_dict + print(f"{k}=>{v.shape}") + + +def print_grad(model): + model_state_dict = model.state_dict() + name_mapping = {v.name: k for (k, v) in model_state_dict.items()} + for p in model.parameters(): + assert p.name in name_mapping + if p.grad is not None: + print(f"{name_mapping[p.name]} {p.name}_grad shape: {p.grad.shape} md5sum: {p.grad._md5sum()}") + + +def print_param(model): + model_state_dict = model.state_dict() + name_mapping = {v.name: k for (k, v) in model_state_dict.items()} + for p in model.parameters(): + assert p.name in name_mapping + if p.grad is not None: + print(f"{name_mapping[p.name]} {p.name} shape: {p.shape} md5sum: {p._md5sum()}") + + +def map_structure_name(k): + fs = k.split(".") + idx = int(fs[1]) + if idx == 0: + return "deepseek_v2.embed_tokens.weight" + if idx == 28: + return "deepseek_v2.norm.weight" + if idx == 29: + return "lm_head.weight" + else: + return f"deepseek_v2.layers.{idx-1}." + ".".join(fs[2:]) + + +if __name__ == "__main__": + main() diff --git a/llm/auto_parallel/deepseek-v3/run_pretrain_auto.sh b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.sh new file mode 100644 index 000000000000..15dd24f8a0c5 --- /dev/null +++ b/llm/auto_parallel/deepseek-v3/run_pretrain_auto.sh @@ -0,0 +1,80 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
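The `shard_model` helper above encodes a simple naming convention: column-parallel weights (the q/k/v and fused qkv projections, gate/up projections, the embedding, and `lm_head`) are sharded along axis 1, row-parallel weights (`self_attn.o_proj`, `down_proj`) along axis 0, and everything is replicated over the data-parallel mesh axis. The classifier below is an illustrative, Paddle-free summary of that rule only (it ignores the per-layer pipeline stage) and is not part of the script.

```python
# Illustrative summary of the placement rules applied by shard_model above.
COLUMN_PARALLEL = (
    "embed_tokens",          # embedding currently only supports column split
    "self_attn.q_proj",
    "self_attn.k_proj",
    "self_attn.v_proj",
    "self_attn.qkv_proj",
    "gate_proj",
    "up_proj",
    "gate_up_fused_proj",
    "lm_head",
)
ROW_PARALLEL = ("self_attn.o_proj", "down_proj")


def placement_for(layer_name: str) -> str:
    """Return a human-readable placement for a named sublayer."""
    if any(key in layer_name for key in ROW_PARALLEL):
        return "[Replicate(), Shard(0)]  # row parallel"
    if any(key in layer_name for key in COLUMN_PARALLEL):
        return "[Replicate(), Shard(1)]  # column parallel"
    return "replicated (no tensor-parallel shard)"


for name in (
    "deepseek_v2.layers.0.self_attn.q_proj",
    "deepseek_v2.layers.0.self_attn.o_proj",
    "deepseek_v2.layers.0.mlp.down_proj",
    "deepseek_v2.norm",
):
    print(f"{name:45s} -> {placement_for(name)}")
```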
+ +#!/bin/bash +set -x +unset CUDA_VISIBLE_DEVICES + +task_name="deepseekv3" +rm -rf output/$task_name/ +rm -rf "output/$task_name""_log" + +export SOT_LOG_LEVEL=4 +export PYTHONPATH=../../../:$PYTHONPATH +#ulimit -c unlimited +# export GLOG_v=3 + +# export FLAGS_call_stack_level=3 +# export FLAGS_use_cuda_managed_memory=true + +# export FLAGS_embedding_deterministic=1 +# export FLAGS_cudnn_deterministic=1 +# export NVIDIA_TF32_OVERRIDE=0 + +to_static=0 # 是否开启动转静训练 + +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3" \ + --log_dir "output/$task_name""_log" \ + run_pretrain_auto.py \ + --model_type "deepseekv3_auto" \ + --model_name_or_path "deepseek-ai/DeepSeek-V3" \ + --tokenizer_name_or_path "deepseek-ai/DeepSeek-V3" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 2 \ + --use_flash_attention 0 \ + --use_fused_rms_norm 1 \ + --fp16 0 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --pipeline_parallel_degree 1 \ + --tensor_parallel_degree 2 \ + --sharding_parallel_degree 2 \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 2 \ + --save_steps 5000000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --logging_steps 1\ + --dataloader_num_workers 1 \ + --sharding "stage1" \ + --eval_steps 1000000 \ + --disable_tqdm true \ + --continue_training 0\ + --recompute 0 \ + --do_train \ + --do_eval \ + --device "gpu" \ + --data_impl "mmap" \ + --enable_auto_parallel 1 \ + --max_grad_norm 1.0 \ + --num_hidden_layers 1 \ + --use_intermediate_api true \ + --to_static $to_static \ diff --git a/llm/auto_parallel/gpt-3/gpt_with_intermediate.sh b/llm/auto_parallel/gpt-3/gpt_with_intermediate.sh index 9d6d28855008..dc99491e4d41 100644 --- a/llm/auto_parallel/gpt-3/gpt_with_intermediate.sh +++ b/llm/auto_parallel/gpt-3/gpt_with_intermediate.sh @@ -98,7 +98,7 @@ python -u -m paddle.distributed.launch \ --amp_master_grad true \ --attention_probs_dropout_prob 0.1 \ --hidden_dropout_prob 0.1 \ - --sharding_parallel_config "enable_stage1_tensor_fusion enable_stage1_overlap" \ + --sharding_parallel_config "enable_tensor_fusion enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate" \ --pipeline_parallel_config "enable_send_recv_overlap enable_split_backward" \ diff --git a/llm/auto_parallel/gpt-3/run_pretrain_auto.py b/llm/auto_parallel/gpt-3/run_pretrain_auto.py index b27a91a4d484..8414e306bfa7 100644 --- a/llm/auto_parallel/gpt-3/run_pretrain_auto.py +++ b/llm/auto_parallel/gpt-3/run_pretrain_auto.py @@ -224,15 +224,6 @@ class ModelArguments: hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."}) attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention hidden dropout prob."}) - - sequence_parallel: bool = field( - default=False, - metadata={"help": "whether to use sequence parallel"}, - ) - fuse_sequence_parallel_allreduce: bool = field( - default=False, - metadata={"help": "whether to use fuse sequence parallel allreduce"}, - ) use_fused_rope: Optional[bool] = field( default=False, metadata={"help": "Enable rope fusion or not."}, @@ -502,8 +493,8 @@ def main(): config.fuse_attention_ffn = model_args.fuse_attention_ffn config.recompute_granularity = model_args.recompute_granularity config.virtual_pp_degree = 
model_args.virtual_pp_degree - config.sequence_parallel = model_args.sequence_parallel - config.fuse_sequence_parallel_allreduce = model_args.fuse_sequence_parallel_allreduce + config.sequence_parallel = training_args.sequence_parallel + config.fuse_sequence_parallel_allreduce = training_args.fuse_sequence_parallel_allreduce config.use_fused_rope = model_args.use_fused_rope config.no_recompute_layers = model_args.no_recompute_layers config.pp_recompute_interval = model_args.pp_recompute_interval @@ -574,9 +565,6 @@ def fn(layer): need_data=training_args.should_load_dataset, ) - # load_model_auto(model) - # model = shard_model(model) - trainer = PretrainingTrainer( model=model, criterion=criterion, diff --git a/llm/auto_parallel/llama/README.md b/llm/auto_parallel/llama/README.md index 529f7ba2a8f3..c6aa0324b708 100644 --- a/llm/auto_parallel/llama/README.md +++ b/llm/auto_parallel/llama/README.md @@ -45,19 +45,30 @@ import paddle.distributed as dist ckpt_path='/path/for/dist_ckpt' # offload=1, 参数 offload 到 CPU,减少显存占用 -merged_state_dict = dist.checkpoint.load_state_dict.load_merged_state_dict(ckpt_path, offload=1) +# prefix="model" 参数可用于过滤掉非模型参数,例如 optimizer 状态等 +merged_state_dict = dist.checkpoint.load_state_dict.load_merged_state_dict(ckpt_path, offload=1, prefix="model") paddle.save(merged_state_dict, 'model_state.pdparams') -# 上述合并的模型参数格式为Paddle原生格式,如需转换为unified_param格式(safetensors),可继续执行如下代码: -python PaddleNLP/llm/auto_parallel/utils/convert_to_safetensors.py --input_path input_path [--output_path output_path] [--split_num split_num] [--offload offload] +# 上述合并的模型参数格式为Paddle原生格式,如需转换为unified checkpoint格式(safetensors),或需获取模型参数的index文件,继续执行如下代码: +python PaddleNLP/llm/auto_parallel/utils/convert_to_safetensors.py --input_path input_path [--output_path output_path] [--split_num split_num] [--offload] [--as_safetensors] # 参数介绍 --input_path: 输入的单卡模型参数路径 --output_path: 可选,输出模型参数路径,默认为'./temp' --split_num: 可选,输出的模型参数分片数,默认为 1 ---offload: 可选,是否将参数 offload 到 CPU,默认为 false +--offload: 可选,选项用于控制是否将参数 offload 到 CPU +--as_safetensors: 可选,选项用于控制是否将模型参数转换为 safetensors 格式 ``` - 动态图推理 [大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md) + +## 5.PPO 训练 +自动并行当前尚未支持 PPO 训练,后续会持续支持。但您可以将自动并行训练得到的模型参数转换后用于 PPO 训练。自动并行 ckpt 转手动并行 ckpt 流程参考**推理**部分。 + +- PPO 训练 + + [PPO 训练教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/rlhf.md) + +- 注:PPO 训练教程中 PKU-Alignment/alpaca-7b-reproduced 模型是一个类 llama 模型,但与原生 llama 模型结构存在一定差异,具体为 embedding 层和 lm_head 层 shape 不同,原生 llama 的 shape 为 [4096, 32000],但 PKU-Alignment/alpaca-7b-reproduced 的 shape 为 [4096, 32001]。 diff --git a/llm/auto_parallel/llama/llama_finetune_with_api.sh b/llm/auto_parallel/llama/llama_finetune_with_api.sh index 653939f04bff..545da9a2f5f3 100644 --- a/llm/auto_parallel/llama/llama_finetune_with_api.sh +++ b/llm/auto_parallel/llama/llama_finetune_with_api.sh @@ -47,7 +47,7 @@ python -u -m paddle.distributed.launch \ --log_dir "log/$task_name""_log" \ ../run_finetune_auto.py \ --model_name_or_path "meta-llama/Meta-Llama-3.1-8B-Instruct" \ - --dataset_name_or_path "../fintune_data/data" \ + --dataset_name_or_path "../../finetune_data/data" \ --output_dir "output/$task_name/" \ --enable_auto_parallel true \ --lora false \ diff --git a/llm/auto_parallel/llama/llama_with_api.sh b/llm/auto_parallel/llama/llama_with_api.sh index e4dce9536a4f..ee2dcd55f975 100644 --- a/llm/auto_parallel/llama/llama_with_api.sh +++ b/llm/auto_parallel/llama/llama_with_api.sh @@ -78,7 +78,7 @@ python -u -m 
paddle.distributed.launch \ --max_seq_length 4096 \ --sequence_parallel true \ --sharding "stage1" \ - --sharding_parallel_config "enable_stage1_tensor_fusion enable_stage1_overlap" \ + --sharding_parallel_config "enable_tensor_fusion enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --model_type "llama_network" \ --ignore_load_lr_and_optim true \ diff --git a/llm/auto_parallel/llama/run_llama2_13b_xpu.sh b/llm/auto_parallel/llama/run_llama2_13b_xpu.sh new file mode 100755 index 000000000000..301d19a38bb1 --- /dev/null +++ b/llm/auto_parallel/llama/run_llama2_13b_xpu.sh @@ -0,0 +1,106 @@ +#!/bin/bash + +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +task_name_or_path="llama2-13b-auto" + +#export XPUAPI_DEBUG=0x1 +#export XPURT_DISPATCH_MODE=PROFILING +export XBLAS_FC_HBM_VERSION=40 + +# PaddlePaddle +export FLAGS_use_stride_kernel="0" +export XPU_PADDLE_L3_SIZE=98566144 # 94 MB +export XPU_CDNN_CLUSTER_PARALLEL=1 +export XPU_CDNN_CLUSTER_PARALLEL_STREAM_NUMBER=2 + +# PDC +unset PADDLE_ELASTIC_JOB_ID +unset PADDLE_TRAINER_ENDPOINTS +unset DISTRIBUTED_TRAINER_ENDPOINTS +unset FLAGS_START_PORT +unset PADDLE_ELASTIC_TIMEOUT +unset PADDLE_TRAINERS_NUM + +# BKCL +# export BKCL_DEBUG=1 +# Multi-computer RDMA +#export BKCL_ENABLE_XDR=1 +#export BKCL_RDMA_FORCE_TREE=1 +#export BKCL_TREE_THRESHOLD=0 +#export BKCL_RDMA_NICS=xgbe1,xgbe1,xgbe2,xgbe2,xgbe3,xgbe3,xgbe4,xgbe4 +#export BKCL_SOCKET_IFNAME=xgbe0 +#export BKCL_FORCE_L3_RDMA=0 +export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib64 +echo "bkcl version:" +strings ${bkcl_location}/libbkcl.so | grep COM + +export CUDA_DEVICE_MAX_CONNECTIONS=8 + +#PYTHONPATH +export PYTHONPATH=../../../:$PYTHONPATH + +# for debug +#export GLOG_v=10 +export FLAGS_call_stack_level=2 + +rm -rf output/$task_name_or_path +PYTHONPATH=../:$PYTHONPATH \ +python -u -m paddle.distributed.launch \ + --xpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name_or_path/" \ + run_pretrain_auto.py \ + --model_name_or_path "meta-llama/Llama-2-13b" \ + --tokenizer_name_or_path "meta-llama/Llama-2-13b" \ + --input_dir "./data" \ + --output_dir "output/$task_name_or_path" \ + --split 949,50,1 \ + --max_seq_length 4096 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --use_flash_attention 1 \ + --use_fused_rope 1 \ + --fuse_attention_ffn 1 \ + --fuse_attention_qkv 1 \ + --use_fused_rms_norm 0 \ + --num_hidden_layers 4 \ + --bf16 \ + --fp16_opt_level "O2" \ + --amp_master_grad true \ + --scale_loss 1024 \ + --learning_rate 0.00003 \ + --min_learning_rate 0.000005 \ + --lr_scheduler_type "cosine" \ + --max_steps 10 \ + --save_steps 100000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1 \ + --sequence_parallel 0 \ + --dataloader_num_workers 4 \ + --pipeline_parallel_degree 1 \ + --tensor_parallel_degree 1 \ + --gradient_accumulation_steps 1 \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --continue_training 0 \ 
+ --recompute 0 \ + --do_train \ + --seed 1026 \ + --device "xpu" \ + --enable_auto_parallel 1 \ + --to_static 1 diff --git a/llm/auto_parallel/llama/run_llama3.sh b/llm/auto_parallel/llama/run_llama3.sh index d95d48b8d295..e86fe305a1f8 100644 --- a/llm/auto_parallel/llama/run_llama3.sh +++ b/llm/auto_parallel/llama/run_llama3.sh @@ -92,7 +92,7 @@ python -u -m paddle.distributed.launch \ --sharding "stage2" \ --pipeline_parallel_config "enable_send_recv_overlap" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate" \ - --sharding_parallel_config "enable_stage2_overlap" \ + --sharding_parallel_config "enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --to_static 1 \ --amp_custom_black_list "reduce_sum" "c_softmax_with_cross_entropy" \ diff --git a/llm/auto_parallel/llama/run_pretrain_auto.py b/llm/auto_parallel/llama/run_pretrain_auto.py index 4f51b8c2493e..24e737de544b 100644 --- a/llm/auto_parallel/llama/run_pretrain_auto.py +++ b/llm/auto_parallel/llama/run_pretrain_auto.py @@ -59,6 +59,7 @@ print_rank_0, ) from paddlenlp.trainer.utils.doc import add_start_docstrings +from paddlenlp.utils.tools import get_env_device @dataclass @@ -173,6 +174,11 @@ class ModelArguments: default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} ) + use_fast_layer_norm: bool = field( + default=False, + metadata={"help": "GPT3 model, use fast layernorm"}, + ) + config_name: Optional[str] = field( default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} ) @@ -221,14 +227,6 @@ class ModelArguments: "help": "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models." }, ) - sequence_parallel: bool = field( - default=False, - metadata={"help": "whether to use sequence parallel"}, - ) - fuse_sequence_parallel_allreduce: bool = field( - default=False, - metadata={"help": "whether to use fuse sequence parallel allreduce"}, - ) use_fused_rope: Optional[bool] = field( default=False, metadata={"help": "Enable rope fusion or not."}, @@ -504,6 +502,8 @@ def main(): config = config_class.from_pretrained(model_args.model_name_or_path) + config.use_fast_layer_norm = model_args.use_fast_layer_norm + config.seq_length = data_args.max_seq_length # There are some technique extend RotaryEmbedding context. 
so don't change max_position_embeddings if not model_args.continue_training: @@ -534,8 +534,10 @@ def main(): config.fuse_attention_ffn = model_args.fuse_attention_ffn config.recompute_granularity = model_args.recompute_granularity config.virtual_pp_degree = model_args.virtual_pp_degree - config.sequence_parallel = model_args.sequence_parallel - config.fuse_sequence_parallel_allreduce = model_args.fuse_sequence_parallel_allreduce + config.sequence_parallel = training_args.sequence_parallel + + config.fuse_sequence_parallel_allreduce = training_args.fuse_sequence_parallel_allreduce + config.use_fused_rope = model_args.use_fused_rope config.no_recompute_layers = model_args.no_recompute_layers config.pp_recompute_interval = model_args.pp_recompute_interval @@ -550,6 +552,15 @@ def main(): pipeline = training_args.strategy.pipeline pipeline.vpp_degree = config.virtual_pp_degree pipeline.vpp_seg_method = training_args.virtual_pipeline_seg_method + if get_env_device() == "xpu" and training_args.gradient_accumulation_steps > 1: + try: + from paddle_xpu.layers.nn.linear import LinearConfig # noqa: F401 + + LinearConfig.enable_accumulate_steps_opt() + LinearConfig.set_accumulate_steps(training_args.gradient_accumulation_steps) + except ImportError: + # It's OK, not use accumulate_steps optimization + pass print("Final pre-training config:", config) @@ -608,7 +619,6 @@ def fn(layer): tokenizer, need_data=training_args.should_load_dataset, ) - trainer = PretrainingTrainer( model=model, criterion=criterion, @@ -618,7 +628,6 @@ def fn(layer): eval_dataset=eval_dataset if training_args.do_eval else None, optimizers=(None, lr_scheduler), tokenizer=tokenizer, - model_args=model_args, ) checkpoint = None diff --git a/llm/auto_parallel/qwen/run_pretrain_3D_auto.py b/llm/auto_parallel/qwen/run_pretrain_3D_auto.py index 06d698494a19..67cf3b51e815 100644 --- a/llm/auto_parallel/qwen/run_pretrain_3D_auto.py +++ b/llm/auto_parallel/qwen/run_pretrain_3D_auto.py @@ -225,14 +225,6 @@ class ModelArguments: "help": "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models." 
}, ) - sequence_parallel: bool = field( - default=False, - metadata={"help": "whether to use sequence parallel"}, - ) - fuse_sequence_parallel_allreduce: bool = field( - default=False, - metadata={"help": "whether to use fuse sequence parallel allreduce"}, - ) use_fused_rope: Optional[bool] = field( default=False, metadata={"help": "Enable rope fusion or not."}, @@ -513,8 +505,8 @@ def main(): config.fuse_attention_ffn = model_args.fuse_attention_ffn config.recompute_granularity = model_args.recompute_granularity config.virtual_pp_degree = model_args.virtual_pp_degree - config.sequence_parallel = model_args.sequence_parallel - config.fuse_sequence_parallel_allreduce = model_args.fuse_sequence_parallel_allreduce + config.sequence_parallel = training_args.sequence_parallel + config.fuse_sequence_parallel_allreduce = training_args.fuse_sequence_parallel_allreduce config.use_fused_rope = model_args.use_fused_rope config.no_recompute_layers = model_args.no_recompute_layers config.pp_recompute_interval = model_args.pp_recompute_interval diff --git a/llm/auto_parallel/run_finetune_auto.py b/llm/auto_parallel/run_finetune_auto.py index 7b7d920b7930..2056f314c1ff 100644 --- a/llm/auto_parallel/run_finetune_auto.py +++ b/llm/auto_parallel/run_finetune_auto.py @@ -267,15 +267,13 @@ def main(): if ( model_args.continue_training and not training_args.autotuner_benchmark - and not training_args.use_intermediate_api ): - with paddle.LazyGuard(): - criterion = criterion_class(model_config) - model = model_class.from_pretrained( - model_args.model_name_or_path, - config=model_config, - from_aistudio=model_args.from_aistudio, - ) + criterion = criterion_class(model_config) + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=model_config, + from_aistudio=model_args.from_aistudio, + ) else: with paddle.LazyGuard(): criterion = criterion_class(model_config) @@ -489,7 +487,7 @@ def compute_metrics_do_generation(eval_preds): # layer.register_forward_pre_hook(forward_pre_hook) # layer.register_forward_post_hook(forward_post_hook) # Train - print(trainer.model) + if training_args.do_train: checkpoint = None if training_args.resume_from_checkpoint is not None: diff --git a/llm/auto_parallel/utils/convert_to_safetensors.py b/llm/auto_parallel/utils/convert_to_safetensors.py index 6f000e1e8955..c07e91af18b2 100644 --- a/llm/auto_parallel/utils/convert_to_safetensors.py +++ b/llm/auto_parallel/utils/convert_to_safetensors.py @@ -19,10 +19,12 @@ from safetensors.numpy import save_file as safe_save_file from paddlenlp.transformers.utils import dtype_byte_size -from paddlenlp.utils.env import SAFE_WEIGHTS_INDEX_NAME +from paddlenlp.utils.env import PADDLE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME -def convert_to_unified_ckpt(path: str, output_dir: str = "./tmp", split_num: int = 1, offload: bool = False): +def convert_to_unified_ckpt( + path: str, output_dir: str = "./tmp", split_num: int = 1, offload: bool = False, as_safetensors: bool = False +): """ Convert a single card checkpoint to the unified format. @@ -31,9 +33,10 @@ def convert_to_unified_ckpt(path: str, output_dir: str = "./tmp", split_num: int output_dir (str, optional): The directory where the converted files will be saved. Defaults to ".". split_num (int, optional): The number of shards to split the weights into output_dir. Defaults to 1. offload (bool, optional): Whether to offload the weights to CPU memory before saving them. Defaults to False. + as_safetensors (bool, optional): Whether to save the weights as safetensors. 
Defaults to False. """ - def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file, total_size): + def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file, total_size, as_safetensors): """ Get the sub-state dict and update the index weight file and total size. Args: @@ -42,8 +45,12 @@ def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file, weight_filename (str): The filename of the corresponding weight file. index_weight_file (dict): The dictionary containing the mapping from keys to their corresponding weight filenames. total_size (int): The total size of the model so far. + as_safetensors (bool): Whether to save the weights as safetensors. """ - sub_state_dict = {key: state_dict[key].numpy() for key in sub_keys} + if as_safetensors: + sub_state_dict = {key: state_dict[key].numpy() for key in sub_keys} + else: + sub_state_dict = {key: state_dict[key] for key in sub_keys} for key in sub_keys: index_weight_file[key] = weight_filename total_size += state_dict[key].numel().item() * dtype_byte_size(state_dict[key].dtype) @@ -65,12 +72,21 @@ def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file, current_size = split_size + (1 if rank < extra_keys else 0) sub_keys = all_keys[index : index + current_size] index += current_size - weight_filename = f"model-{rank+1:04d}-of-{split_num:04d}.safetensors" + if as_safetensors: + weight_filename = f"model-{rank+1:04d}-of-{split_num:04d}.safetensors" + else: + weight_filename = f"model_state-{rank+1:04d}-of-{split_num:04d}.pdparams" sub_state_dict, total_size = get_sub_state_dict( - sub_keys, state_dict, weight_filename, index_weight_file, total_size + sub_keys, state_dict, weight_filename, index_weight_file, total_size, as_safetensors ) - safe_save_file(sub_state_dict, os.path.join(output_dir, weight_filename)) - with open(os.path.join(output_dir, SAFE_WEIGHTS_INDEX_NAME), "w") as f: + if as_safetensors: + safe_save_file(sub_state_dict, os.path.join(output_dir, weight_filename), metadata={"format": "np"}) + index_file_name = SAFE_WEIGHTS_INDEX_NAME + else: + paddle.save(sub_state_dict, os.path.join(output_dir, weight_filename)) + index_file_name = PADDLE_WEIGHTS_INDEX_NAME + + with open(os.path.join(output_dir, index_file_name), "w") as f: json.dump({"metadata": {"total_size": total_size}, "weight_map": index_weight_file}, f, indent=4) @@ -86,7 +102,10 @@ def get_sub_state_dict(sub_keys, state_dict, weight_filename, index_weight_file, "--split_num", type=int, default=1, help="The number of shards to split the weights into output_dir." ) parser.add_argument( - "--offload", type=bool, help="Whether to offload the weights to CPU memory before saving them." + "--offload", action="store_true", help="Whether to offload the weights to CPU memory before saving them." + ) + parser.add_argument( + "--as_safetensors", action="store_true", help="Save the weights as safetensors instead of pdparams." 
) args = parser.parse_args() - convert_to_unified_ckpt(args.input_path, args.output_dir, args.split_num, args.offload) + convert_to_unified_ckpt(args.input_path, args.output_dir, args.split_num, args.offload, args.as_safetensors) diff --git a/llm/config/deepseek-v2/pretrain_argument.json b/llm/config/deepseek-v2/pretrain_argument.json index 9bc889e13f85..8ab15be1f5d9 100644 --- a/llm/config/deepseek-v2/pretrain_argument.json +++ b/llm/config/deepseek-v2/pretrain_argument.json @@ -4,10 +4,10 @@ "input_dir": "./data", "output_dir": "./checkpoints/pretrain_ckpts", "per_device_train_batch_size": 1, - "gradient_accumulation_steps": 1, + "gradient_accumulation_steps": 32, "per_device_eval_batch_size": 1, "tensor_parallel_degree": 1, - "pipeline_parallel_degree": 1, + "pipeline_parallel_degree": 8, "sharding_parallel_degree": 1, "sharding": "stage2", "virtual_pp_degree": 1, diff --git a/llm/config/llama/dpo_argument.json b/llm/config/llama/dpo_argument.json index 60065fbb7a1d..510cd2d475d2 100644 --- a/llm/config/llama/dpo_argument.json +++ b/llm/config/llama/dpo_argument.json @@ -1,5 +1,5 @@ { - "model_name_or_path": "meta-llama/Meta-Llama-3-8B", + "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct", "train_dataset_path": "./data/train.jsonl", "dev_dataset_path": "./data/dev.jsonl", "output_dir": "./checkpoints/dpo_ckpts", diff --git a/llm/config/llama/pretrain_argument.json b/llm/config/llama/pretrain_argument.json index dff5b322337e..304b6d7822a2 100644 --- a/llm/config/llama/pretrain_argument.json +++ b/llm/config/llama/pretrain_argument.json @@ -28,7 +28,7 @@ "warmup_ratio": 0.01, "max_grad_norm": 1.0, "dataloader_num_workers": 1, - "continue_training": 1, + "continue_training": 0, "do_train": true, "do_eval": true, "do_predict": true, diff --git a/llm/config/qwen/dpo_argument_0p5b.json b/llm/config/qwen/dpo_argument_0p5b.json new file mode 100644 index 000000000000..4799d83bda6e --- /dev/null +++ b/llm/config/qwen/dpo_argument_0p5b.json @@ -0,0 +1,39 @@ +{ + "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", + "train_dataset_path": "./data/train.jsonl", + "dev_dataset_path": "./data/dev.jsonl", + "output_dir": "./checkpoints/dpo_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 8, + "per_device_eval_batch_size": 1, + "num_train_epochs": 1, + "max_steps": 100, + "learning_rate": 1e-06, + "warmup_steps": 10, + "logging_steps": 1, + "evaluation_strategy": "steps", + "save_strategy": "steps", + "eval_steps": 100, + "save_steps": 500, + "max_seq_len": 2048, + "max_prompt_len": 1024, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "tensor_parallel_degree": 1, + "sharding": "stage1", + "use_flash_attention": false, + "flash_mask": false, + "recompute": true, + "recompute_granularity": "full", + "benchmark": false, + "unified_checkpoint": true, + "autotuner_benchmark":false, + "beta": 0.1, + "loss_type": "sigmoid", + "greedy_zero_padding": false, + "label_smoothing": 0.0 + } diff --git a/llm/config/qwen/lora_argument.json b/llm/config/qwen/lora_argument.json index aeb0d5d61f92..a00845c2263a 100644 --- a/llm/config/qwen/lora_argument.json +++ b/llm/config/qwen/lora_argument.json @@ -4,7 +4,7 @@ "output_dir": "./checkpoints/lora_ckpts", "per_device_train_batch_size": 4, "gradient_accumulation_steps": 4, - "per_device_eval_batch_size": 8, + "per_device_eval_batch_size": 4, "eval_accumulation_steps":16, "num_train_epochs": 3, "learning_rate": 3e-04, diff --git 
a/llm/config/qwen/lora_argument_0p5b.json b/llm/config/qwen/lora_argument_0p5b.json new file mode 100644 index 000000000000..88014ac90268 --- /dev/null +++ b/llm/config/qwen/lora_argument_0p5b.json @@ -0,0 +1,34 @@ +{ + "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/lora_ckpts", + "per_device_train_batch_size": 2, + "gradient_accumulation_steps": 8, + "per_device_eval_batch_size": 2, + "eval_accumulation_steps": 32, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true, + "unified_checkpoint": true, + "zero_padding": false, + "use_flash_attention": false, + "pissa": false + } diff --git a/llm/config/qwen/pretrain_argument_0p5b.json b/llm/config/qwen/pretrain_argument_0p5b.json new file mode 100644 index 000000000000..a0e2ff37c3d2 --- /dev/null +++ b/llm/config/qwen/pretrain_argument_0p5b.json @@ -0,0 +1,40 @@ +{ + "model_name_or_path": "Qwen/Qwen2.5-0.5B", + "tokenizer_name_or_path": "Qwen/Qwen2.5-0.5B", + "input_dir": "./data", + "output_dir": "./checkpoints/pretrain_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 1, + "per_device_eval_batch_size": 2, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "sharding": "stage2", + "virtual_pp_degree": 1, + "sequence_parallel": 0, + "use_flash_attention": false, + "use_fused_rms_norm": false, + "max_seq_length": 1024, + "learning_rate": 3e-05, + "min_learning_rate": 3e-06, + "warmup_steps": 30, + "logging_steps": 1, + "max_steps": 10000, + "save_steps": 5000, + "eval_steps": 1000, + "weight_decay": 0.01, + "fp16": true, + "fp16_opt_level": "O2", + "warmup_ratio": 0.01, + "max_grad_norm": 1.0, + "dataloader_num_workers": 1, + "continue_training": 0, + "do_train": true, + "do_eval": true, + "do_predict": true, + "disable_tqdm": true, + "recompute": false, + "distributed_dataloader": 1, + "recompute_granularity": "full", + "unified_checkpoint": true, + "save_total_limit": 2 + } diff --git a/llm/config/qwen/pt_argument.json b/llm/config/qwen/pt_argument.json index b70e4a144c75..85ecd8ab004c 100644 --- a/llm/config/qwen/pt_argument.json +++ b/llm/config/qwen/pt_argument.json @@ -4,8 +4,8 @@ "output_dir": "./checkpoints/pt_ckpts", "per_device_train_batch_size": 4, "gradient_accumulation_steps": 4, - "per_device_eval_batch_size": 8, - "eval_accumulation_steps":16, + "per_device_eval_batch_size": 4, + "eval_accumulation_steps": 32, "num_train_epochs": 3, "learning_rate": 3e-02, "warmup_steps": 30, diff --git a/llm/config/qwen/pt_argument_0p5b.json b/llm/config/qwen/pt_argument_0p5b.json new file mode 100644 index 000000000000..4ebb18ace09c --- /dev/null +++ b/llm/config/qwen/pt_argument_0p5b.json @@ -0,0 +1,31 @@ +{ + "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/pt_ckpts", + "per_device_train_batch_size": 2, + "gradient_accumulation_steps": 8, + "per_device_eval_batch_size": 4, + "eval_accumulation_steps": 32, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + 
"logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true, + "use_flash_attention": false + } diff --git a/llm/config/qwen/sft_argument_0p5b.json b/llm/config/qwen/sft_argument_0p5b.json new file mode 100644 index 000000000000..e5f05bc5e2cd --- /dev/null +++ b/llm/config/qwen/sft_argument_0p5b.json @@ -0,0 +1,33 @@ +{ + "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/sft_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "sharding": "stage2", + "zero_padding": false, + "unified_checkpoint": true, + "use_flash_attention": false + } diff --git a/llm/config/qwen/sft_argument_0p5b_best.json b/llm/config/qwen/sft_argument_0p5b_best.json new file mode 100644 index 000000000000..5ad6b466aab0 --- /dev/null +++ b/llm/config/qwen/sft_argument_0p5b_best.json @@ -0,0 +1,37 @@ +{ + "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/sft_ckpts", + "per_device_train_batch_size": 2, + "gradient_accumulation_steps": 2, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "sharding": "stage2", + "zero_padding": true, + "unified_checkpoint": true, + "fuse_attention_qkv": true, + "fuse_attention_ffn": true, + "use_fused_rms_norm": true, + "use_fused_rope": true, + "use_fused_linear_cross_entropy": true, + "use_flash_attention": true + } diff --git a/llm/docs/predict/best_practices.md b/llm/docs/predict/best_practices.md index 77b29fcb5ebe..31c9382a7c4c 100644 --- a/llm/docs/predict/best_practices.md +++ b/llm/docs/predict/best_practices.md @@ -1,4 +1,4 @@ -# 最佳实践 +# 高性能推理最佳实践 PaddleNLP 提供了多种环境变量,用于优化推理性能和资源使用。下面提供一些调整 PaddleNLP 推理性能的最佳实践。 @@ -29,6 +29,6 @@ PaddleNLP 提供了多种环境变量,用于优化推理性能和资源使用 **Append Attention 优化** -- `FLAGS_cascade_attention_max_partition_size`:Append Attention decoder计算时对cache_kv进行分chunk的chunk大小,默认值根据batchsize设置,batchsize=1时设置为128,batchsize>1时设置为512。显式设置时不再区分batchsize。 -- `FLAGS_dec_block_shape_q`:Append Attention 
decoder计算时对q进行分块的分块大小,默认值为16。 -- `FLAGS_enc_block_shape_q`:Append Attention encoder计算时对q进行分块的分块大小,默认值为64。 +- `FLAGS_cascade_attention_max_partition_size`:Append Attention decoder 计算时对 cache_kv 进行分 chunk 的 chunk 大小,默认值根据 batchsize 设置,batchsize=1时设置为128,batchsize>1时设置为512。显式设置时不再区分 batchsize。 +- `FLAGS_dec_block_shape_q`:Append Attention decoder 计算时对 q 进行分块的分块大小,默认值为16。 +- `FLAGS_enc_block_shape_q`:Append Attention encoder 计算时对 q 进行分块的分块大小,默认值为64。 diff --git a/llm/docs/predict/inference.md b/llm/docs/predict/inference.md index 9c3439682573..2c1dcecd35a1 100644 --- a/llm/docs/predict/inference.md +++ b/llm/docs/predict/inference.md @@ -25,6 +25,7 @@ PaddleNLP 大模型推理提供压缩、推理、服务全流程体验 : ## 1. 模型支持 PaddleNLP 中已经添加高性能推理模型相关实现,已验证过的模型如下: + | Models | Example Models | |--------|----------------| |Llama 3.x, Llama 2|`meta-llama/Llama-3.2-3B-Instruct`, `meta-llama/Meta-Llama-3.1-8B`, `meta-llama/Meta-Llama-3.1-8B-Instruct`, `meta-llama/Meta-Llama-3.1-405B`, `meta-llama/Meta-Llama-3.1-405B-Instruct`,`meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-Guard-3-8B`, `Llama-2-7b, meta-llama/Llama-2-7b-chat`, `meta-llama/Llama-2-13b`, `meta-llama/Llama-2-13b-chat`, `meta-llama/Llama-2-70b`, `meta-llama/Llama-2-70b-chat`| @@ -170,6 +171,77 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat -- 2. `a8w8`与`a8w8_fp8`需要额外的 act 和 weight 的 scale 校准表,推理传入的 `model_name_or_path` 为 PTQ 校准产出的量化模型。量化模型导出参考[大模型量化教程](../quantization.md)。 3. `cachekv_int8_type`可选`dynamic`(已不再维护,不建议使用)和`static`两种,`static`需要额外的 cache kv 的 scale 校准表,传入的 `model_name_or_path` 为 PTQ 校准产出的量化模型。量化模型导出参考[大模型量化教程](../quantization.md)。 + +## 5. 服务化部署 + +**高性能服务化部署请参考**:[静态图服务化部署教程](../../server/docs/deploy_usage_tutorial.md)。 + +如果您想简单体验模型,我们提供了**简易的 Flash Server 动态图部署**方式,我们提供了一套基于动态图推理的简单易用 UI 服务化部署方法,用户可以快速部署服务化推理。 + +环境准备 + +- python >= 3.9 +- gradio +- flask + +服务化部署脚本 + +```shell +# 单卡,可以使用 paddle.distributed.launch 启动多卡推理 +python ./predict/flask_server.py \ + --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \ + --port 8010 \ + --flask_port 8011 \ + --dtype "float16" +``` + +- `port`: Gradio UI 服务端口号,默认8010。 +- `flask_port`: Flask 服务端口号,默认8011。 + +图形化界面: 打开 `http://127.0.0.1:8010` 即可使用 gradio 图形化界面,即可开启对话。 +API 访问: 您也可用通过 flask 服务化 API 的形式. + +1. 可参考:`./predict/request_flask_server.py` 文件。 +```shell +python predict/request_flask_server.py +``` + +2. 
或者直接使用 curl,调用开始对话 +```shell +curl 127.0.0.1:8011/v1/chat/completions \ +-H 'Content-Type: application/json' \ +-d '{"message": [{"role": "user", "content": "你好"}]}' +``` +3.使用 OpenAI 客户端调用: +```python +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:8011/v1/", +) + +# Completion API +stream = True +completion = client.chat.completions.create( + model="paddlenlp", + messages=[ + {"role": "user", "content": "PaddleNLP好厉害!这句话的感情色彩是?"} + ], + max_tokens=1024, + stream=stream, +) + +if stream: + for c in completion: + print(c.choices[0].delta.content, end="") +else: + print(completion.choices[0].message.content) +``` +该方式部署,性能一般,高性能服务化部署请参考:[静态图服务化部署教程](../../server/docs/deploy_usage_tutorial.md)。 + + + 更多大模型推理教程: - [llama](./llama.md) @@ -188,7 +260,7 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat -- 更多压缩、服务化推理体验: - [大模型量化教程](../quantization.md) -- [服务化部署教程](https://github.com/PaddlePaddle/FastDeploy/blob/develop/README_CN.md) +- [静态图服务化部署教程](../../server/docs/deploy_usage_tutorial.md) 更多硬件大模型推理教程: diff --git a/llm/docs/predict/installation.md b/llm/docs/predict/installation.md index 4d077c1c9ed6..c1a57f6adf78 100644 --- a/llm/docs/predict/installation.md +++ b/llm/docs/predict/installation.md @@ -1,4 +1,4 @@ -# 安装 +# 高性能推理算子安装 git clone 代码到本地: @@ -7,17 +7,17 @@ git clone https://github.com/PaddlePaddle/PaddleNLP.git export PYTHONPATH=/path/to/PaddleNLP:$PYTHONPATH ``` -PaddleNLP 针对于Transformer 系列编写了高性能自定义算子,提升模型在推理和解码过程中的性能,使用之前需要预先安装自定义算子库: +PaddleNLP 针对于 Transformer 系列编写了高性能自定义算子,提升模型在推理和解码过程中的性能,使用之前需要预先安装自定义算子库: ```shell #GPU设备安装自定义算子 cd PaddleNLP/csrc && python setup_cuda.py install #XPU设备安装自定义算子 -cd PaddleNLP/csrc/xpu/src && sh cmake_build.sh +# cd PaddleNLP/csrc/xpu/src && sh cmake_build.sh #DCU设备安装自定义算子 -cd PaddleNLP/csrc && python setup_hip.py install +# cd PaddleNLP/csrc && python setup_hip.py install #SDAA设备安装自定义算子 -cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install +# cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install ``` 到达运行目录,即可开始: diff --git a/llm/docs/predict/mixtral.md b/llm/docs/predict/mixtral.md index 0ec0e8811310..1a3c1269a70d 100644 --- a/llm/docs/predict/mixtral.md +++ b/llm/docs/predict/mixtral.md @@ -12,7 +12,7 @@ |Model| |:-| -|mistralai/Mixtral-8x7B-v0.1-Instruct| +|mistralai/Mixtral-8x7B-Instruct-v0.1| ## 模型推理 diff --git a/llm/predict/flask_server.py b/llm/predict/flask_server.py index d467d6dac688..6a845b8e7ebd 100644 --- a/llm/predict/flask_server.py +++ b/llm/predict/flask_server.py @@ -11,11 +11,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
+ from __future__ import annotations import json import os import socket +import time from contextlib import closing from dataclasses import asdict, dataclass, field from time import sleep @@ -44,14 +46,13 @@ def __free_port(port): try: s.bind(("", port)) return port - except: + except Exception: return -1 for port in range(port_l, port_u): - port = __free_port(port) - if port != -1: - return port - + free = __free_port(port) + if free != -1: + return free return -1 @@ -66,17 +67,15 @@ class ServerArgument: class PredictorServer: def __init__(self, args: ServerArgument, predictor: BasePredictor): - self.predictor = predictor self.args = args scan_l, scan_u = ( self.args.flask_port + port_interval * predictor.tensor_parallel_rank, self.args.flask_port + port_interval * (predictor.tensor_parallel_rank + 1), ) - self.total_max_length = predictor.config.src_length + predictor.config.max_length + self.total_max_length = predictor.config.total_max_length if self.predictor.tensor_parallel_rank == 0: - # fetch port info self.port = find_free_ports(scan_l, scan_u) self.peer_ports = {} while True and self.predictor.tensor_parallel_degree > 1: @@ -84,120 +83,205 @@ def __init__(self, args: ServerArgument, predictor: BasePredictor): with FileLock(FILE_LOCK), open(PORT_FILE, "r") as f: cnt = 1 for line in f: - data = json.loads(line) - self.peer_ports[data["rank"]] = data["port"] + port_data = json.loads(line) + self.peer_ports[port_data["rank"]] = port_data["port"] cnt += 1 - if cnt == predictor.tensor_parallel_degree: break else: print("waiting for port reach", cnt) sleep(1) else: - # save port info self.port = find_free_ports(scan_l, scan_u) data = {"rank": predictor.tensor_parallel_rank, "port": self.port} with FileLock(FILE_LOCK), open(PORT_FILE, "a") as f: f.write(json.dumps(data) + "\n") - print("rank: ", predictor.tensor_parallel_rank, " port info saving done.") + print("rank:", predictor.tensor_parallel_rank, " port info saving done.") + + def stream_predict(self, input_texts: str | list[str]): + if hasattr(self.predictor, "stream_predict"): + return self.predictor.stream_predict(input_texts) + else: + return self.predictor.predict(input_texts) def predict(self, input_texts: str | list[str]): - return self.predictor.stream_predict(input_texts) + return self.predictor.predict(input_texts) def broadcast_msg(self, data): + import threading + + def send_request(peer_port, data): + try: + url = f"http://0.0.0.0:{peer_port}/v1/chat/completions" + requests.post(url, json=data) + except Exception: + pass + for _, peer_port in self.peer_ports.items(): if peer_port != self.port: - _ = requests.post(f"http://0.0.0.0:{peer_port}/api/chat", json=data) + logger.info(f"broadcast_msg to {peer_port}") + # Here we need async call send_request to other card. 
+ thread = threading.Thread(target=send_request, args=(peer_port, data)) + thread.start() def start_flask_server(self): from flask import Flask, request, stream_with_context app = Flask(__name__) - @app.post("/api/chat") + @app.post("/v1/chat/completions") def _server(): data = request.get_json() - logger.info(f"Request: {json.dumps(data, indent=2, ensure_ascii=False)}") if self.predictor.tensor_parallel_rank == 0: self.broadcast_msg(data) + logger.info(f"Request: {json.dumps(data, indent=2, ensure_ascii=False)}") - def streaming(data): - query = data.pop("context", "") - history = data.pop("history", "") - data.pop("extra_info", None) - - # build chat template - if self.predictor.tokenizer.chat_template is not None: - if not history: - history = [] - # also support history data - elif isinstance(history, str): + # 处理 OpenAI 格式消息(支持 messages 字段)以及兼容原有格式 + if "messages" in data: + messages = data["messages"] + if not messages: + return json.dumps({"error": "Empty messages"}), 400 + if messages[-1].get("role") == "user": + query = messages[-1].get("content", "") + history = [] + if len(messages) > 1: + temp = [] + for msg in messages[:-1]: + if msg.get("role") in ["user", "assistant"]: + temp.append(msg.get("content", "")) + if len(temp) % 2 != 0: + temp = temp[1:] + history = temp + else: + query = "" + history = [msg.get("content", "") for msg in messages if msg.get("role") in ["user", "assistant"]] + data["context"] = query + data["history"] = history + else: + data["context"] = data.get("context", "") + data["history"] = data.get("history", "") + + # 判断是否采用流式返回,默认为非流式(可根据需求调整默认值) + is_stream = data.get("stream", False) + + # 统一对 context/history 做处理,兼容 chat_template 格式 + def process_input(query, history): + if isinstance(history, str): + try: history = json.loads(history) - - assert len(history) % 2 == 0 - chat_query = [] + except Exception: + history = [history] + # 如果模型支持 chat_template,则转换为消息格式处理 + if self.predictor.tokenizer.chat_template is not None: + messages = [] for idx in range(0, len(history), 2): - if isinstance(history[idx], str): - chat_query.append([history[idx], history[idx + 1]]) - elif isinstance(history[idx], dict): - chat_query.append([history[idx]["utterance"], history[idx + 1]["utterance"]]) - else: - raise ValueError( - "history data should be list[str] or list[dict], eg: ['sentence-1', 'sentece-2', ...], or " - "[{'utterance': 'sentence-1'}, {'utterance': 'sentence-2'}, ...]" + user_msg = history[idx] if isinstance(history[idx], str) else history[idx].get("utterance", "") + messages.append({"role": "user", "content": user_msg}) + if idx + 1 < len(history): + assistant_msg = ( + history[idx + 1] + if isinstance(history[idx + 1], str) + else history[idx + 1].get("utterance", "") ) + messages.append({"role": "assistant", "content": assistant_msg}) + messages.append({"role": "user", "content": query}) + return messages + return query + + # 提取生成参数 + generation_args = data.copy() + query = generation_args.pop("context", "") + history = generation_args.pop("history", []) + query = process_input(query, history) + + # 更新生成相关配置参数 + self.predictor.config.max_length = generation_args.get( + "max_tokens", generation_args.get("max_length", self.predictor.config.max_length) + ) + if "src_length" in generation_args: + self.predictor.config.src_length = generation_args["src_length"] + + if self.predictor.config.src_length + self.predictor.config.max_length > self.total_max_length: + output = { + "error_code": 1, + "error_msg": ( + f"The sum of 
src_length<{self.predictor.config.src_length}> and max_length<{self.predictor.config.max_length}> " + f"should be smaller than or equal to the max-total-length<{self.total_max_length}>" + ), + } + return json.dumps(output, ensure_ascii=False), 400 + + self.predictor.config.top_p = generation_args.get("top_p", self.predictor.config.top_p) + self.predictor.config.temperature = generation_args.get("temperature", self.predictor.config.temperature) + self.predictor.config.top_k = generation_args.get("top_k", self.predictor.config.top_k) + self.predictor.config.repetition_penalty = generation_args.get( + "repetition_penalty", self.predictor.config.repetition_penalty + ) + + for key, value in generation_args.items(): + setattr(self.args, key, value) + + # 根据是否流式返回选择不同处理方式 + if is_stream: + # 流式返回生成结果 + def streaming(data): + streamer = self.stream_predict(query) + if self.predictor.tensor_parallel_rank != 0: + return "done" - # the input of predictor should be batched. - # batched query: [ [[user, bot], [user, bot], ..., [user]] ] - query = [chat_query + [[query]]] - - generation_args = data - self.predictor.config.max_length = generation_args["max_length"] - if "src_length" in generation_args: - self.predictor.config.src_length = generation_args["src_length"] - - if self.predictor.config.src_length + self.predictor.config.max_length > self.total_max_length: - output = { - "error_code": 1, - "error_msg": f"The sum of src_length<{self.predictor.config.src_length}> and " - f"max_length<{self.predictor.config.max_length}> should be smaller than or equal to " - f"the max-total-length<{self.total_max_length}>", - } - yield json.dumps(output, ensure_ascii=False) + "\n" - return - - self.predictor.config.top_p = generation_args["top_p"] - self.predictor.config.temperature = generation_args["temperature"] - self.predictor.config.top_k = generation_args["top_k"] - self.predictor.config.repetition_penalty = generation_args["repetition_penalty"] - - for key, value in generation_args.items(): - setattr(self.args, key, value) - - streamer = self.predict(query) - if self.predictor.tensor_parallel_rank == 0: for new_text in streamer: if not new_text: continue - - output = { - "error_code": 0, - "error_msg": "Success", - "result": {"response": {"role": "bot", "utterance": new_text}}, + response_body = { + "id": "YouID", + "object": "chat.completion", + "created": int(time.time()), + "model": self.args.model_name_or_path, + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": new_text, + }, + "finish_reason": "stop", + } + ], } - yield json.dumps(output, ensure_ascii=False) + "\n" - else: - return "done" + yield f"data: {json.dumps(response_body, ensure_ascii=False)}\n\n" + yield "data: [DONE]\n\n" - return app.response_class(stream_with_context(streaming(data))) + return app.response_class(stream_with_context(streaming(data)), mimetype="text/event-stream") + + else: + # 非流式:一次性返回完整结果 + result = self.predict(query) + if self.predictor.tensor_parallel_rank == 0: + if type(result) is list and len(result) == 1: + result = result[0] + response_body = { + "id": "YouID", + "object": "chat.completion", + "created": int(time.time()), + "model": self.args.model_name_or_path, + "choices": [ + { + "index": 0, + "message": {"role": "assistant", "content": result}, + "finish_reason": "stop", + } + ], + } + data = f"{json.dumps(response_body, ensure_ascii=False)}" + return app.response_class(data, mimetype="application/json") + else: + return app.response_class("done") - # set single thread to do 
prediction - # refer to: https://github.com/pallets/flask/blob/main/src/flask/app.py#L605 + # 启动 Flask 服务(单线程预测) app.run(host="0.0.0.0", port=self.port, threaded=False) def start_ui_service(self, args, predictor_args): - # do not support start ui service in one command from multiprocessing import Process from gradio_ui import main @@ -208,17 +292,16 @@ def start_ui_service(self, args, predictor_args): if __name__ == "__main__": - parser = PdArgumentParser((PredictorArgument, ModelArgument, ServerArgument)) predictor_args, model_args, server_args = parser.parse_args_into_dataclasses() - # check port + server_args.model_name_or_path = predictor_args.model_name_or_path + if server_args.base_port is not None: logger.warning("`--base_port` is deprecated, please use `--flask_port` instead after 2023.12.30.") - if server_args.flask_port is None: server_args.flask_port = server_args.base_port else: - logger.warning("`--base_port` and `--flask_port` are both set, `--base_port` will be ignored.") + logger.warning("Both `--base_port` and `--flask_port` are set; `--base_port` will be ignored.") log_dir = os.getenv("PADDLE_LOG_DIR", "./") PORT_FILE = os.path.join(log_dir, PORT_FILE) @@ -226,10 +309,10 @@ def start_ui_service(self, args, predictor_args): os.remove(PORT_FILE) predictor = create_predictor(predictor_args, model_args) - - server = PredictorServer(server_args, predictor) - + server = PredictorServer( + server_args, + predictor, + ) if server.predictor.tensor_parallel_rank == 0: server.start_ui_service(server_args, asdict(predictor.config)) - server.start_flask_server() diff --git a/llm/predict/gradio_ui.py b/llm/predict/gradio_ui.py index 5e43ec8ae12b..9dceb705710d 100644 --- a/llm/predict/gradio_ui.py +++ b/llm/predict/gradio_ui.py @@ -1,3 +1,4 @@ +#!/usr/bin/env python # Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -15,17 +16,31 @@ from __future__ import annotations import argparse -import copy import json +import logging +import re import gradio as gr import requests +logger = logging.getLogger(__name__) +logger.setLevel(logging.DEBUG) +console_handler = logging.StreamHandler() +console_handler.setLevel(logging.INFO) +formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s") +console_handler.setFormatter(formatter) +logger.addHandler(console_handler) + def setup_args(): """Setup arguments.""" parser = argparse.ArgumentParser() parser.add_argument("--port", type=int, default=8073) + parser.add_argument("--api_key", type=str, default=None, help="Your API key") + parser.add_argument("--model", type=str, default="", help="Model name") + parser.add_argument("--title", type=str, default="PaddleNLP Chat", help="UI Title") + parser.add_argument("--sub_title", type=str, default="powered by paddlenlp team.", help="UI Sub Title") + parser.add_argument("--flask_port", type=int, default=None, help="The port of flask service") args = parser.parse_args() return args @@ -52,137 +67,147 @@ def create_max_slider(value, maximum): ) +def remove_think_tags(text): + """ + 清除文本中 标签之间的所有字符。 + + Args: + text: 要处理的文本字符串。 + + Returns: + 清除 标签之间内容的文本字符串。 + """ + pattern = re.compile(r"\\.*?\\<\\\/think\\>", re.DOTALL) + # 将匹配到的部分替换为空字符串 + cleaned_text = pattern.sub("", text).strip() + return cleaned_text + + def launch(args, default_params: dict = {}): - """Launch characters dialogue demo.""" + """Launch chat UI with OpenAI API.""" def rollback(state): """Rollback context.""" context = state.setdefault("context", []) - utterance = context[-2]["utterance"] - context = context[:-2] - state["context"] = context - shown_context = get_shown_context(context) - return utterance, shown_context, context, state + # 回退时移除最后一次对话 + if len(context) >= 2: + content = context[-2]["content"] + context = context[:-2] + state["context"] = context + shown_context = get_shown_context(context) + return content, shown_context, context, state + else: + gr.Warning("没有可撤回的对话历史") + return None, get_shown_context(context), context, state - def regen(state, top_k, top_p, temperature, repetition_penalty, max_length, src_length): + def regen(state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length): """Regenerate response.""" context = state.setdefault("context", []) if len(context) < 2: - gr.Warning("don't have chat history") + gr.Warning("No chat history!") shown_context = get_shown_context(context) return None, shown_context, context, state + # 删除上一次回复,重新生成 context.pop() user_turn = context.pop() - context.append({"role": "user", "utterance": user_turn["utterance"]}) - context.append({"role": "bot", "utterance": ""}) + context.append({"role": "user", "content": user_turn["content"]}) + context.append({"role": "assistant", "content": ""}) shown_context = get_shown_context(context) - return user_turn["utterance"], shown_context, context, state + return user_turn["content"], shown_context, context, state - def begin(utterance, state): - """Model inference.""" - utterance = utterance.strip().replace("
", "\n") + def begin(content, state): + """记录用户输入,并初始化 bot 回复为空。""" context = state.setdefault("context", []) - if not utterance: - gr.Warning("invalid inputs") - # gr.Warning("请输入有效问题") + if not content: + gr.Warning("Invalid inputs") shown_context = get_shown_context(context) return None, shown_context, context, state - context.append({"role": "user", "utterance": utterance}) - context.append({"role": "bot", "utterance": ""}) - + context.append({"role": "user", "content": content}) + context.append({"role": "assistant", "content": ""}) shown_context = get_shown_context(context) - return utterance, shown_context, context, state + return content, shown_context, context, state - def infer(utterance, state, top_k, top_p, temperature, repetition_penalty, max_length, src_length): - """Model inference.""" - utterance = utterance.strip().replace("
", "\n") + def infer(content, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length): + """调用 OpenAI 接口生成回答,并以流式返回部分结果。""" context = state.setdefault("context", []) - - if not utterance: - gr.Warning("invalid inputs") - # gr.Warning("请输入有效问题") + if not content: + gr.Warning("Invalid inputs") shown_context = get_shown_context(context) return None, shown_context, context, state - data = { - "context": utterance, - "top_k": top_k, - "top_p": top_p, + # 构造 OpenAI API 要求的 messages 格式 + messages = [] + for turn in context[:-1]: + messages.append({"role": turn["role"], "content": remove_think_tags(turn["content"])}) + + # 默认模型名称从参数中获取 + model = getattr(args, "model", default_params.get("model", "")) + payload = { + "model": model, + "messages": messages, "temperature": temperature, "repetition_penalty": repetition_penalty, - "max_length": max_length, + "max_tokens": max_tokens, "src_length": src_length, - "min_length": 1, + "top_p": top_p, + "top_k": top_k, + "stream": True, } - if len(context) > 2: - data["history"] = json.dumps(context[:-2]) - - res = requests.post(f"http://0.0.0.0:{args.flask_port}/api/chat", json=data, stream=True) - for index, line in enumerate(res.iter_lines()): - result = json.loads(line) - if result["error_code"] != 0: - gr.Warning(result["error_msg"]) - shown_context = get_shown_context(context) - return None, shown_context, context, state - - bot_response = result["result"]["response"] - - # replace \n with br: https://github.com/gradio-app/gradio/issues/4344 - bot_response["utterance"] = bot_response["utterance"].replace("\n", "
") - - if bot_response["utterance"].endswith("[END]"): - bot_response["utterance"] = bot_response["utterance"][:-5] - - # the first character of gradio can not be "
" or "
" - if bot_response["utterance"] in ["
", "
"] and index == 0: - continue - - context[-1]["utterance"] += bot_response["utterance"] + headers = { + # "Authorization": "Bearer " + args.api_key, + "Content-Type": "application/json" + } + url = f"http://0.0.0.0:{args.flask_port}/v1/chat/completions" + try: + res = requests.post(url, json=payload, headers=headers, stream=True) + except Exception as e: + gr.Warning(f"请求异常: {e}") shown_context = get_shown_context(context) - yield None, shown_context, context, state - - def clean_context(context): - """Clean context for EB input.""" - cleaned_context = copy.deepcopy(context) - for turn in cleaned_context: - if turn["role"] == "bot": - bot_resp = turn["utterance"] - if bot_resp.startswith(""): - bot_resp = "\n".join(bot_resp.split("\n")[1:]) - turn["utterance"] = bot_resp - return cleaned_context - - def extract_eda(eb_debug_info): - """Extract EDA result from EB dispatch info.""" - eda_res = None - for item in eb_debug_info: - if item["sys"] == "EDA": - eda_output = json.loads(item["output"]) - eda_res = eda_output["result"] - break - return eda_res - - def extract_eb_input(eb_debug_info, convert_for_ar=True): - """Extract EB raw input from EB dispatch info.""" - eb_raw_input = None - for item in eb_debug_info: - if item["sys"] == "EB": - eb_output = json.loads(item["output"]) - eb_raw_input = eb_output["text_after_process"] - if convert_for_ar: - eb_raw_input = eb_raw_input.replace("[CLS]", "").replace("[SEP]", "") - break - return eb_raw_input + return + + # 流式处理返回结果,实时更新最后一个对话记录(即 bot 回复) + for line in res.iter_lines(): + if line: + try: + decoded_line = line.decode("utf-8").strip() + # OpenAI 流返回每行以 "data:" 开头 + if decoded_line.startswith("data:"): + data_str = decoded_line[len("data:") :].strip() + if data_str == "[DONE]": + logger.info("Conversation round over.") + break + data_json = json.loads(data_str) + + # delta 中可能包含部分回复内容 + delta = data_json["choices"][0]["delta"].get("content", "") + if delta: + # Reformat tags to show in chatbot + delta = delta.replace("", r"\") + delta = delta.replace("", r"\<\/think\>") + context[-1]["content"] += delta + shown_context = get_shown_context(context) + yield None, shown_context, context, state + else: + logger.error(f"{decoded_line}") + gr.Warning(f"{decoded_line}") + + except Exception as e: + logger.error(f"解析返回结果异常: {e}") + gr.Warning(f"解析返回结果异常: {e}") + continue def get_shown_context(context): - """Get gradio chatbot.""" + """将对话上下文转换为 gr.Chatbot 显示格式,每一对 [用户, 助手]""" shown_context = [] + # 每两项组成一对 for turn_idx in range(0, len(context), 2): - shown_context.append([context[turn_idx]["utterance"], context[turn_idx + 1]["utterance"]]) + user_text = context[turn_idx]["content"] + bot_text = context[turn_idx + 1]["content"] if turn_idx + 1 < len(context) else "" + shown_context.append([user_text, bot_text]) return shown_context with gr.Blocks(title="LLM", theme=gr.themes.Soft()) as block: @@ -195,7 +220,7 @@ def get_shown_context(context): value=0, step=1, label="Top-k", - info="该参数越大,模型生成结果更加随机,反之生成结果更加确定。", + info="控制采样token个数。(不建议设置)", ) top_p = gr.Slider( minimum=0, @@ -203,7 +228,7 @@ def get_shown_context(context): value=default_params.get("top_p", 0.7), step=0.05, label="Top-p", - info="该参数越大,模型生成结果更加随机,反之生成结果更加确定。", + info="控制采样范围。", ) temperature = gr.Slider( minimum=0.05, @@ -211,7 +236,7 @@ def get_shown_context(context): value=default_params.get("temperature", 0.95), step=0.05, label="Temperature", - info="该参数越小,模型生成结果更加随机,反之生成结果更加确定。", + info="温度,控制生成随机性。", ) repetition_penalty = gr.Slider( minimum=0.1, @@ -219,32 +244,40 @@ def 
get_shown_context(context): value=default_params.get("repetition_penalty", 1.2), step=0.05, label="Repetition Penalty", - info="该参数越大,生成结果重复的概率越低。设置 1 则不开启。", + info="生成结果重复惩罚。(不建议设置)", ) - default_src_length = default_params["src_length"] - total_length = default_params["src_length"] + default_params["max_length"] + default_src_length = default_params.get("src_length", 128) + total_length = default_src_length + default_params.get("max_tokens", 50) src_length = create_src_slider(default_src_length, total_length) - max_length = create_max_slider(min(total_length - default_src_length, 50), total_length) + max_tokens = create_max_slider(max(total_length - default_src_length, 50), total_length) - def src_length_change_event(src_length_value, max_length_value): + def src_length_change_event(src_length_value, max_tokens_value): return create_max_slider( - min(total_length - src_length_value, max_length_value), + min(total_length - src_length_value, max_tokens_value), total_length - src_length_value, ) - def max_length_change_event(src_length_value, max_length_value): + def max_tokens_change_event(src_length_value, max_tokens_value): return create_src_slider( - min(total_length - max_length_value, src_length_value), - total_length - max_length_value, + min(total_length - max_tokens_value, src_length_value), + total_length - max_tokens_value, ) - src_length.change(src_length_change_event, inputs=[src_length, max_length], outputs=max_length) - max_length.change(max_length_change_event, inputs=[src_length, max_length], outputs=src_length) - + src_length.change(src_length_change_event, inputs=[src_length, max_tokens], outputs=max_tokens) + max_tokens.change(max_tokens_change_event, inputs=[src_length, max_tokens], outputs=src_length) with gr.Column(scale=4): state = gr.State({}) - context_chatbot = gr.Chatbot(label="Context") - utt_text = gr.Textbox(placeholder="请输入...", label="Utterance") + # 这里修改 gr.Chatbot 组件,启用 Markdown 渲染并支持 LaTeX 展示 + context_chatbot = gr.Chatbot( + label="Context", + render_markdown=True, + latex_delimiters=[ + {"left": "$$", "right": "$$", "display": True}, + {"left": "\\[", "right": "\\]", "display": True}, + {"left": "$", "right": "$", "display": True}, + ], + ) + utt_text = gr.Textbox(placeholder="请输入...", label="Content") with gr.Row(): clear_btn = gr.Button("清空") rollback_btn = gr.Button("撤回") @@ -261,7 +294,7 @@ def max_length_change_event(src_length_value, max_length_value): api_name="chat", ).then( infer, - inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length, src_length], + inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length], outputs=[utt_text, context_chatbot, raw_context_json, state], ) @@ -280,13 +313,13 @@ def max_length_change_event(src_length_value, max_length_value): ) regen_btn.click( regen, - inputs=[state, top_k, top_p, temperature, repetition_penalty, max_length, src_length], + inputs=[state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length], outputs=[utt_text, context_chatbot, raw_context_json, state], queue=False, api_name="chat", ).then( infer, - inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length, src_length], + inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length], outputs=[utt_text, context_chatbot, raw_context_json, state], ) @@ -298,7 +331,7 @@ def max_length_change_event(src_length_value, max_length_value): api_name="chat", ).then( infer, - inputs=[utt_text, state, top_k, top_p, 
temperature, repetition_penalty, max_length, src_length], + inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_tokens, src_length], outputs=[utt_text, context_chatbot, raw_context_json, state], ) @@ -310,5 +343,12 @@ def main(args, default_params: dict = {}): if __name__ == "__main__": + # 可以在 default_params 中设置默认参数,如 src_length, max_tokens, temperature, top_p 等 + default_params = { + "src_length": 1024, + "max_tokens": 1024, + "temperature": 0.95, + "top_p": 0.7, + } args = setup_args() - main(args) + main(args, default_params) diff --git a/llm/predict/predictor.py b/llm/predict/predictor.py index e01aa7ce5a97..8a7711cffd01 100644 --- a/llm/predict/predictor.py +++ b/llm/predict/predictor.py @@ -24,11 +24,14 @@ import numpy as np import paddle import paddle.incubate.multiprocessing as mp -from paddle.base.framework import in_cinn_mode, in_pir_executor_mode, use_pir_api +from paddle.base.framework import in_cinn_mode, in_pir_executor_mode from paddle.distributed import fleet try: - from paddlenlp.experimental.transformers import InferenceWithReferenceProposer + from paddlenlp.experimental.transformers import ( + EagleProposer, + InferenceWithReferenceProposer, + ) except: pass from paddlenlp.generation import GenerationConfig, TextIteratorStreamer @@ -48,7 +51,12 @@ PretrainedTokenizer, ) from paddlenlp.trl import llm_utils -from paddlenlp.utils.env import MAX_BSZ, MAX_DRAFT_TOKENS +from paddlenlp.utils.env import ( + MAX_BSZ, + MAX_DRAFT_TOKENS, + PADDLE_INFERENCE_MODEL_SUFFIX, + PADDLE_INFERENCE_WEIGHTS_SUFFIX, +) from paddlenlp.utils.import_utils import is_paddlenlp_ops_available from paddlenlp.utils.log import logger @@ -142,7 +150,9 @@ class PredictorArgument: ) speculate_method: str = field( default=None, - metadata={"help": "speculate method, it should be one of ['None', 'inference_with_reference']"}, + metadata={ + "help": "speculate method, it should be one of ['None', 'inference_with_reference', 'eagle', 'mtp']" + }, ) speculate_max_draft_token_num: int = field( default=1, @@ -153,6 +163,12 @@ class PredictorArgument: default=2, metadata={"help": "the max length of verify window for speculate method."} ) speculate_max_candidate_len: int = field(default=5, metadata={"help": "the max length of candidate tokens."}) + draft_model_name_or_path: str = field(default=None, metadata={"help": "The directory of eagle or draft model"}) + draft_model_quant_type: str = field( + default="", + metadata={"help": "Draft model quantization type. Reserved for future"}, + ) + return_full_hidden_states: int = field(default=False, metadata={"help": "whether return full hidden_states"}) def __post_init__(self): if self.speculate_method is not None: @@ -208,8 +224,17 @@ def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = N self.generation_config = None def _preprocess(self, source): + if self.tokenizer.chat_template is not None: - source = [source] if isinstance(source, str) else source + # for str -> List[str] eg. "hello" + # for List[str] -> List[str] eg. ["hello", "hello new"] + # for List[List[str]] -> List[List[List[str]]] eg. 历史对话形式,一轮 + # [ [ "Hello, how are you?", "I'm doing great. 
How can I help you today?"], + # ["I'd like to show off how chat templating works!"], ] + # for List[Dict] -> List[List[Dict]] [{'role': 'user', 'content': 'hello'}, {'role': 'assistant', 'content': 'nice'}] + # -> [[{'role': 'user', 'content': 'hello'}, {'role': 'assistant', 'content': 'nice'}]] + if not isinstance(source, list) or not isinstance(source[0], str): + source = [source] source = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in source] tokenized_source = self.tokenizer( @@ -217,6 +242,7 @@ def _preprocess(self, source): max_length=self.config.src_length, truncation=True, return_position_ids=True if not isinstance(self.tokenizer, ChatGLMTokenizer) else False, + return_attention_mask=True, truncation_side="left", return_tensors=self.return_tensors, padding=True, @@ -471,7 +497,8 @@ def _preprocess(self, source): pre_caches_length = 0 if not self.config.export_precache else self.pre_caches[0].shape[-2] if self.tokenizer.chat_template is not None: - source = [source] if isinstance(source, str) else source + if not isinstance(source, list) or not isinstance(source[0], str): + source = [source] source = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in source] inputs = llm_utils.dybatch_preprocess( @@ -648,10 +675,11 @@ def _create_predictor(self, predictor_args: PredictorArgument): infer_model_path = llm_utils.get_infer_model_path( predictor_args.model_name_or_path, predictor_args.model_prefix ) - if use_pir_api(): - config = paddle.inference.Config(infer_model_path + ".json", infer_model_path + ".pdiparams") - else: - config = paddle.inference.Config(infer_model_path + ".pdmodel", infer_model_path + ".pdiparams") + + config = paddle.inference.Config( + infer_model_path + PADDLE_INFERENCE_MODEL_SUFFIX, + infer_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX, + ) config.switch_ir_optim(True) # remove `gpu_cpu_map_matmul_v2_to_matmul_pass` to avoid mapping matmul_v2 -> matmul op @@ -914,7 +942,8 @@ def _preprocess(self, input_text: list[str]): assert len(input_text) == self.batch_size if self.tokenizer.chat_template is not None: - input_text = [input_text] if isinstance(input_text, str) else input_text + if not isinstance(input_text, list) or not isinstance(input_text[0], str): + input_text = [input_text] input_text = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in input_text] input_ids = [] @@ -931,20 +960,24 @@ def _preprocess(self, input_text: list[str]): ) input_ids.append(tokens["input_ids"][0]) - seq_lens = self.pad_batch_data(input_ids) + self.seq_lens = self.pad_batch_data(input_ids) self.model_inputs["input_ids"] = self.input_ids self.model_inputs["block_tables"][:][:] = -1 free_list = list(range(self.max_block_nums)) for i in range(self.config.batch_size): for j in range( - (seq_lens[i] + self.config.max_length + self.config.block_size - 1) // self.config.block_size + (self.seq_lens[i] + self.config.max_length + self.config.block_size - 1) // self.config.block_size ): used_block_id = free_list.pop() self.model_inputs["block_tables"][i, j] = used_block_id - self.model_inputs["seq_lens_this_time"] = paddle.to_tensor(np.array(seq_lens).astype("int32").reshape(-1, 1)) - self.model_inputs["seq_lens_encoder"] = paddle.to_tensor(np.array(seq_lens).astype("int32").reshape(-1, 1)) + self.model_inputs["seq_lens_this_time"] = paddle.to_tensor( + np.array(self.seq_lens).astype("int32").reshape(-1, 1) + ) + self.model_inputs["seq_lens_encoder"] = paddle.to_tensor( + 
np.array(self.seq_lens).astype("int32").reshape(-1, 1) + ) self.model_inputs["seq_lens_decoder"] = paddle.full( shape=[self.config.batch_size, 1], fill_value=0, dtype="int32" ) @@ -979,8 +1012,10 @@ def _preprocess(self, input_text: list[str]): self.proposer.input_ids_cpu = self.model_inputs["input_ids"].to("cpu", blocking=False) for bid in range(self.config.batch_size): self.model_inputs["pre_ids"][bid, 0] = self.model_inputs["input_ids"][bid][ - seq_lens[bid] - 1 + self.seq_lens[bid] - 1 ] # get the last token before padding of this batch + if self.config.speculate_method == "inference_with_reference": + self.proposer.input_ids_len[bid, 0] = self.seq_lens[bid] if self.config.mode == "static": for k, v in self.model_inputs.items(): @@ -990,6 +1025,8 @@ def _preprocess(self, input_text: list[str]): class DygraphBlockInferencePredictor(BlockInferencePredictorMixin): def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = None, **kwargs): model = kwargs.get("model", None) + self.return_full_hidden_states = config.return_full_hidden_states + self.full_hidden_states = None if model is None: raise ValueError("model should be provided for DygraphBlockInferencePredictor") self.cache_kvs_shape = model.get_cache_kvs_shape(model.config, config.batch_size) @@ -1019,19 +1056,24 @@ def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = N config.batch_size, config.max_length, ) + elif config.speculate_method in ["eagle", "mtp"]: + self.proposer = EagleProposer(args=config) else: self.proposer = None @paddle.no_grad() def _infer(self, inputs: dict[str, paddle.Tensor]): - self.model.generate( + return self.model.generate( **inputs, ) @paddle.no_grad() def predict(self, input_texts: list[str], return_tokens=False): self._preprocess(input_texts) - + if self.proposer is not None: + self.proposer.insert_query( + base_model_inputs=self.model_inputs, real_bs=len(input_texts), seq_lens=self.seq_lens + ) result_queue = mp.Queue() tensor_queue = mp.Queue() done_event = mp.Event() @@ -1063,9 +1105,16 @@ def predict(self, input_texts: list[str], return_tokens=False): self.model_inputs, real_batch_size=self.batch_size, seq_lens_this_time=self.model_inputs["seq_lens_this_time"], + base_model_full_hidden_states=self.full_hidden_states, ) - self._infer(self.model_inputs) - logger.info(f"running spend {time.time() - s_time}") + if self.return_full_hidden_states: + self.full_hidden_states = self._infer(self.model_inputs) + else: + self._infer(self.model_inputs) + logger.info(f"running spend {time.time() - s_time}") + + if self.proposer is not None: + self.proposer.postprocess(base_model_inputs=self.model_inputs) if self.tensor_parallel_rank == 0: outputs = [] @@ -1091,6 +1140,9 @@ def __init__( **kwargs, ): self.cache_kvs_shape = kwargs.get("cache_kvs_shape", None) + self.model_args = kwargs.get("model_args", None) + self.return_full_hidden_states = config.return_full_hidden_states + self.full_hidden_states = None if self.cache_kvs_shape is None: raise ValueError("cache_kvs_shape should be provided for StaticGraphBlockInferencePredictor") BlockInferencePredictorMixin.__init__(self, config, tokenizer) @@ -1127,6 +1179,11 @@ def __init__( config.batch_size, config.max_length, ) + elif config.speculate_method in ["eagle", "mtp"]: + self.proposer = EagleProposer( + args=config, + model_args=self.model_args, + ) else: self.proposer = None @@ -1141,10 +1198,10 @@ def _create_predictor(self, predictor_args: PredictorArgument): predictor_args.model_name_or_path, 
predictor_args.model_prefix ) - if use_pir_api(): - config = paddle.inference.Config(infer_model_path + ".json", infer_model_path + ".pdiparams") - else: - config = paddle.inference.Config(infer_model_path + ".pdmodel", infer_model_path + ".pdiparams") + config = paddle.inference.Config( + infer_model_path + PADDLE_INFERENCE_MODEL_SUFFIX, + infer_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX, + ) config.switch_ir_optim(False) if predictor_args.device in paddle.device.get_all_custom_device_type(): @@ -1177,7 +1234,11 @@ def _create_predictor(self, predictor_args: PredictorArgument): def predict(self, input_texts: list[str], return_tokens=False): s_time = time.time() self._preprocess(input_texts) - logger.info(f"preprocess spend {time.time() - s_time}") + if self.proposer is not None: + self.proposer.insert_query( + base_model_inputs=self.model_inputs, real_bs=len(input_texts), seq_lens=self.seq_lens + ) + logger.info(f"preprocess spend {time.time() - s_time}") result_queue = mp.Queue() tensor_queue = mp.Queue() @@ -1210,10 +1271,16 @@ def predict(self, input_texts: list[str], return_tokens=False): self.model_inputs, real_batch_size=self.batch_size, seq_lens_this_time=self.model_inputs["seq_lens_this_time"], + base_model_full_hidden_states=self.full_hidden_states, ) - self.predictor.run(list(self.model_inputs.values())) - logger.info(f"running spend {time.time() - s_time}") + if self.return_full_hidden_states: + self.full_hidden_states = self.predictor.run(list(self.model_inputs.values()))[0] + else: + self.predictor.run(list(self.model_inputs.values())) + logger.info(f"running spend {time.time() - s_time}") + if self.proposer is not None: + self.proposer.postprocess(base_model_inputs=self.model_inputs) if self.tensor_parallel_rank == 0: outputs = [] output_tokens = [] @@ -1244,7 +1311,7 @@ def create_predictor( config: PretrainedConfig, model_args: ModelArgument, tokenizer: PretrainedTokenizer = None, - **kwargs + **kwargs, ): """ Create a predictor @@ -1288,7 +1355,9 @@ def create_predictor( predictor_class = getattr(import_class, predictor_class_name) # instance - predictor = predictor_class(predictor_args, tokenizer=tokenizer, model=model, cache_kvs_shape=cache_kvs_shape) + predictor = predictor_class( + predictor_args, tokenizer=tokenizer, model=model, cache_kvs_shape=cache_kvs_shape, model_args=model_args + ) return predictor @@ -1296,7 +1365,16 @@ def create_predictor( predictor_args: PredictorArgument, model_args: ModelArgument, ): - tokenizer = AutoTokenizer.from_pretrained(predictor_args.model_name_or_path, padding_side="left") + + paddle.set_device(predictor_args.device) + paddle.set_default_dtype(predictor_args.dtype) + + from paddlenlp.utils.env import USE_FAST_TOKENIZER + + tokenizer = AutoTokenizer.from_pretrained( + predictor_args.model_name_or_path, padding_side="left", use_fast=USE_FAST_TOKENIZER + ) + # init chat_template for tokenizer llm_utils.init_chat_template(tokenizer, predictor_args.model_name_or_path, predictor_args.chat_template) @@ -1386,9 +1464,6 @@ def predict(): parser = PdArgumentParser((PredictorArgument, ModelArgument)) predictor_args, model_args = parser.parse_args_into_dataclasses() - paddle.set_device(predictor_args.device) - paddle.set_default_dtype(predictor_args.dtype) - tensor_parallel_degree = paddle.distributed.get_world_size() if tensor_parallel_degree > 1: strategy = fleet.DistributedStrategy() @@ -1421,7 +1496,9 @@ def predict(): target_texts.append("") else: - source_texts = ["解释一下温故而知新"] * predictor_args.batch_size + source_texts = [ + 
"2014年3月,大范围雾霾天气长时间影响我国东部地区,严重危害人体健康。造成雾霾天气的人为原因有____\r\n①工业生产中使用矿物作为燃料,大量排放污染物 ②汽车尾气的大量排放 \r\n③风力小,空气流动不畅 ④冬季取暖排放粉尘\nA. ①②③\nB. ②③④\nC. ①③④\nD. ①②④" + ] * predictor_args.batch_size target_texts = [""] * predictor_args.batch_size batch_source_texts = batchfy_text(source_texts, predictor_args.batch_size) diff --git a/llm/predict/request_flask_server.py b/llm/predict/request_flask_server.py index b7cef31eda9a..11b3e88404f1 100644 --- a/llm/predict/request_flask_server.py +++ b/llm/predict/request_flask_server.py @@ -17,38 +17,99 @@ import requests -def send_request(query, history=None): - data = { - "context": query, - "history": history, - "top_k": 0, - "top_p": 0.7, # 0.0 为 greedy_search - "temperature": 0.95, - "repetition_penalty": 1.3, - "max_length": 100, - "src_length": 100, +def build_messages(query, history=None): + """ + 根据传入的 query 和 history 构造符合 OpenAI 格式的消息列表。 + 如果 history 为 list 且每项为 dict,则直接使用;如果为 list 且每项为字符串, + 则依次按用户(user)与助手(assistant)交替添加;否则直接只添加当前用户消息。 + """ + messages = [] + if history: + if isinstance(history, list): + if all(isinstance(item, dict) for item in history): + messages.extend(history) + else: + # 假设 history 按顺序依次为用户、助手、用户、助手…… + for idx, item in enumerate(history): + role = "user" if idx % 2 == 0 else "assistant" + messages.append({"role": role, "content": str(item)}) + else: + messages.append({"role": "user", "content": str(history)}) + # 当前请求作为最新的用户消息 + messages.append({"role": "user", "content": query}) + return messages + + +def send_request(query, history=None, stream=True): + # 构造 OpenAI 格式的请求体 + payload = { + "messages": build_messages(query, history), + # 以下生成参数可根据需要调整 + # "top_k": 0, + # "top_p": 0.7, + # "temperature": 0.8, + # "repetition_penalty": 1.3, + "max_length": 1024, + "src_length": 1024, "min_length": 1, + "stream": stream, } - res = requests.post("http://127.0.0.1:8010/api/chat", json=data, stream=True) - text = "" + res = requests.post("http://localhost:8011/v1/chat/completions", json=payload, stream=True) + result_text = "" + printed_reasoning_content = False + printed_content = False for line in res.iter_lines(): - result = json.loads(line) - - if result["error_code"] != 0: - text = "error-response" - break + # https://github.com/vllm-project/vllm/blob/433c4a49230a470f13657f06e7612cde86e4fb40/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py#L67-L69 + if not line: + continue - result = json.loads(line) - bot_response = result["result"]["response"] + decoded_line = line.decode("utf-8").strip() + # OpenAI 流返回每行以 "data:" 开头 + if decoded_line.startswith("data:"): + data = decoded_line[5:].strip() # Remove "data:" prefix + if data == "[DONE]": # End of stream + print("\nclient: Stream completed.\n") + break + try: + # Parse the JSON data + chunk = json.loads(data) + reasoning_content = chunk["choices"][0]["delta"].get("reasoning_content", "") + content = chunk["choices"][0]["delta"].get("content", "") - if bot_response["utterance"].endswith("[END]"): - bot_response["utterance"] = bot_response["utterance"][:-5] - text += bot_response["utterance"] + if reasoning_content: + if not printed_reasoning_content: + printed_reasoning_content = True + print("reasoning_content:", end="", flush=True) + print(reasoning_content, end="", flush=True) + elif content: + if not printed_content: + printed_content = True + print("\ncontent:", end="", flush=True) + # Extract and print the content + print(content, end="", flush=True) + result_text += content + except Exception as e: + print("解析响应出错:", e) + continue + else: + try: + data 
= json.loads(decoded_line) + content = data["choices"][0]["message"].get("content", "") + print(content, end="", flush=True) + result_text += content + except Exception as e: + print("解析响应出错:", e) + continue - print("result -> ", text) - return text + print() + return result_text -send_request("你好啊") -send_request("再加一等于多少", ["一加一等于多少", "一加一等于二"]) -send_request("再加一等于多少", [{"utterance": "一加一等于多少"}, {"utterance": "一加一等于二"}]) +if __name__ == "__main__": + # 示例调用:仅发送当前用户消息 + send_request("你好啊") + send_request("你好啊", stream=False) + # 示例调用:使用 history 为字符串列表(交替为用户与助手的对话) + send_request("再加一等于多少", ["一加一等于多少", "一加一等于二"]) + # 示例调用:history 为字典格式,明确指定对话角色 + send_request("再加一等于多少", [{"role": "user", "content": "一加一等于多少"}, {"role": "assistant", "content": "一加一等于二"}]) diff --git a/llm/run_finetune.py b/llm/run_finetune.py index b3a5f3fcea2e..f5abf8366580 100644 --- a/llm/run_finetune.py +++ b/llm/run_finetune.py @@ -52,13 +52,18 @@ AutoModelForCausalLM, AutoModelForCausalLMPipe, AutoTokenizer, + DeepseekV2ForCausalLM, + DeepseekV2ForCausalLMPipe, + DeepseekV3ForCausalLM, + DeepseekV3ForCausalLMPipe, Llama3Tokenizer, LlamaForCausalLM, LlamaForCausalLMPipe, LlamaTokenizer, Qwen2ForCausalLM, Qwen2ForCausalLMPipe, - register_sequence_parallel_allreduce_hooks, + Qwen2MoeForCausalLM, + Qwen2MoeForCausalLMPipe, ) from paddlenlp.transformers.configuration_utils import LlmMetaConfig from paddlenlp.trl import DataConfig, ModelConfig, SFTConfig, SFTTrainer @@ -75,7 +80,18 @@ # Fine-tune Environment Variables to support sharding stage1 overlap optimization. os.environ["USE_CASUAL_MASK"] = "False" -flash_mask_support_list = [LlamaForCausalLM, LlamaForCausalLMPipe, Qwen2ForCausalLM, Qwen2ForCausalLMPipe] +flash_mask_support_list = [ + DeepseekV2ForCausalLM, + DeepseekV2ForCausalLMPipe, + DeepseekV3ForCausalLM, + DeepseekV3ForCausalLMPipe, + LlamaForCausalLM, + LlamaForCausalLMPipe, + Qwen2ForCausalLM, + Qwen2ForCausalLMPipe, + Qwen2MoeForCausalLM, + Qwen2MoeForCausalLMPipe, +] def paddlenlp_verison_check(): @@ -152,6 +168,13 @@ def main(): quantization_config=quantization_config, ) + architectures_to_check = {"Qwen2Moe", "DeepseekV2", "DeepseekV3"} + if ( + any(architecture in str(model_config.architectures) for architecture in architectures_to_check) + and training_args.data_parallel_degree > 1 + ): + training_args.use_expert_parallel = True + LlmMetaConfig.set_llm_config(model_config, training_args) model_config.use_fast_layer_norm = model_args.use_fast_layer_norm @@ -188,6 +211,8 @@ def main(): logger.info(f"Final model config: {model_config}") + logger.info("Creating model") + model_class = AutoModelForCausalLM if training_args.pipeline_parallel_degree > 1: if data_args.eval_with_do_generation and training_args.do_eval: @@ -231,10 +256,6 @@ def neft_post_hook(module, input, output): else: raise NotImplementedError("Only support neftune for model with get_input_embeddings") - if training_args.sequence_parallel: - register_sequence_parallel_allreduce_hooks( - model, training_args.gradient_accumulation_steps, training_args.fuse_sequence_parallel_allreduce - ) # Load tokenizer & dataset tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, from_aistudio=model_args.from_aistudio) reft_layers = None @@ -259,7 +280,6 @@ def neft_post_hook(module, input, output): tokenizer.pad_token_id = tokenizer.eos_token_id train_ds, dev_ds, test_ds = create_dataset(data_args, training_args) - # TODO(ZHUI & sijunhe): Temporary implementation. Generalize this logic and move to Trainer later. 
if training_args.resume_from_checkpoint is not None and data_args.lazy: logger.info( @@ -303,6 +323,7 @@ def neft_post_hook(module, input, output): ) eval_zero_padding = False + logger.info("Trans the dataset text into token ids, please wait for a moment.") train_ds, dev_ds, test_ds = trans_dataset_to_ids( train_ds, dev_ds, test_ds, model_args, data_args, trans_func, eval_zero_padding ) @@ -593,7 +614,12 @@ def create_peft_model(model_args, reft_args, training_args, dtype, model_config, def trans_dataset_to_ids(train_ds, dev_ds, test_ds, model_args, data_args, trans_func, eval_zero_padding): if train_ds is not None: train_ds = train_ds.map( - partial(trans_func, is_test=False, zero_padding=data_args.zero_padding, flash_mask=model_args.flash_mask) + partial( + trans_func, + is_test=False, + zero_padding=data_args.zero_padding, + flash_mask=model_args.flash_mask, + ) ) if dev_ds is not None: dev_ds = dev_ds.map( @@ -620,18 +646,21 @@ def create_dataset(data_args, training_args): if os.path.exists(os.path.join(data_args.dataset_name_or_path, "train.json")) or os.path.exists( os.path.join(data_args.dataset_name_or_path, "dev.json") ): + logger.info("load train") if training_args.do_train: train_ds = load_dataset( "json", data_files=os.path.join(data_args.dataset_name_or_path, "train.json"), lazy=data_args.lazy, )[0] + logger.info("load eval") if training_args.do_eval: dev_ds = load_dataset( "json", data_files=os.path.join(data_args.dataset_name_or_path, "dev.json"), lazy=data_args.lazy, )[0] + logger.info("load test") if training_args.do_predict: test_ds = load_dataset( "json", diff --git a/llm/run_lora_hand.sh b/llm/run_lora_hand.sh index 29e5165ed1cf..2e2478bdc7f8 100644 --- a/llm/run_lora_hand.sh +++ b/llm/run_lora_hand.sh @@ -46,8 +46,8 @@ python -u -m paddle.distributed.launch \ --log_dir "log/$task_name""_log" \ run_finetune.py \ --model_name_or_path "meta-llama/Meta-Llama-3.1-8B-Instruct" \ - --dataset_name_or_path "fintune_data/data" \ - --output_dir "./checkpoints/llama_sft_ckpts" \ + --dataset_name_or_path "finetune_data/data" \ + --output_dir "./checkpoints/llama_lora_ckpts" \ --lora true \ --use_mora false \ --per_device_train_batch_size 1 \ @@ -71,12 +71,11 @@ python -u -m paddle.distributed.launch \ --disable_tqdm true \ --load_best_model_at_end true \ --eval_with_do_generation false \ - --metric_for_best_model "accuracy" \ + --metric_for_best_model "loss" \ --recompute false \ --save_total_limit 1 \ --tensor_parallel_degree 2 \ --pipeline_parallel_degree 2 \ - --sharding "stage1" \ --zero_padding false \ --unified_checkpoint false \ --use_flash_attention true \ @@ -85,4 +84,3 @@ python -u -m paddle.distributed.launch \ --sharding_parallel_config "enable_stage1_tensor_fusion" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --pipeline_parallel_config "enable_sharding_comm_overlap enable_dp_comm_overlap enable_overlap_p2p_comm disable_p2p_cache_shape" \ - # --num_hidden_layers 4 \ diff --git a/llm/run_pretrain.py b/llm/run_pretrain.py index fc5e4510cc4c..25be3832156e 100644 --- a/llm/run_pretrain.py +++ b/llm/run_pretrain.py @@ -41,7 +41,6 @@ AutoTokenizer, CosineAnnealingWithWarmupDecay, LinearAnnealingWithWarmupDecay, - register_sequence_parallel_allreduce_hooks, ) from paddlenlp.transformers.configuration_utils import LlmMetaConfig, llmmetaclass from paddlenlp.utils.batch_sampler import DistributedBatchSampler @@ -479,6 +478,13 @@ def main(): except: print("Not register llama pp reshard information.") + architectures_to_check = {"Qwen2Moe", "DeepseekV2", "DeepseekV3"} + 
if ( + any(architecture in str(config.architectures) for architecture in architectures_to_check) + and training_args.data_parallel_degree > 1 + ): + training_args.use_expert_parallel = True + if model_args.continue_training: # NOTE(gongenlei): new add if training_args.autotuner_benchmark: @@ -492,11 +498,6 @@ def main(): else: model = model_class.from_config(config, dtype=dtype) - if training_args.sequence_parallel: - register_sequence_parallel_allreduce_hooks( - model, training_args.gradient_accumulation_steps, training_args.fuse_sequence_parallel_allreduce - ) - if training_args.recompute: model.recompute_enable() diff --git a/llm/run_quantization.py b/llm/run_quantization.py index 232c0595ca9c..9989353d54b6 100644 --- a/llm/run_quantization.py +++ b/llm/run_quantization.py @@ -41,7 +41,6 @@ LlamaTokenizer, Qwen2ForCausalLM, Qwen2ForCausalLMPipe, - register_sequence_parallel_allreduce_hooks, ) from paddlenlp.transformers.configuration_utils import LlmMetaConfig from paddlenlp.trl import DataConfig, ModelConfig, QuantConfig, SFTConfig, SFTTrainer @@ -162,10 +161,6 @@ def main(): if model_args.flash_mask and not any(isinstance(model, cls) for cls in flash_mask_support_list): raise NotImplementedError(f"{model.__class__} not support flash mask.") - if training_args.sequence_parallel: - register_sequence_parallel_allreduce_hooks( - model, training_args.gradient_accumulation_steps, training_args.fuse_sequence_parallel_allreduce - ) # Load tokenizer & dataset tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, from_aistudio=model_args.from_aistudio) # init chat_template for tokenizer diff --git a/llm/run_sft_hand.sh b/llm/run_sft_hand.sh index 88e7a81df5d8..535825e8bf0c 100644 --- a/llm/run_sft_hand.sh +++ b/llm/run_sft_hand.sh @@ -46,7 +46,7 @@ python -u -m paddle.distributed.launch \ --log_dir "log/$task_name""_log" \ run_finetune.py \ --model_name_or_path "meta-llama/Meta-Llama-3.1-8B-Instruct" \ - --dataset_name_or_path "fintune_data/data" \ + --dataset_name_or_path "finetune_data/data" \ --output_dir "./checkpoints/llama_sft_ckpts" \ --per_device_train_batch_size 1 \ --gradient_accumulation_steps 2 \ @@ -69,12 +69,11 @@ python -u -m paddle.distributed.launch \ --disable_tqdm true \ --load_best_model_at_end true \ --eval_with_do_generation false \ - --metric_for_best_model "accuracy" \ + --metric_for_best_model "loss" \ --recompute false \ --save_total_limit 1 \ --tensor_parallel_degree 2 \ --pipeline_parallel_degree 2 \ - --sharding "stage1" \ --zero_padding false \ --unified_checkpoint false \ --flash_mask false \ @@ -84,4 +83,3 @@ python -u -m paddle.distributed.launch \ --sharding_parallel_config "enable_stage1_tensor_fusion enable_stage1_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --pipeline_parallel_config "enable_sharding_comm_overlap enable_dp_comm_overlap enable_overlap_p2p_comm disable_p2p_cache_shape" \ - # --num_hidden_layers 4 \ diff --git a/llm/server/README.md b/llm/server/README.md index b521644e5769..535b40497d05 100644 --- a/llm/server/README.md +++ b/llm/server/README.md @@ -1,11 +1,10 @@ +# 大模型服务化部署-快速开始教程 -
大模型服务化部署
+*该部署工具是基于英伟达 Triton 框架专为服务器场景的大模型服务化部署而设计。它提供了支持 gRPC、HTTP 协议的服务接口,以及流式 Token 输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化(PTQ)等加速优化策略,为用户带来易用且高性能的部署体验。* -*该部署工具是基于英伟达Triton框架专为服务器场景的大模型服务化部署而设计。它提供了支持gRPC、HTTP协议的服务接口,以及流式Token输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化(PTQ)等加速优化策略,为用户带来易用且高性能的部署体验。* +## 快速开始 -# 快速开始 - - 基于预编译镜像部署,本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例,更多模型请参考[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md), 更细致的模型推理、量化教程可以参考[大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md): + 基于预编译镜像部署,**使用飞桨静态图模型部署**。本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例。其他模型需按照要求导出为**静态图模型格式**。更多模型请参考[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md), 更细致的模型推理、量化教程可以参考[大模型推理教程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md): ``` # 下载模型 @@ -34,6 +33,6 @@ Note: 更多关于该部署工具的使用方法,请查看[服务化部署流程](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/server/docs/deploy_usage_tutorial.md) -# License +## License 遵循 [Apache-2.0开源协议](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/LICENSE) 。 diff --git a/llm/server/dockerfiles/Dockerfile_serving_cuda118_cudnn8 b/llm/server/dockerfiles/Dockerfile_serving_cuda118_cudnn8 index c701765e9829..f6f762e63726 100644 --- a/llm/server/dockerfiles/Dockerfile_serving_cuda118_cudnn8 +++ b/llm/server/dockerfiles/Dockerfile_serving_cuda118_cudnn8 @@ -1,31 +1,28 @@ FROM registry.baidubce.com/paddlepaddle/fastdeploy:llm-base-gcc12.3-cuda11.8-cudnn8-nccl2.15.5 WORKDIR /opt/output/ -COPY ./server/ /opt/output/Serving/ - ENV LD_LIBRARY_PATH="/usr/local/cuda-11.8/compat/:$LD_LIBRARY_PATH" -RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple -RUN python3 -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu118/ \ +RUN python3 -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu123/ \ && python3 -m pip install paddlenlp==3.0.0b0 \ - && python3 -m pip install --no-cache-dir sentencepiece pycryptodome tritonclient[all]==2.41.1 + && python3 -m pip install --no-cache-dir sentencepiece pycryptodome tritonclient[all]==2.41.1 \ + && python3 -m pip install --no-cache-dir --force-reinstall https://paddlepaddle-inference-banchmark.bj.bcebos.com/paddlenlp_ops-0.0.0-py3-none-any.whl \ + && apt-get clean && rm -rf /var/lib/apt/lists/* -RUN git clone https://gitee.com/paddlepaddle/PaddleNLP.git && cd PaddleNLP/csrc \ - && python3 setup_cuda.py build && python3 setup_cuda.py install --user \ - && cp -r /opt/output/PaddleNLP/paddlenlp /usr/local/lib/python3.10/dist-packages/ \ - && cp -r /root/.local/lib/python3.10/site-packages/* /usr/local/lib/python3.10/dist-packages/ \ - && rm -rf /opt/output/PaddleNLP +RUN mkdir -p /opt/source/ && cd /opt/source/ \ + && git clone https://github.com/PaddlePaddle/Paddle.git \ + && git clone https://github.com/PaddlePaddle/PaddleNLP.git \ + && cp -r /opt/source/PaddleNLP/paddlenlp /usr/local/lib/python3.10/dist-packages/ \ + && python3 -m pip install --no-cache-dir -r PaddleNLP/requirements.txt \ + && python3 -m pip install --no-cache-dir -r 
PaddleNLP/llm/server/server/requirements.txt -RUN python3 -m pip install -r /opt/output/Serving/requirements.txt && rm /opt/output/Serving/requirements.txt -RUN mv Serving/server /usr/local/lib/python3.10/dist-packages/ RUN mkdir -p /opt/output/Serving/llm_model/model/1 \ - && mv /opt/output/Serving/config/config.pbtxt /opt/output/Serving/llm_model/model/ \ - && rm -rf /opt/output/Serving/config/ -RUN echo "from server.triton_server import TritonPythonModel" >>/opt/output/Serving/llm_model/model/1/model.py + && cp /opt/source/PaddleNLP/llm/server/server/config/config.pbtxt /opt/output/Serving/llm_model/model/ \ + && cp /opt/source/PaddleNLP/llm/server/server/scripts/start_server.sh /opt/output/Serving/ \ + && cp /opt/source/PaddleNLP/llm/server/server/scripts/stop_server.sh /opt/output/Serving/ -RUN cd /opt/output/Serving/ \ - && cp scripts/start_server.sh . && cp scripts/stop_server.sh . \ - && rm -rf scripts +ENV PYTHONPATH="/opt/source/PaddleNLP/llm/server/server" +RUN echo "from server.triton_server import TritonPythonModel" >>/opt/output/Serving/llm_model/model/1/model.py ENV http_proxy="" ENV https_proxy="" diff --git a/llm/server/dockerfiles/Dockerfile_serving_cuda123_cudnn9 b/llm/server/dockerfiles/Dockerfile_serving_cuda123_cudnn9 index 4b0d1f002d98..ffe2517d3f0c 100644 --- a/llm/server/dockerfiles/Dockerfile_serving_cuda123_cudnn9 +++ b/llm/server/dockerfiles/Dockerfile_serving_cuda123_cudnn9 @@ -1,31 +1,28 @@ FROM registry.baidubce.com/paddlepaddle/fastdeploy:llm-base-gcc12.3-cuda12.3-cudnn9-nccl2.15.5 WORKDIR /opt/output/ -COPY ./server/ /opt/output/Serving/ - ENV LD_LIBRARY_PATH="/usr/local/cuda-12.3/compat/:$LD_LIBRARY_PATH" -RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple RUN python3 -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu123/ \ && python3 -m pip install paddlenlp==3.0.0b0 \ - && python3 -m pip install --no-cache-dir sentencepiece pycryptodome tritonclient[all]==2.41.1 + && python3 -m pip install --no-cache-dir sentencepiece pycryptodome tritonclient[all]==2.41.1 \ + && python3 -m pip install --no-cache-dir --force-reinstall https://paddlepaddle-inference-banchmark.bj.bcebos.com/paddlenlp_ops-0.0.0-py3-none-any.whl \ + && apt-get clean && rm -rf /var/lib/apt/lists/* -RUN git clone https://gitee.com/paddlepaddle/PaddleNLP.git && cd PaddleNLP/csrc \ - && python3 setup_cuda.py build && python3 setup_cuda.py install --user \ - && cp -r /opt/output/PaddleNLP/paddlenlp /usr/local/lib/python3.10/dist-packages/ \ - && cp -r /root/.local/lib/python3.10/site-packages/* /usr/local/lib/python3.10/dist-packages/ \ - && rm -rf /opt/output/PaddleNLP +RUN mkdir -p /opt/source/ && cd /opt/source/ \ + && git clone https://github.com/PaddlePaddle/Paddle.git \ + && git clone https://github.com/PaddlePaddle/PaddleNLP.git \ + && cp -r /opt/source/PaddleNLP/paddlenlp /usr/local/lib/python3.10/dist-packages/ \ + && python3 -m pip install --no-cache-dir -r PaddleNLP/requirements.txt \ + && python3 -m pip install --no-cache-dir -r PaddleNLP/llm/server/server/requirements.txt -RUN python3 -m pip install -r /opt/output/Serving/requirements.txt && rm /opt/output/Serving/requirements.txt -RUN mv Serving/server /usr/local/lib/python3.10/dist-packages/ RUN mkdir -p /opt/output/Serving/llm_model/model/1 \ - && mv /opt/output/Serving/config/config.pbtxt /opt/output/Serving/llm_model/model/ \ - && rm -rf /opt/output/Serving/config/ -RUN echo "from server.triton_server import TritonPythonModel" 
>>/opt/output/Serving/llm_model/model/1/model.py + && cp /opt/source/PaddleNLP/llm/server/server/config/config.pbtxt /opt/output/Serving/llm_model/model/ \ + && cp /opt/source/PaddleNLP/llm/server/server/scripts/start_server.sh /opt/output/Serving/ \ + && cp /opt/source/PaddleNLP/llm/server/server/scripts/stop_server.sh /opt/output/Serving/ -RUN cd /opt/output/Serving/ \ - && cp scripts/start_server.sh . && cp scripts/stop_server.sh . \ - && rm -rf scripts +ENV PYTHONPATH="/opt/source/PaddleNLP/llm/server/server" +RUN echo "from server.triton_server import TritonPythonModel" >>/opt/output/Serving/llm_model/model/1/model.py ENV http_proxy="" ENV https_proxy="" diff --git a/llm/server/docs/deploy_usage_tutorial.md b/llm/server/docs/deploy_usage_tutorial.md index d5771bd0efbf..5ad704f22d4f 100644 --- a/llm/server/docs/deploy_usage_tutorial.md +++ b/llm/server/docs/deploy_usage_tutorial.md @@ -1,6 +1,8 @@ +# 静态图高性能部署全流程 ## 目录 +- [快速开始](#快速开始) - [部署环境准备](#部署环境准备) - [基础环境](#基础环境) - [准备部署镜像](#准备部署镜像) @@ -12,13 +14,46 @@ - [服务状态查询](#服务状态查询) - [服务测试](#服务测试) - [Python 客户端](#Python-客户端) - - [HTTP调用](#HTTP调用) + - [HTTP 调用](#HTTP-调用) - [OpenAI 客户端](#OpenAI-客户端) - [返回示例](#返回示例) -- [基于dockerfile创建自己的镜像](#基于dockerfile创建自己的镜像) +- [基于 dockerfile 创建自己的镜像](#基于 dockerfile 创建自己的镜像) - [模型配置参数介绍](#模型配置参数介绍) - [请求参数介绍](#请求参数介绍) + + +*该部署工具是基于英伟达 Triton 框架专为服务器场景的大模型服务化部署而设计。它提供了支持 gRPC、HTTP 协议的服务接口,以及流式 Token 输出能力。底层推理引擎支持连续批处理、weight only int8、后训练量化(PTQ)等加速优化策略,为用户带来易用且高性能的部署体验。* + +## 快速开始 + +基于预编译镜像部署,**使用飞桨静态图模型部署**。本节以 Meta-Llama-3-8B-Instruct-A8W8C8 为例。其他模型需按照要求导出为**静态图模型格式**。 +具体流程如下,仅供示例参考,用户需要根据自己的需求导出所需**静态图模型**,然后开始部署流程。 + +```shell + # 下载模型 + wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar + mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8 + + # 挂载模型文件 + export MODEL_PATH=${PWD}/Llama-3-8B-A8W8C8 + + docker run --gpus all --shm-size 5G --network=host --privileged --cap-add=SYS_PTRACE \ + -v ${MODEL_PATH}:/models/ \ + -dit registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.2 \ + bash -c 'export USE_CACHE_KV_INT8=1 && cd /opt/output/Serving && bash start_server.sh; exec bash' +``` +等待服务启动成功(服务初次启动大概需要40s),可以通过以下命令测试: + +```shell + curl 127.0.0.1:9965/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{"text": "hello, llm"}' +``` + +Note: +1. 请保证 shm-size >= 5,不然可能会导致服务启动失败 + ## 部署环境准备 ### 基础环境 @@ -34,7 +69,7 @@ ### 准备部署镜像 -为了方便部署,我们提供了 cuda12.3 的镜像,可以直接拉取镜像,或者使用我们提供的 `Dockerfile` [构建自定义镜像](#基于dockerfile创建自己的镜像) +为了方便部署,我们提供了 cuda12.3 的镜像,可以直接拉取镜像,或者使用我们提供的 `Dockerfile` [构建自定义镜像](#基于 dockerfile 创建自己的镜像) ``` docker pull registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.2 ``` @@ -43,6 +78,13 @@ docker pull registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cu 该部署工具为 PaddleNLP 静态图模型提供了高效的部署方案,模型静态图导出方案请参考:[LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md)、[Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md)、[Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md) ... 
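As a point of reference before the sample-model download below: exporting a model to the static graph format expected by this serving image is normally done with the export script shipped in PaddleNLP's `llm` directory. The invocation here is only a hedged sketch -- the script path and flag names are assumptions and vary by model and quantization scheme, so follow the model-specific docs linked above for the authoritative command.

```shell
# Hypothetical export sketch -- script path and flag names are assumptions;
# check the LLaMA/Qwen/Mixtral docs linked above for the exact arguments.
cd PaddleNLP/llm
python ./predict/export_model.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --output_path ./Llama-3-8B-Instruct-Static \
    --dtype float16 \
    --inference_model 1
```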
+或者下载样例模型: +```shell +# 下载模型 +wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar +mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8 +``` + 导出后的模型放在任意文件夹下,以 `/home/workspace/models_dir` 为例 ``` @@ -57,7 +99,7 @@ cd /home/workspace/models_dir # ├── rank_mapping.csv # 多卡模型会有此文件,如为单卡模型,则无此文件(可选,仅在多卡部署模式下需要) # └── rank_0 # 保存模型结构和权重文件的目录 # ├── model.pdiparams -# └── model.pdmodel +# └── model.pdmodel 或者 model.json # Paddle 3.0 版本模型为model.json,Paddle 2.x 版本模型为model.pdmodel ``` ### 创建容器 @@ -87,7 +129,7 @@ ls /models/ 根据需求和硬件信息,配置以下环境变量 -``` +```shell # 单/多卡推理配置。自行修改。 ## 如果是单卡推理,使用0卡,设置如下环境变量。 export MP_NUM=1 @@ -128,7 +170,7 @@ export PUSH_MODE_HTTP_WORKERS="1" # HTTP服务进程数,在 PUSH_MODE_HTTP_POR ### 启动服务 -``` +```shell cd /opt/output/Serving bash start_server.sh @@ -149,11 +191,11 @@ health接口:(模型是否准备好推理) ## 服务测试 -### HTTP调用 +### HTTP 调用 -提示:HTTP调用接口使用变量 PUSH_MODE_HTTP_PORT 配置!HTTP_PORT 仅用于探活接口使用! +提示:HTTP 调用接口使用变量 PUSH_MODE_HTTP_PORT 配置!HTTP_PORT 仅用于探活接口使用! -``` +```python import uuid import json import requests @@ -193,7 +235,7 @@ for line in res.iter_lines(): ### 返回示例 -``` +```python 如果stream为True,流式返回 如果正常,返回{'token': xxx, 'is_end': xxx, 'send_idx': xxx, ..., 'error_msg': '', 'error_code': 0} 如果异常,返回{'error_msg': xxx, 'error_code': xxx},error_msg字段不为空,error_code字段不为0 @@ -209,7 +251,7 @@ for line in res.iter_lines(): 提示:使用 OpenAI 客户端需要配置 `PUSH_MODE_HTTP_PORT`! -``` +```python import openai push_mode_http_port = "9965" # 服务配置的PUSH_MODE_HTTP_PORT @@ -217,8 +259,8 @@ client = openai.Client(base_url=f"http://127.0.0.1:{push_mode_http_port}/v1/chat # 非流式返回 response = client.completions.create( - model="default", - prompt="Hello, how are you?", + model="default", + prompt="Hello, how are you?", max_tokens=50, stream=False, ) @@ -228,8 +270,8 @@ print("\n") # 流式返回 response = client.completions.create( - model="default", - prompt="Hello, how are you?", + model="default", + prompt="Hello, how are you?", max_tokens=100, stream=True, ) @@ -275,10 +317,10 @@ for chunk in response: print("\n") ``` -## 基于dockerfile创建自己的镜像 +## 基于 dockerfile 创建自己的镜像 -为了方便用户构建自定义服务,我们提供了基于dockerfile创建自己的镜像的脚本。 -``` +为了方便用户构建自定义服务,我们提供了基于 dockerfile 创建自己的镜像的脚本。 +```shell git clone https://github.com/PaddlePaddle/PaddleNLP.git cd PaddleNLP/llm/server @@ -292,35 +334,35 @@ docker build --network=host -f ./dockerfiles/Dockerfile_serving_cuda123_cudnn9 - | :---: | :-----: | :---: | :---: | :-----: | :----: | | MP_NUM | int | 模型并行度 | 否 | 8 | CUDA_VISIBLE_DEVICES 需配置对应卡数 | | CUDA_VISIBLE_DEVICES | str | 使用 GPU 编号 | 否 | 0,1,2,3,4,5,6,7 | | -| HTTP_PORT | int | 探活服务的http端口 | 是 | 无 | 当前仅用于健康检查、探活 | -| GRPC_PORT | int | 模型推服务的grpc端口 | 是 | 无 | | +| HTTP_PORT | int | 探活服务的 http 端口 | 是 | 无 | 当前仅用于健康检查、探活 | +| GRPC_PORT | int | 模型推服务的 grpc 端口 | 是 | 无 | | | METRICS_PORT | int | 模型服务中监督指标的端口 | 是 | 无 | | | INFER_QUEUE_PORT | int | 模型服务内部使用的端口 | 否 | 56666 | | -| PUSH_MODE_HTTP_PORT | int | 服务请求HTTP端口号 | 否 | -1 | 如不配置,服务只支持GRPC协议 | +| PUSH_MODE_HTTP_PORT | int | 服务请求 HTTP 端口号 | 否 | -1 | 如不配置,服务只支持 GRPC 协议 | | DISABLE_STREAMING | int | 是否使用流式返回 | 否 | 0 | | -| MAX_SEQ_LEN | int | 最大输入序列长度 | 否 | 8192 | 服务会拒绝input token数量超过MAX_SEQ_LEN的请求,并返回错误提示 | -| MAX_DEC_LEN | int | 最大decoer序列长度 | 否 | 1024 | 服务会拒绝请求中max_dec_len/min_dec_len超过此参数的请求,并返回错误提示 | -| BATCH_SIZE | int | 最大Batch Size | 否 | 50 | 模型可同时并发处理的最大输入数量,不能高于128 | -| BLOCK_BS | int | 缓存Block支持的最大Query Batch Size | 否 | 50 | 如果出现out of memeory 错误,尝试减少该数值 | -| BLOCK_RATIO | float | | 否 | 0.75 | 建议配置 输入平均Token数/(输入+输出平均Token数) 
| +| MAX_SEQ_LEN | int | 最大输入序列长度 | 否 | 8192 | 服务会拒绝 input token 数量超过 MAX_SEQ_LEN 的请求,并返回错误提示 | +| MAX_DEC_LEN | int | 最大 decoer 序列长度 | 否 | 1024 | 服务会拒绝请求中 max_dec_len/min_dec_len 超过此参数的请求,并返回错误提示 | +| BATCH_SIZE | int | 最大 Batch Size | 否 | 50 | 模型可同时并发处理的最大输入数量,不能高于128 | +| BLOCK_BS | int | 缓存 Block 支持的最大 Query Batch Size | 否 | 50 | 如果出现 out of memeory 错误,尝试减少该数值 | +| BLOCK_RATIO | float | | 否 | 0.75 | 建议配置 输入平均 Token 数/(输入+输出平均 Token 数) | | MAX_CACHED_TASK_NUM | int | 服务缓存队列最大长度 | 否 | 128 | 队列达到上限后,会拒绝新的请求 | -| PUSH_MODE_HTTP_WORKERS | int | HTTP服务进程数 | 否 | 1 | 在 PUSH_MODE_HTTP_PORT 配置的情况下有效,高并发下提高该数值,建议最高配置为8 | +| PUSH_MODE_HTTP_WORKERS | int | HTTP 服务进程数 | 否 | 1 | 在 PUSH_MODE_HTTP_PORT 配置的情况下有效,高并发下提高该数值,建议最高配置为8 | | USE_WARMUP | int | 是否进行 warmup | 否 | 0 | | -| USE_HF_TOKENIZER | int | 是否进行使用huggingface的词表 | 否 | 0 | | -| USE_CACHE_KV_INT8 | int | 是否将INT8配置为KV Cache的类型 | 否 | 0 | c8量化模型需要配置为1 | +| USE_HF_TOKENIZER | int | 是否进行使用 huggingface 的词表 | 否 | 0 | | +| USE_CACHE_KV_INT8 | int | 是否将 INT8配置为 KV Cache 的类型 | 否 | 0 | c8量化模型需要配置为1 | | MODEL_DIR | str | 模型文件路径 | 否 | /models/ | | -| FD_MODEL_CONFIG_PATH | str | 模型config文件路径 | 否 | ${model_dir}/config.json | | +| FD_MODEL_CONFIG_PATH | str | 模型 config 文件路径 | 否 | ${model_dir}/config.json | | | DISTRIBUTED_CONFIG | str | 模型分布式配置文件路径 | 否 | ${model_dir}/rank_mapping.csv | | ## 请求参数介绍 | 字段名 | 字段类型 | 说明 | 是否必填 | 默认值 | 备注 | | :---: | :-----: | :---: | :---: | :-----: | :----: | -| req_id | str | 请求ID,用于标识一个请求。建议设置req_id,保证其唯一性 | 否 | 随机id | 如果推理服务中同时有两个相同req_id的请求,会返回req_id重复的错误信息 | +| req_id | str | 请求 ID,用于标识一个请求。建议设置 req_id,保证其唯一性 | 否 | 随机 id | 如果推理服务中同时有两个相同 req_id 的请求,会返回 req_id 重复的错误信息 | | text | str | 请求的文本 | 否 | 无 | text 和 messages 必须有一个 | -| messages | str | 多轮对话文本 | 否 | 无 | 多轮对话以list方式存储 | -| max_dec_len | int | 最大生成token的长度,如果请求的文本token长度加上max_dec_len大于模型的max_seq_len,会返回长度超限的错误信息 | 否 | max_seq_len减去文本token长度 | | -| min_dec_len | int | 最小生成token的长度,最小是1 | 否 | 1 | | +| messages | str | 多轮对话文本 | 否 | 无 | 多轮对话以 list 方式存储 | +| max_dec_len | int | 最大生成 token 的长度,如果请求的文本 token 长度加上 max_dec_len 大于模型的 max_seq_len,会返回长度超限的错误信息 | 否 | max_seq_len 减去文本 token 长度 | | +| min_dec_len | int | 最小生成 token 的长度,最小是1 | 否 | 1 | | | topp | float | 控制随机性参数,数值越大则随机性越大,范围是0~1 | 否 | 0.7 | | | temperature | float | 控制随机性参数,数值越小随机性越大,需要大于 0 | 否 | 0.95 | | | frequency_score | float | 频率分数 | 否 | 0 | | @@ -330,5 +372,5 @@ docker build --network=host -f ./dockerfiles/Dockerfile_serving_cuda123_cudnn9 - | timeout | int | 请求等待的超时时间,单位是秒 | 否 | 300 | | | return_usage | bool | 是否返回输入、输出 token 数量 | 否 | False | | -* 在正确配置PUSH_MODE_HTTP_PORT字段下,服务支持 GRPC 和 HTTP 两种请求服务 +* 在正确配置 PUSH_MODE_HTTP_PORT 字段下,服务支持 GRPC 和 HTTP 两种请求服务 * stream 参数仅对 HTTP 请求生效 diff --git a/llm/server/server/scripts/start_server.sh b/llm/server/server/scripts/start_server.sh index e7975b3e838f..d4956d3065ac 100644 --- a/llm/server/server/scripts/start_server.sh +++ b/llm/server/server/scripts/start_server.sh @@ -6,14 +6,12 @@ export PYTHONIOENCODING=utf8 export LC_ALL=C.UTF-8 # PaddlePaddle environment variables -export FLAGS_allocator_strategy=auto_growth -export FLAGS_dynamic_static_unified_comm=0 -export FLAGS_use_xqa_optim=1 export FLAGS_gemm_use_half_precision_compute_type=0 export NVIDIA_TF32_OVERRIDE=0 # Model hyperparameters -export MP_NUM=${MP_NUM:-"1"} # Number of GPUs +export MP_NUM=${MP_NUM:-"1"} # number of model parallelism +export MP_NNODES=${MP_NNODES:-"1"} # number of nodes export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-"0"} # GPU ids export MAX_SEQ_LEN=${MAX_SEQ_LEN:-"8192"} export 
MAX_DEC_LEN=${MAX_DEC_LEN:-"2048"} @@ -43,7 +41,26 @@ mkdir -p log rm -rf console.log log/* rm -rf /dev/shm/* -echo "start serving ..." +FED_POD_IP=$(hostname -i) +if [ "$MP_NNODE" -gt 1 ]; then + POD_0_IP=$POD_0_IP + HOST_IP=$FED_POD_IP +else + POD_0_IP="127.0.0.1" + HOST_IP="127.0.0.1" +fi + +echo "POD_0_IP: $POD_0_IP HOST_IP: $HOST_IP" + +if [ "$POD_0_IP" == "$HOST_IP" ]; then + echo "Master node, start serving ..." +else + echo "Slave node, start push mode" + # waiting for master node to start serving ... + sleep ${SERVER_WAITTING_TIME:-"25"} +fi + + tritonserver --exit-timeout-secs 100 --cuda-memory-pool-byte-size 0:0 --cuda-memory-pool-byte-size 1:0 \ --cuda-memory-pool-byte-size 2:0 --cuda-memory-pool-byte-size 3:0 --cuda-memory-pool-byte-size 4:0 \ diff --git a/llm/server/server/server/data/processor.py b/llm/server/server/server/data/processor.py index 423fe6b61408..6e7873d98bd8 100644 --- a/llm/server/server/server/data/processor.py +++ b/llm/server/server/server/data/processor.py @@ -19,6 +19,7 @@ from paddlenlp.trl.llm_utils import get_eos_token_id from server.engine.config import Config from server.utils import data_processor_logger +from paddlenlp.utils.env import USE_FAST_TOKENIZER class BaseDataProcessor(ABC): @@ -121,7 +122,8 @@ class DataProcessor(BaseDataProcessor): def __init__(self): self.config = Config() max_length = self.config.get_model_config().get('max_length', 1024) - self.src_length = max_length - self.config.seq_len_limit + self.src_length = self.config.seq_len_limit - max_length + self.decode_status = dict() self.tokenizer = self._load_tokenizer() @@ -143,6 +145,9 @@ def process_request(self, request, max_seq_len=None): request["eos_token_ids"] = [] request["eos_token_ids"].extend(get_eos_token_id(self.tokenizer, self.config.generation_config)) + if "stop_seqs" not in request or (isinstance(request["stop_seqs"], (list, tuple)) and len(request["stop_seqs"]) == 0): + self.update_stop_seq(request) + if "input_ids" not in request or \ (isinstance(request["input_ids"], (list, tuple)) and len(request["input_ids"]) == 0): if "text" in request: @@ -282,10 +287,10 @@ def _load_tokenizer(self): """ if self.config.use_hf_tokenizer: from transformers import AutoTokenizer - return AutoTokenizer.from_pretrained(self.config.model_dir, use_fast=False, vocab_file=os.path.join(self.config.model_dir, "sentencepiece.bpe.model")) + return AutoTokenizer.from_pretrained(self.config.model_dir, use_fast=False) else: from paddlenlp.transformers import AutoTokenizer - return AutoTokenizer.from_pretrained(self.config.model_dir) + return AutoTokenizer.from_pretrained(self.config.model_dir, use_fast=USE_FAST_TOKENIZER) def clear_request_status(self, task_id): """ @@ -334,3 +339,43 @@ def get_pad_id(self): if isinstance(self.tokenizer, (LlamaTokenizer, Llama3Tokenizer)) and not self.tokenizer.pad_token_id: return self.tokenizer.eos_token return self.tokenizer.pad_token_id + + def pad_batch_data(self, insts, pad_id=0, return_seq_len=False, return_array=True, pad_style="right"): + """Pad the instances to the max sequence length in batch.""" + if len(insts) == 0: + padded_insts = np.array([[]], dtype=np.int64) if return_array else [[]] + if return_seq_len: + seq_len = np.array([], dtype=np.int64) if return_array else [] + return padded_insts, seq_len + return padded_insts + + max_len = max(map(len, insts)) + if pad_style == "left": + padded_insts = [[pad_id] * (max_len - len(inst)) + list(inst) for inst in insts] + else: + padded_insts = [list(inst) + [pad_id] * (max_len - len(inst)) for inst 
in insts] + if return_array: + padded_insts = np.array(padded_insts, dtype=np.int64).reshape([-1, max_len]) + + if return_seq_len: + seq_len = [len(inst) for inst in insts] + if return_array: + seq_len = np.array(seq_len, dtype=np.int64).reshape(-1, 1) + return padded_insts, seq_len + return padded_insts + + def update_stop_seq(self, request): + """ + Update stop sequences from request. + """ + stop_seqs = [] + for seq in request.get("stop_sequences", []): + if seq != self.tokenizer.eos_token_id: + stop_seqs.append(self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(seq))) + request["stop_seqs"], request["stop_seqs_len"] = self.pad_batch_data( + stop_seqs, + pad_id=-1, + return_seq_len=True, + return_array=False + ) + data_processor_logger.debug(f"processed request: {request['stop_seqs'], request['stop_seqs_len']}") diff --git a/llm/server/server/server/engine/config.py b/llm/server/server/server/engine/config.py index 6f0e1964e21f..fe25da48fb3d 100644 --- a/llm/server/server/server/engine/config.py +++ b/llm/server/server/server/engine/config.py @@ -19,6 +19,7 @@ from paddlenlp.generation import GenerationConfig from server.utils import model_server_logger +from dataclasses import dataclass class Config: @@ -58,6 +59,12 @@ def read_from_env(self): else: raise Exception(f"unsupported device type: {self.device}") + # multi-node config + self.nnode = int(env.get("MP_NNODE", "1")) + assert self.mp_num % self.nnode == 0 ,f"mp_num: {self.mp_num} should be divisible by nnode: {self.nnode}" + self.mp_num_per_node = self.mp_num // self.nnode + self.host_ip = os.getenv("HOST_IP", "127.0.0.1") + # Triton config self.max_prefill_batch = int(os.getenv("MAX_PREFILL_BATCH", 1)) if self.max_prefill_batch <= 0: @@ -92,6 +99,7 @@ def read_from_env(self): self.use_cache_kv_int8 = int(os.getenv("USE_CACHE_KV_INT8", 0)) self.use_cache_kv_int4 = int(os.getenv("USE_CACHE_KV_INT4", 0)) + # infer config self.max_batch_size = int(env.get("BATCH_SIZE", 50)) self.max_seq_len = int(env.get("MAX_SEQ_LEN", 8192)) @@ -167,6 +175,20 @@ def check(self): f"which means the exported MAX_DEC_LEN should less than " f"{self.max_seq_len}, but now it's {self.dec_len_limit}." ) + if os.getenv("DISABLE_CAPACITY_CHECKER", "0") == 1: + # max_output_token_num + max_output_token_num = (self.total_block_num - self.max_block_num) * self.block_size + self.enc_dec_block_num * self.block_size + assert max_output_token_num >= self.dec_len_limit, ( + f"The available output token number of the service is {max_output_token_num}, " + f"which is less than the setting MAX_DEC_LEN:{self.dec_len_limit}. " + ) + + # Maximum input length of a single query that the service can handle + max_input_token_num = int(math.floor(self.max_block_num * self.block_size - self.dec_token_num)) + assert max_input_token_num >= self.seq_len_limit, ( + f"The available input token number of the service is {max_input_token_num}, " + f"which is less than the setting MAX_SEQ_LEN:{self.seq_len_limit}. 
" + ) def print(self, file=None): """ @@ -203,6 +225,27 @@ def get_model_config(self): model_config_json = json.load(open(self.model_config_path, 'r', encoding='utf-8')) return model_config_json + def get_speculate_config(self): + """ + get speculate_decoding related config + + Returns: + SpeculateConfig: the speculate related config + """ + speculate_config = SpeculateConfig() + model_cfg = self.get_model_config() + if model_cfg.get("speculate_method", "None") != "None": + speculate_config.speculate_method = str(model_cfg["speculate_method"]) + speculate_config.speculate_max_draft_token_num = model_cfg[ + "speculate_max_draft_token_num"] + speculate_config.speculate_max_ngram_size = model_cfg[ + "speculate_max_ngram_size"] + + if speculate_config.speculate_method not in ["None", "inference_with_reference"]: + model_server_logger.error(f"Unsupport speculate method: {speculate_config.speculate_method}") + + return speculate_config + def read_from_config(self): """ reset model config from json file @@ -234,3 +277,10 @@ def get_unique_name(self, name): def __str__(self) -> str: return json.dumps(self.__dict__, indent=4) + + +@dataclass +class SpeculateConfig: + speculate_method: str = "None" + speculate_max_draft_token_num: int = 1 + speculate_max_ngram_size: int = 1 \ No newline at end of file diff --git a/llm/server/server/server/engine/engine.py b/llm/server/server/server/engine/engine.py index 932404d9c094..4bf0fcfeffb8 100644 --- a/llm/server/server/server/engine/engine.py +++ b/llm/server/server/server/engine/engine.py @@ -50,7 +50,9 @@ def start(self): """ assert not self.is_started, "The engine is already started.!" start_time = time.time() - self.queue_service = self._start_tasks_queue_service() + # Master node only + if self.cfg.nnode == 1 or self.cfg.host_ip == os.getenv('POD_0_IP', '127.0.0.1'): + self.queue_service = self._start_tasks_queue_service() self.tasks_queue = TaskQueueManager(mp_num=self.cfg.mp_num, port=self.cfg.infer_port) self.token_processor.tasks_queue = self.tasks_queue @@ -258,7 +260,7 @@ def _infer_processes_ready(self): Returns: return: True if all ready, False otherwise """ - if np.sum(self.flag_ready_array) == self.cfg.mp_num: + if np.sum(self.flag_ready_array) == self.cfg.mp_num_per_node: return True return False @@ -378,7 +380,8 @@ def _start_gpu_infer_service(self): pd_cmd = "python3 -m paddle.distributed.launch " py_script = os.path.join(current_dir_path, "infer.py") - arguments = (f" --devices {self.cfg.device_ids} {py_script} --model_dir {self.cfg.model_dir}" + arguments = (f" --nnodes {str(self.cfg.nnode)}" + f" --devices {self.cfg.device_ids} {py_script} --model_dir {self.cfg.model_dir}" f" --max_batch_size {self.cfg.max_batch_size} --max_seq_len {self.cfg.max_seq_len}" f" --max_dec_len {self.cfg.max_dec_len}" f" --max_block_num {self.cfg.total_block_num} --block_size {self.cfg.block_size}" diff --git a/llm/server/server/server/engine/infer.py b/llm/server/server/server/engine/infer.py index 63e87e425058..030657cbb0f6 100644 --- a/llm/server/server/server/engine/infer.py +++ b/llm/server/server/server/engine/infer.py @@ -25,13 +25,20 @@ import paddle import paddle.distributed as dist import paddle.distributed.fleet as fleet -from paddlenlp.trl.llm_utils import get_rotary_position_embedding +from paddle.base.framework import use_pir_api from paddlenlp_ops import step_paddle from server.data.processor import DataProcessor from server.engine.config import Config from server.utils import get_logger from task_queue_manager import TaskQueueManager +from 
paddlenlp.experimental.transformers import InferenceWithReferenceProposer +from paddlenlp.trl.llm_utils import get_rotary_position_embedding +from paddlenlp.utils.env import ( + PADDLE_INFERENCE_MODEL_SUFFIX, + PADDLE_INFERENCE_WEIGHTS_SUFFIX, +) + File_Path = os.path.realpath(sys.argv[0]) Dir_Path = os.path.dirname(File_Path) logger = get_logger("infer_server", "infer.log") @@ -46,11 +53,21 @@ def __init__(self, args): self.config = Config() self.model_cfg = self.config.get_model_config() + self.speculate_config = self.config.get_speculate_config() + self.is_speculate_decoding = self.speculate_config.speculate_method != "None" self.format_print_configuration() self.args.num_layers = self.get_value(self.model_cfg, ["num_hidden_layers", "num_layers"]) self.args.num_attention_heads = self.get_value(self.model_cfg, ["num_attention_heads", "n_head"]) self.args.hidden_size = self.model_cfg["hidden_size"] + if "deepseek" in self.model_cfg["model_type"]: + self.qk_nope_head_dim = int(self.model_cfg["qk_nope_head_dim"]) + self.qk_rope_head_dim = int(self.model_cfg["qk_rope_head_dim"]) + self.v_head_dim = int(self.model_cfg["v_head_dim"]) + + + self.max_stop_seqs_num = int(os.getenv("MAX_STOP_SEQS_NUM", 5)) + self.stop_seqs_max_len = int(os.getenv("STOP_SEQS_MAX_LEN", 8)) self.nranks = dist.get_world_size() self.init_dist_env() @@ -62,18 +79,31 @@ def __init__(self, args): self.cache_kvs = {} self.init_inputs() + if self.is_speculate_decoding: + logger.info(f"Using speculate decoding, method: {self.speculate_config.speculate_method}.") + if self.speculate_config.speculate_method == "inference_with_reference": + self.proposer = InferenceWithReferenceProposer( + self.speculate_config.speculate_max_draft_token_num, + self.speculate_config.speculate_max_ngram_size, + self.args.max_batch_size, + self.args.max_seq_len, + ) + else: + self.proposer = None + self.infer_queue = TaskQueueManager(rank=self.rank, mp_num=self.nranks, port=self.config.infer_port) model_rank_path = os.path.join(self.args.model_dir, f"rank_{self.rank}") if not os.path.exists(model_rank_path): model_rank_path = self.args.model_dir - self.infer_engine = InferenceEngine(model_dir=model_rank_path, - share_inputs=self.share_inputs, - cache_kvs=self.cache_kvs, - config=self.config, - mp_degree=self.nranks - ) + self.infer_engine = InferenceEngine( + model_dir=model_rank_path, + share_inputs=self.share_inputs, + cache_kvs=self.cache_kvs, + config=self.config, + mp_degree=self.nranks, + ) def read_model_config(self): """ @@ -82,7 +112,7 @@ def read_model_config(self): Returns: model_config_json: dict, model config file """ - model_config_json = json.load(open(self.config_file, 'r', encoding='utf-8')) + model_config_json = json.load(open(self.config_file, "r", encoding="utf-8")) return model_config_json def get_value(self, cfg, names): @@ -95,9 +125,7 @@ def get_value(self, cfg, names): if name in cfg: return cfg[name] break - raise Exception( - "Cannot find any one of key in {} in configuration file.".format( - names)) + raise Exception("Cannot find any one of key in {} in configuration file.".format(names)) def format_print_configuration(self): """ @@ -117,13 +145,13 @@ def load_model_init_val(self): """ self.top_p = self.model_cfg.get("top_p", 0.0) self.temperature = self.model_cfg.get("temperature", 1.0) - self.rope_theta = self.model_cfg.get('rope_theta', 10000.0) - self.rope_scaling = self.model_cfg.get('rope_scaling', None) - self.penalty_score = self.model_cfg.get('penalty_score', 1.0) - self.frequency_score = 
self.model_cfg.get('frequency_score', 0.0) - self.presence_score = self.model_cfg.get('presence_score', 0.0) - self.min_length = self.model_cfg.get('min_length', 1) - self.max_length = self.model_cfg.get('max_length', 1024) + self.rope_theta = self.model_cfg.get("rope_theta", 10000.0) + self.rope_scaling = self.model_cfg.get("rope_scaling", None) + self.penalty_score = self.model_cfg.get("penalty_score", 1.0) + self.frequency_score = self.model_cfg.get("frequency_score", 0.0) + self.presence_score = self.model_cfg.get("presence_score", 0.0) + self.min_length = self.model_cfg.get("min_length", 1) + self.max_length = self.model_cfg.get("max_length", 1024) data_processor = DataProcessor() # reserve an eos token for request @@ -149,9 +177,11 @@ def init_dist_env(self, seed=20): def init_inputs(self): # init all inputs - if "num_key_value_heads" in self.model_cfg and \ - self.model_cfg["num_key_value_heads"] is not None and \ - int(self.model_cfg["num_key_value_heads"]) > 0: + if ( + "num_key_value_heads" in self.model_cfg + and self.model_cfg["num_key_value_heads"] is not None + and int(self.model_cfg["num_key_value_heads"]) > 0 + ): kv_num_head = int(self.model_cfg["num_key_value_heads"]) // self.nranks else: kv_num_head = self.args.num_attention_heads // self.nranks @@ -161,90 +191,155 @@ def init_inputs(self): cache_type = self.args.dtype else: cache_type = "uint8" - - self.cache_kvs["key_caches_{}".format(i)] = paddle.full(shape=[ - self.args.max_block_num, kv_num_head, - self.args.block_size, self.args.hidden_size // self.args.num_attention_heads - ], fill_value=0, dtype=cache_type) - self.cache_kvs["value_caches_{}".format(i)] = paddle.full(shape=[ - self.args.max_block_num, kv_num_head, - self.args.block_size, self.args.hidden_size // self.args.num_attention_heads - ], fill_value=0, dtype=cache_type) - - pre_max_block_num = (self.args.max_seq_len + self.args.block_size - 1) // self.args.block_size + self.args.enc_dec_block_num + + if "deepseek" in self.model_cfg["model_type"]: + self.cache_kvs["key_caches_{}".format(i)] = paddle.full(shape=[ + self.args.max_block_num, kv_num_head, + self.args.block_size, + self.qk_nope_head_dim + self.qk_rope_head_dim + ], fill_value=0, dtype=cache_type) + self.cache_kvs["value_caches_{}".format(i)] = paddle.full(shape=[ + self.args.max_block_num, kv_num_head, + self.args.block_size, self.v_head_dim + ], fill_value=0, dtype=cache_type) + else: + self.cache_kvs["key_caches_{}".format(i)] = paddle.full(shape=[ + self.args.max_block_num, kv_num_head, + self.args.block_size, self.args.hidden_size // self.args.num_attention_heads + ], fill_value=0, dtype=cache_type) + self.cache_kvs["value_caches_{}".format(i)] = paddle.full(shape=[ + self.args.max_block_num, kv_num_head, + self.args.block_size, self.args.hidden_size // self.args.num_attention_heads + ], fill_value=0, dtype=cache_type) + + pre_max_block_num = ( + self.args.max_seq_len + self.args.block_size - 1 + ) // self.args.block_size + self.args.enc_dec_block_num self.share_inputs["block_tables"] = paddle.full( - shape=[self.args.max_batch_size, pre_max_block_num], fill_value=-1, dtype="int32") + shape=[self.args.max_batch_size, pre_max_block_num], fill_value=-1, dtype="int32" + ) - self.share_inputs['pre_ids'] = paddle.to_tensor( - np.full((self.args.max_batch_size, self.args.max_dec_len), -1, dtype='int64')) + self.share_inputs["pre_ids"] = paddle.to_tensor( + np.full((self.args.max_batch_size, self.args.max_dec_len), -1, dtype="int64") + ) tmp_position_ids = 
paddle.arange(self.args.max_seq_len).reshape((1, -1)) - self.share_inputs['rope_emb'] = get_rotary_position_embedding(tmp_position_ids, - self.args.hidden_size // self.args.num_attention_heads, - self.rope_theta, self.rope_scaling) - self.share_inputs['input_ids'] = paddle.full( - shape=[self.args.max_batch_size, self.args.max_seq_len], - fill_value=self.pad_token_id, dtype='int64') - self.share_inputs['top_p'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=self.top_p, dtype="float32") - self.share_inputs['temperature'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=self.temperature, dtype="float32") - self.share_inputs['eos_token_id'] = paddle.to_tensor( - np.zeros((self.eos_tokens_lens, 1)).reshape(-1, 1).astype("int64")) - self.share_inputs['penalty_score'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=self.penalty_score, dtype="float32") - self.share_inputs['frequency_score'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=self.frequency_score, dtype="float32") - self.share_inputs['presence_score'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=self.presence_score, dtype="float32") - self.share_inputs['seq_lens_this_time'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32") - self.share_inputs['seq_lens_encoder'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32") - self.share_inputs['step_seq_lens_encoder'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32") - self.share_inputs['seq_lens_decoder'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32") - self.share_inputs['step_idx'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64") - self.share_inputs['min_length'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=self.min_length, dtype="int64") - self.share_inputs['max_length'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=self.max_length, dtype="int64") - self.share_inputs['not_need_stop'] = paddle.full( - shape=[1], fill_value=False, dtype="bool") - self.share_inputs['stop_flags'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=True, dtype="bool") - self.share_inputs['stop_nums'] = paddle.full( - shape=[1], fill_value=self.args.max_batch_size, dtype="int64") - self.share_inputs['bad_tokens'] = paddle.full( - shape=[1], fill_value=-1, dtype="int64") - self.share_inputs['next_tokens'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=-1, dtype="int64") - self.share_inputs['is_block_step'] = paddle.full( - shape=[self.args.max_batch_size], fill_value=False, dtype="bool") - self.share_inputs['encoder_block_lens'] = paddle.full( - shape=[self.args.max_batch_size], fill_value=0, dtype="int32") - self.share_inputs['step_block_list'] = paddle.full( - shape=[self.args.max_batch_size], fill_value=-1, dtype="int32") - self.share_inputs['step_lens'] = paddle.full(shape=[1], fill_value=0, dtype="int32") - self.share_inputs['recover_block_list'] = paddle.full( - shape=[self.args.max_batch_size], fill_value=-1, dtype="int32") - self.share_inputs['recover_lens'] = paddle.full( - shape=[1], fill_value=0, dtype="int32") - self.share_inputs['need_block_list'] = paddle.full( - shape=[self.args.max_batch_size], fill_value=-1, dtype="int32") - self.share_inputs['need_block_len'] = paddle.full( - shape=[1], fill_value=0, dtype="int32") - self.share_inputs['used_list_len'] = paddle.full( - 
shape=[self.args.max_batch_size], fill_value=0, dtype="int32") - self.share_inputs['infer_seed'] = paddle.full( - shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64") + self.share_inputs["rope_emb"] = get_rotary_position_embedding( + tmp_position_ids, + self.args.hidden_size // self.args.num_attention_heads, + self.rope_theta, + self.rope_scaling, + ) + self.share_inputs["input_ids"] = paddle.full( + shape=[self.args.max_batch_size, self.args.max_seq_len], fill_value=self.pad_token_id, dtype="int64" + ) + self.share_inputs["top_p"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=self.top_p, dtype="float32" + ) + self.share_inputs["temperature"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=self.temperature, dtype="float32" + ) + self.share_inputs["eos_token_id"] = paddle.to_tensor( + np.zeros((self.eos_tokens_lens, 1)).reshape(-1, 1).astype("int64") + ) + self.share_inputs["penalty_score"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=self.penalty_score, dtype="float32" + ) + self.share_inputs["frequency_score"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=self.frequency_score, dtype="float32" + ) + self.share_inputs["presence_score"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=self.presence_score, dtype="float32" + ) + self.share_inputs["seq_lens_this_time"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32" + ) + self.share_inputs["seq_lens_encoder"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32" + ) + self.share_inputs["step_seq_lens_encoder"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32" + ) + self.share_inputs["seq_lens_decoder"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32" + ) + self.share_inputs["step_idx"] = paddle.full(shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64") + self.share_inputs["min_length"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=self.min_length, dtype="int64" + ) + self.share_inputs["max_length"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=self.max_length, dtype="int64" + ) + self.share_inputs["not_need_stop"] = paddle.full(shape=[1], fill_value=False, dtype="bool") + self.share_inputs["stop_flags"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=True, dtype="bool" + ) + self.share_inputs["stop_nums"] = paddle.full(shape=[1], fill_value=self.args.max_batch_size, dtype="int64") + self.share_inputs["bad_tokens"] = paddle.full(shape=[1], fill_value=-1, dtype="int64") + self.share_inputs["next_tokens"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=-1, dtype="int64" + ) + self.share_inputs["is_block_step"] = paddle.full( + shape=[self.args.max_batch_size], fill_value=False, dtype="bool" + ) + self.share_inputs["encoder_block_lens"] = paddle.full( + shape=[self.args.max_batch_size], fill_value=0, dtype="int32" + ) + self.share_inputs["step_block_list"] = paddle.full( + shape=[self.args.max_batch_size], fill_value=-1, dtype="int32" + ) + self.share_inputs["step_lens"] = paddle.full(shape=[1], fill_value=0, dtype="int32") + self.share_inputs["recover_block_list"] = paddle.full( + shape=[self.args.max_batch_size], fill_value=-1, dtype="int32" + ) + self.share_inputs["recover_lens"] = paddle.full(shape=[1], fill_value=0, dtype="int32") + self.share_inputs["need_block_list"] = paddle.full( + shape=[self.args.max_batch_size], fill_value=-1, 
dtype="int32" + ) + self.share_inputs["need_block_len"] = paddle.full(shape=[1], fill_value=0, dtype="int32") + self.share_inputs["used_list_len"] = paddle.full(shape=[self.args.max_batch_size], fill_value=0, dtype="int32") + self.share_inputs["infer_seed"] = paddle.full(shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int64") free_list = list(range(int(self.args.max_block_num * self.args.block_ratio))) self.free_list_len = len(free_list) - self.share_inputs['free_list'] = paddle.to_tensor(free_list, dtype="int32") - self.share_inputs['free_list_len'] = paddle.full( - shape=[1], fill_value=self.free_list_len, dtype="int32") + self.share_inputs["free_list"] = paddle.to_tensor(free_list, dtype="int32") + self.share_inputs["free_list_len"] = paddle.full(shape=[1], fill_value=self.free_list_len, dtype="int32") + + self.share_inputs["stop_seqs_len"] = paddle.full( + shape=[ + self.max_stop_seqs_num, + ], + fill_value=0, + dtype="int32", + ) + self.share_inputs["stop_seqs"] = paddle.full( + shape=[self.max_stop_seqs_num, self.stop_seqs_max_len], fill_value=-1, dtype="int64" + ) + + + self.share_inputs["first_token_ids"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=-1, dtype="int64") + self.share_inputs["ori_seq_lens_encoder"] = paddle.full( + shape=[self.args.max_batch_size, 1], fill_value=0, dtype="int32") + # speculate decoding input + if self.is_speculate_decoding: + self.share_inputs["accept_tokens"] = paddle.full( + shape=[self.args.max_batch_size, self.speculate_config.speculate_max_draft_token_num + 1], + fill_value=0, + dtype="int64", + ) + self.share_inputs["accept_num"] = paddle.full( + shape=[self.args.max_batch_size], fill_value=0, dtype="int32" + ) + self.share_inputs["draft_tokens"] = paddle.full( + shape=[self.args.max_batch_size, self.speculate_config.speculate_max_draft_token_num + 1], + fill_value=0, + dtype="int64", + ) + self.share_inputs["actual_draft_token_num"] = paddle.full( + shape=[self.args.max_batch_size], + fill_value=self.speculate_config.speculate_max_draft_token_num, + dtype="int32", + ) def dy_input_preprocess(self, tasks): """ @@ -252,46 +347,74 @@ def dy_input_preprocess(self, tasks): """ for i in range(len(tasks)): task = tasks[i] - idx = task['idx'] - length = len(task['input_ids']) - self.share_inputs['input_ids'][idx:idx + 1, :length] = np.array(task['input_ids']) - if len(task['eos_token_ids']) < self.eos_tokens_lens: - task['eos_token_ids'].append(task['eos_token_ids'][0]) - self.share_inputs['eos_token_id'][:] = np.array(task['eos_token_ids'], dtype="int64").reshape(-1, 1) - self.share_inputs['pre_ids'][idx:idx + 1] = -1 - self.share_inputs['top_p'][idx:idx + 1] = task.get('topp', 0.7) - self.share_inputs['temperature'][idx:idx + 1] = task.get('temperature', 0.95) - self.share_inputs['penalty_score'][idx:idx + 1] = task.get('penalty_score', 1.0) - self.share_inputs['frequency_score'][idx:idx + 1] = task.get('frequency_score', 0.0) - self.share_inputs['presence_score'][idx:idx + 1] = task.get('presence_score', 0.0) - self.share_inputs['seq_lens_this_time'][idx:idx + 1] = length - self.share_inputs['step_seq_lens_encoder'][idx:idx + 1] = length - self.share_inputs['seq_lens_encoder'][idx:idx + 1] = length - self.share_inputs['seq_lens_decoder'][idx:idx + 1] = 0 - self.share_inputs['step_idx'][idx:idx + 1] = 0 - self.share_inputs['min_length'][idx:idx + 1] = task.get('min_dec_len', 1) + idx = task["idx"] + length = len(task["input_ids"]) + self.share_inputs["input_ids"][idx : idx + 1, :length] = np.array(task["input_ids"]) + if 
len(task["eos_token_ids"]) < self.eos_tokens_lens: + task["eos_token_ids"].append(task["eos_token_ids"][0]) + self.share_inputs["eos_token_id"][:] = np.array(task["eos_token_ids"], dtype="int64").reshape(-1, 1) + self.share_inputs["pre_ids"][idx : idx + 1] = -1 + self.share_inputs["top_p"][idx : idx + 1] = task.get("topp", 0.7) + self.share_inputs["temperature"][idx : idx + 1] = task.get("temperature", 0.95) + self.share_inputs["penalty_score"][idx : idx + 1] = task.get("penalty_score", 1.0) + self.share_inputs["frequency_score"][idx : idx + 1] = task.get("frequency_score", 0.0) + self.share_inputs["presence_score"][idx : idx + 1] = task.get("presence_score", 0.0) + self.share_inputs["seq_lens_this_time"][idx : idx + 1] = length + self.share_inputs["step_seq_lens_encoder"][idx : idx + 1] = length + self.share_inputs["seq_lens_encoder"][idx : idx + 1] = length + self.share_inputs["seq_lens_decoder"][idx : idx + 1] = 0 + self.share_inputs["step_idx"][idx : idx + 1] = 0 + self.share_inputs["min_length"][idx : idx + 1] = task.get("min_dec_len", 1) if "max_dec_len" in task: - max_dec_len = task['max_dec_len'] + max_dec_len = task["max_dec_len"] elif "seq_len" in task: - max_dec_len = task['seq_len'] + max_dec_len = task["seq_len"] else: max_dec_len = self.args.max_dec_len - self.share_inputs['max_length'][idx:idx + 1] = max_dec_len - self.share_inputs['stop_flags'][idx:idx + 1] = False + self.share_inputs["max_length"][idx : idx + 1] = max_dec_len + self.share_inputs["stop_flags"][idx : idx + 1] = False + + + self.share_inputs['first_token_ids'][idx:idx + 1] = self.share_inputs['input_ids'][idx:idx + 1, :1] + self.share_inputs["ori_seq_lens_encoder"][idx:idx + 1] = length if "infer_seed" in task: - self.share_inputs['infer_seed'][idx:idx + 1] = task['infer_seed'] + self.share_inputs["infer_seed"][idx : idx + 1] = task["infer_seed"] + + encoder_block_num = len(task["block_tables"]) + self.share_inputs["encoder_block_lens"][idx : idx + 1] = encoder_block_num + self.share_inputs["block_tables"][idx : idx + 1, :] = -1 + self.share_inputs["block_tables"][idx : idx + 1, :encoder_block_num] = np.array( + task["block_tables"], dtype="int32" + ) + + if "stop_seqs_len" in task: + stop_seqs_num = len(task["stop_seqs_len"]) + for i in range(stop_seqs_num, self.max_stop_seqs_num): + task["stop_seqs_len"].append(0) + self.share_inputs["stop_seqs_len"][:] = np.array(task["stop_seqs_len"], dtype="int32") + self.share_inputs["stop_seqs"][:stop_seqs_num, : len(task["stop_seqs"][0])] = np.array( + task["stop_seqs"], dtype="int64" + ) - encoder_block_num = len(task['block_tables']) - self.share_inputs['encoder_block_lens'][idx:idx + 1] = encoder_block_num - self.share_inputs["block_tables"][idx:idx + 1, :] = -1 - self.share_inputs["block_tables"][idx:idx + 1, :encoder_block_num] = np.array( - task['block_tables'], dtype="int32") + if self.is_speculate_decoding: + self.share_inputs["draft_tokens"][idx : idx + 1] = np.zeros( + [self.speculate_config.speculate_max_draft_token_num + 1] + ) + self.share_inputs["actual_draft_token_num"][idx : idx + 1] = np.array( + [self.speculate_config.speculate_max_draft_token_num] + ) def step_cuda(self, seq_lens_this_time): """ step cuda """ + # whether speculate decoding + if self.is_speculate_decoding: + speculate_step_token_num = self.speculate_config.speculate_max_draft_token_num + 1 + else: + speculate_step_token_num = 0 + step_paddle(self.share_inputs['stop_flags'], seq_lens_this_time, self.share_inputs['step_seq_lens_encoder'], self.share_inputs['seq_lens_encoder'], @@ 
-303,8 +426,8 @@ def step_cuda(self, seq_lens_this_time): self.share_inputs['need_block_len'], self.share_inputs['used_list_len'], self.share_inputs['free_list'], self.share_inputs['free_list_len'], self.share_inputs['input_ids'], self.share_inputs['pre_ids'], - self.share_inputs['step_idx'], self.share_inputs['next_tokens'], - self.args.block_size, self.args.enc_dec_block_num, self.args.first_token_id) + self.share_inputs['step_idx'], self.share_inputs['next_tokens'], self.share_inputs['first_token_ids'], + self.args.block_size, self.args.enc_dec_block_num, 0) def initialize_engine_ready_check_flag(self): """ @@ -316,10 +439,11 @@ def initialize_engine_ready_check_flag(self): """ engine_ready_check_flag = np.zeros([1], dtype=np.int32) shm_engine_ready_check_flag = shared_memory.SharedMemory( - name=self.config.get_unique_name("engine_ready_check_flag")) - engine_ready_check_flag_array = np.ndarray(engine_ready_check_flag.shape, - dtype=engine_ready_check_flag.dtype, - buffer=shm_engine_ready_check_flag.buf) + name=self.config.get_unique_name("engine_ready_check_flag") + ) + engine_ready_check_flag_array = np.ndarray( + engine_ready_check_flag.shape, dtype=engine_ready_check_flag.dtype, buffer=shm_engine_ready_check_flag.buf + ) return shm_engine_ready_check_flag, engine_ready_check_flag_array def initialize_engine_live_flag(self): @@ -329,9 +453,9 @@ def initialize_engine_live_flag(self): Returns: infer_live_flag_shm: infer live flag """ - infer_live_flag_shm = shared_memory.SharedMemory(create=True, - size=1, - name=self.config.get_unique_name("shm_flag_infer_{}_live".format(self.rank))) + infer_live_flag_shm = shared_memory.SharedMemory( + create=True, size=1, name=self.config.get_unique_name("shm_flag_infer_{}_live".format(self.rank)) + ) return infer_live_flag_shm def initialize_engine_healthy_recorded_time_flag(self): @@ -343,10 +467,13 @@ def initialize_engine_healthy_recorded_time_flag(self): """ engine_healthy_recorded_time = np.zeros([1], dtype=float) shm_engine_healthy_recorded_time = shared_memory.SharedMemory( - name=self.config.get_unique_name("engine_healthy_recorded_time")) - engine_healthy_recorded_time_array = np.ndarray(engine_healthy_recorded_time.shape, - dtype=engine_healthy_recorded_time.dtype, - buffer=shm_engine_healthy_recorded_time.buf) + name=self.config.get_unique_name("engine_healthy_recorded_time") + ) + engine_healthy_recorded_time_array = np.ndarray( + engine_healthy_recorded_time.shape, + dtype=engine_healthy_recorded_time.dtype, + buffer=shm_engine_healthy_recorded_time.buf, + ) return shm_engine_healthy_recorded_time, engine_healthy_recorded_time_array def run(self): @@ -355,35 +482,38 @@ def run(self): """ flag_array = np.zeros([1], dtype=np.int32) shm_flag_broadcast = shared_memory.SharedMemory( - name=self.config.get_unique_name("shm_pd_infer_flag_broadcast")) - flag_broadcast_array = np.ndarray(flag_array.shape, - dtype=flag_array.dtype, - buffer=shm_flag_broadcast.buf) + name=self.config.get_unique_name("shm_pd_infer_flag_broadcast") + ) + flag_broadcast_array = np.ndarray(flag_array.shape, dtype=flag_array.dtype, buffer=shm_flag_broadcast.buf) flag_array = np.zeros([self.nranks], dtype=np.int32) shm_flag_ready = shared_memory.SharedMemory(name=self.config.get_unique_name("shm_flag_infer_ready")) - flag_ready_array = np.ndarray(flag_array.shape, - dtype=flag_array.dtype, - buffer=shm_flag_ready.buf) + flag_ready_array = np.ndarray(flag_array.shape, dtype=flag_array.dtype, buffer=shm_flag_ready.buf) flag_ready_array[self.rank] = 1 flag_array = 
np.zeros([1], dtype=np.int32) - shm_flag_has_block_step = shared_memory.SharedMemory(name=self.config.get_unique_name("shm_flag_has_block_step")) - flag_has_block_step_array = np.ndarray(flag_array.shape, - dtype=flag_array.dtype, - buffer=shm_flag_has_block_step.buf) + shm_flag_has_block_step = shared_memory.SharedMemory( + name=self.config.get_unique_name("shm_flag_has_block_step") + ) + flag_has_block_step_array = np.ndarray( # noqa: F841 + flag_array.shape, dtype=flag_array.dtype, buffer=shm_flag_has_block_step.buf + ) use_custom_health_checker = self.config.use_custom_health_checker if use_custom_health_checker: - shm_engine_ready_check_flag_array, engine_ready_check_flag_array = self.initialize_engine_ready_check_flag() + ( + shm_engine_ready_check_flag_array, + engine_ready_check_flag_array, + ) = self.initialize_engine_ready_check_flag() engine_ready_check_flag_array[0] = 1 - shm_engine_healthy_recorded_time_array, engine_healthy_recorded_time_array = self.initialize_engine_healthy_recorded_time_flag() + ( + shm_engine_healthy_recorded_time_array, + engine_healthy_recorded_time_array, + ) = self.initialize_engine_healthy_recorded_time_flag() engine_healthy_recorded_time_array[0] = time.time() - infer_live_flag_shm = self.initialize_engine_live_flag() - infer_seed_increment = paddle.full(shape=[self.args.max_batch_size, 1], - fill_value=4, - dtype="int64") - thread_executor = ThreadPoolExecutor(max_workers=1) + infer_live_flag_shm = self.initialize_engine_live_flag() # noqa: F841 + infer_seed_increment = paddle.full(shape=[self.args.max_batch_size, 1], fill_value=4, dtype="int64") + thread_executor = ThreadPoolExecutor(max_workers=1) # noqa: F841 seq_lens_this_time = None real_bsz = None @@ -391,14 +521,17 @@ def run(self): if use_custom_health_checker: engine_healthy_recorded_time_array[0] = time.time() - if self.rank == 0: + if self.rank % self.config.mp_num_per_node == 0: if not self.infer_queue.empty(): - flag_broadcast_array[0] = 1 + if self.config.nnode > 1: + self.infer_queue.read_finish_flag.set(1) + else: + flag_broadcast_array[0] = 1 if self.nranks > 1: paddle.distributed.barrier() - if flag_broadcast_array[0] == 1: + if flag_broadcast_array[0] == 1 or self.infer_queue.read_finish_flag.get() == 1: logger.info(f'rank: {self.rank} start to get') if seq_lens_this_time is not None: self.share_inputs["seq_lens_this_time"][:real_bsz] = seq_lens_this_time @@ -406,32 +539,36 @@ def run(self): tasks, read_finish = self.infer_queue.get() if read_finish: flag_broadcast_array[0] = 0 + self.infer_queue.read_finish_flag.set(0) req_dicts = [] for req_dict, bsz in tasks: real_bsz = int(bsz) req_dicts.extend(req_dict) - logger.info( - f'rank: {self.rank}, real_bsz: {real_bsz}, query_num: {len(req_dicts)}' - ) + logger.info(f"rank: {self.rank}, real_bsz: {real_bsz}, query_num: {len(req_dicts)}") self.dy_input_preprocess(req_dicts) - seq_lens_this_time = copy.deepcopy( - self.share_inputs['seq_lens_this_time'][:real_bsz]) - self.infer_engine.seq_lens_handle.share_external_data( - seq_lens_this_time) - self.share_inputs['not_need_stop'][0] = True + seq_lens_this_time = copy.deepcopy(self.share_inputs["seq_lens_this_time"][:real_bsz]) + self.infer_engine.seq_lens_handle.share_external_data(seq_lens_this_time) + self.share_inputs["not_need_stop"][0] = True - if not self.share_inputs['not_need_stop']: + if not self.share_inputs["not_need_stop"]: if self.nranks > 1: paddle.distributed.barrier() time.sleep(0.001) continue + if self.proposer is not None: + self.proposer.run( + self.share_inputs, + 
real_batch_size=seq_lens_this_time.shape[0], + seq_lens_this_time=seq_lens_this_time, + ) + self.infer_engine.predictor.run() - self.share_inputs['infer_seed'].add_(infer_seed_increment) - self.share_inputs['infer_seed'][:] %= self.MAX_INFER_SEED + self.share_inputs["infer_seed"].add_(infer_seed_increment) + self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED if self.free_list_len > 0: self.step_cuda(seq_lens_this_time) @@ -444,6 +581,7 @@ class InferenceEngine(object): model_dir (string): root directory of inference model mp_degree (int): model parallel size """ + def __init__(self, model_dir, share_inputs, cache_kvs, config, mp_degree=1): self.config = config self.model_dir = model_dir @@ -466,36 +604,24 @@ def _init_predictor(self): """ predictor init """ - device_id = self.rank % 8 - self.model_file = os.path.join(self.model_dir, f"model.pdmodel") - self.param_file = os.path.join(self.model_dir, f"model.pdiparams") + device_id = self.rank % self.config.mp_num_per_node + if use_pir_api(): + self.model_file = os.path.join(self.model_dir, f"model.json") + self.param_file = os.path.join(self.model_dir, f"model.pdiparams") + else: + self.model_file = os.path.join(self.model_dir, f"model.pdmodel") + self.param_file = os.path.join(self.model_dir, f"model.pdiparams") config = paddle.inference.Config(self.model_file, self.param_file) - config.switch_ir_optim(False) config.enable_use_gpu(100, device_id) - # distributed config - if self.mp_degree > 1: - trainer_endpoints = fleet.worker_endpoints() - current_endpoint = trainer_endpoints[self.rank] - dist_config = config.dist_config() - dist_config.set_ranks(self.nranks, self.rank) - dist_config.set_endpoints(trainer_endpoints, current_endpoint) - dist_config.enable_dist_model(True) - if self.config.distributed_config_path: - dist_config.set_comm_init_config(self.config.distributed_config_path) - else: - raise Exception("Please set DISTRIBUTED_CONFIG env variable.") - logger.warning( - f"Use default distributed config, please set env DISTRIBUTED_CONFIG" - ) - dist_config.set_comm_init_config( - os.path.join(Dir_Path + "/config", "rank_mapping_mp{}.csv".format(self.nranks))) + if use_pir_api(): + config.enable_new_executor() + config.enable_new_ir() - config.set_dist_config(dist_config) self.predictor = paddle.inference.create_predictor(config) self.input_names = self.predictor.get_input_names() - self.seq_lens_handle = self.predictor.get_input_handle('seq_lens_this_time') + self.seq_lens_handle = self.predictor.get_input_handle("seq_lens_this_time") def share_data(self): """ @@ -511,69 +637,25 @@ def share_data(self): input_tensor = self.predictor.get_input_handle(name) input_tensor.share_external_data(self.share_inputs[name]) - def predict(self, real_bsz): - """ - predict - """ - seq_lens_this_time = copy.deepcopy( - self.share_inputs['seq_lens_this_time'][:real_bsz]) - self.seq_lens_handle.share_external_data(seq_lens_this_time) - self.share_inputs['not_need_stop'][0] = True - while self.share_inputs['not_need_stop']: - self.predictor.run() - self.share_inputs["seq_lens_this_time"][:real_bsz] = seq_lens_this_time def parse_args(): """ parse args from command line """ - parser = argparse.ArgumentParser("Deploy LLM Inference") - parser.add_argument('-m', - '--model_dir', - type=str, - default='./output', - help='model dir') - parser.add_argument('-mp', - '--mp_degree', - type=int, - default=1, - help='mp degree') - parser.add_argument('-mbs', - '--max_batch_size', - type=int, - default=34, - help='max batch size') - 
parser.add_argument('--max_block_num', type=int, default=2000) + parser = argparse.ArgumentParser("FastDeploy LLM Inference") + parser.add_argument("-m", "--model_dir", type=str, default="./output", help="model dir") + parser.add_argument("-mp", "--mp_degree", type=int, default=1, help="mp degree") + parser.add_argument("-mbs", "--max_batch_size", type=int, default=34, help="max batch size") + parser.add_argument("--max_block_num", type=int, default=2000) parser.add_argument("--block_size", type=int, default=128) - parser.add_argument('--max_seq_len', - type=int, - default=3072, - help='max_seq_len') - parser.add_argument('--max_dec_len', - type=int, - default=1024, - help='max_dec_len') - parser.add_argument('--use_cache_kv_int8', - type=int, - default=0, - help='use cache kv int8') - parser.add_argument('--dtype', - type=str, - default="bfloat16", - help='input dtype') - parser.add_argument('--enc_dec_block_num', - type=int, - default=1, - help="encoder's decoder num") - parser.add_argument('--block_ratio', - type=float, - default=0.7, - help="block ratio") - parser.add_argument('--first_token_id', - type=int, - default=1, - help="first token id") + parser.add_argument("--max_seq_len", type=int, default=3072, help="max_seq_len") + parser.add_argument("--max_dec_len", type=int, default=1024, help="max_dec_len") + parser.add_argument("--use_cache_kv_int8", type=int, default=0, help="use cache kv int8") + parser.add_argument("--dtype", type=str, default="bfloat16", help="input dtype") + parser.add_argument("--enc_dec_block_num", type=int, default=1, help="encoder's decoder num") + parser.add_argument("--block_ratio", type=float, default=0.7, help="block ratio") + parser.add_argument("--first_token_id", type=int, default=1, help="first token id") args = parser.parse_args() return args diff --git a/llm/server/server/server/engine/task_queue_manager.py b/llm/server/server/server/engine/task_queue_manager.py index 475365d47fba..a0b70c88b4a7 100644 --- a/llm/server/server/server/engine/task_queue_manager.py +++ b/llm/server/server/server/engine/task_queue_manager.py @@ -49,8 +49,9 @@ def __init__(self, rank=0, mp_num=8, port=56666): QueueManager.register('get_barrier1') QueueManager.register('get_barrier2') QueueManager.register('get_queue') + QueueManager.register('get_read_finish_flag') - self.client_manager = QueueManager(address=('127.0.0.1', port), + self.client_manager = QueueManager(address=(os.getenv("POD_0_IP","127.0.0.1"), port), authkey=b'infer_queue' ) self.client_manager.connect() @@ -60,6 +61,7 @@ def __init__(self, rank=0, mp_num=8, port=56666): self.barrier1 = self.client_manager.get_barrier1() self.barrier2 = self.client_manager.get_barrier2() self.queue = self.client_manager.get_queue() + self.read_finish_flag = self.client_manager.get_read_finish_flag() self.mp_num = mp_num self.rank = rank self.position = 1 << rank @@ -155,7 +157,9 @@ def launch_queue_service(port, num_workers): QueueManager.register('get_barrier2', callable=lambda: barrier2) q = Queue() QueueManager.register("get_queue", callable=lambda: q) - m = QueueManager(address=('127.0.0.1', port), authkey=b'infer_queue') + read_finish_flag = Value("i", 0) + QueueManager.register("get_read_finish_flag", callable=lambda: read_finish_flag, proxytype=ValueProxy) + m = QueueManager(address=(os.getenv("POD_0_IP","127.0.0.1"), port), authkey=b'infer_queue') s = m.get_server() logger.info("launch queue service success") s.serve_forever() diff --git a/llm/server/server/server/engine/token_processor.py 
b/llm/server/server/server/engine/token_processor.py index 507a3d43bdf9..1213a9384b77 100644 --- a/llm/server/server/server/engine/token_processor.py +++ b/llm/server/server/server/engine/token_processor.py @@ -20,8 +20,9 @@ from datetime import datetime import numpy as np -from paddlenlp_ops import get_output +from paddlenlp_ops import get_output, speculate_get_output from server.utils import datetime_diff, model_server_logger, monitor_logger +from paddlenlp.utils.env import MAX_DRAFT_TOKENS, SPECULATE_MAX_BSZ class TokenProcessor(object): @@ -37,7 +38,12 @@ def __init__(self, cfg): self.all_tokens = [[] for _ in range(self.cfg.max_batch_size)] self.tokens_counter = Counter() - self.output_tokens = paddle.full(shape=[self.cfg.max_batch_size + 2, 1], fill_value=2, dtype="int64") + + self.is_speculate_decoding = self.cfg.get_speculate_config().speculate_method != "None" + if self.is_speculate_decoding: + self.output_tokens = paddle.full(shape=[SPECULATE_MAX_BSZ * MAX_DRAFT_TOKENS + SPECULATE_MAX_BSZ + 2, 1], fill_value=2, dtype="int64") + else: + self.output_tokens = paddle.full(shape=[self.cfg.max_batch_size + 2, 1], fill_value=2, dtype="int64") self.worker = None self.record_time_interval = int(os.getenv("RECORD_TIME_INTERVAL", "600")) @@ -77,10 +83,14 @@ def process_sampling_results(self): try: rank_id = 0 is_blocking = True - get_output(self.output_tokens, rank_id, is_blocking) + if self.is_speculate_decoding: + speculate_get_output(self.output_tokens, rank_id, is_blocking) + else: + get_output(self.output_tokens, rank_id, is_blocking) if self.output_tokens[0, 0] == -2: continue + self._process_batch_output() except Exception as e: model_server_logger.info("while get input_data error: {0} {1}".format(e, str(traceback.format_exc()))) @@ -101,14 +111,14 @@ def postprocess(self, batch_result, exist_finished_task=False): with open(result_file, "a") as f: f.write("{}\n".format(result)) - def _get_single_result(self, i, task_id, token_id, task): + def _get_single_result(self, i, task_id, token_ids, task): """ processing single results Args: i (int): batch index task_id (str): task id - token_id (int): token id + token_ids (list): token id task (dict): task information Returns: @@ -121,7 +131,7 @@ def _get_single_result(self, i, task_id, token_id, task): result = { "req_id": task_id, "is_end": 0, - "token_ids": [token_id], + "token_ids": token_ids, "send_idx": self.tokens_counter[task_id], "inference_time_cost": inference_time_cost, "infer_seed": task["infer_seed"], @@ -137,26 +147,31 @@ def _get_single_result(self, i, task_id, token_id, task): result[key] = str(task[key]) # fill some extra information - if token_id in task["eos_token_ids"]: - result["is_end"] = 1 - result["token_ids"] = [] - result["tokens_all_num"] = len(self.all_tokens[i]) + 1 - result["tokens_all_ids"] = self.all_tokens[i] - - info_dict = {} - info_dict["req_id"] = task["req_id"] - info_dict["input_token_num"] = len(task["input_ids"]) - info_dict["output_token_num"] = len(self.all_tokens[i]) - if hasattr(task, "preprocess_start_time") and hasattr(task, "preprocess_end_time"): - info_dict["preprocess_cost_time"] = datetime_diff(task["preprocess_start_time"], - task["preprocess_end_time"]) - if hasattr(task, "preprocess_end_time") and hasattr(task, "schedule_start_time"): - info_dict["cache_waiting_cost_time"] = datetime_diff(task["preprocess_end_time"], - task["schedule_start_time"]) - info_dict["inference_time_cost"] = task["inference_time_cost"] - info_dict["version"] = "4.6" - info_dict["timestamp"] = time.time() - 
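The speculative-decoding output buffer allocated above has a fixed layout: two header slots, SPECULATE_MAX_BSZ accepted-token-count slots, then MAX_DRAFT_TOKENS reserved slots per request; the unpacking in _process_batch_output below follows the same layout. A minimal NumPy sketch of that unpacking (the two constants are placeholders standing in for the values imported from paddlenlp.utils.env):

import numpy as np

# Placeholder values; the real constants come from paddlenlp.utils.env.
SPECULATE_MAX_BSZ = 256
MAX_DRAFT_TOKENS = 6


def parse_speculative_output(tokens: np.ndarray) -> list:
    """Unpack the flat speculative output buffer into per-request token lists.

    Buffer layout (one column wide, as allocated above):
      tokens[0, 0]                          status flag (-2 means nothing to read yet)
      tokens[1, 0]                          real batch size
      tokens[2 : 2 + SPECULATE_MAX_BSZ]     accepted-token count per batch slot
      tokens[2 + SPECULATE_MAX_BSZ + i * MAX_DRAFT_TOKENS : ...]
                                            accepted tokens of request i
    """
    batch = int(tokens[1, 0])
    accept_num = tokens[2 : 2 + batch, 0]
    results = []
    for i in range(batch):
        start = 2 + SPECULATE_MAX_BSZ + i * MAX_DRAFT_TOKENS
        results.append(tokens[start : start + int(accept_num[i]), 0].tolist())
    return results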
monitor_logger.info(f"{info_dict}") + result["token_ids"] = [] + for token_id in token_ids: + if token_id in task["eos_token_ids"]: + result["is_end"] = 1 + result["token_ids"] = [] + result["tokens_all_num"] = len(self.all_tokens[i]) + 1 + result["tokens_all_ids"] = self.all_tokens[i] + + info_dict = {} + info_dict["req_id"] = task["req_id"] + info_dict["input_token_num"] = len(task["input_ids"]) + info_dict["output_token_num"] = len(self.all_tokens[i]) + if hasattr(task, "preprocess_start_time") and hasattr(task, "preprocess_end_time"): + info_dict["preprocess_cost_time"] = datetime_diff(task["preprocess_start_time"], + task["preprocess_end_time"]) + if hasattr(task, "preprocess_end_time") and hasattr(task, "schedule_start_time"): + info_dict["cache_waiting_cost_time"] = datetime_diff(task["preprocess_end_time"], + task["schedule_start_time"]) + info_dict["inference_time_cost"] = task["inference_time_cost"] + info_dict["version"] = "OpenSource" + info_dict["timestamp"] = time.time() + monitor_logger.info(f"{info_dict}") + break + else: + result["token_ids"].append(token_id) return result @@ -177,7 +192,10 @@ def _process_batch_output(self): """ tokens = self.output_tokens.numpy() batch = self.output_tokens[1, 0] - tokens = tokens[2:batch + 2] + if not self.is_speculate_decoding: + tokens = tokens[2:batch + 2] + else: + accept_num = tokens[2:batch + 2] batch_result = list() exist_finished_task = False @@ -185,25 +203,31 @@ def _process_batch_output(self): if self.resource_manager.stop_flags[i]: continue - token_id = int(tokens[i, 0]) - if token_id < 0: + if not self.is_speculate_decoding: + token_ids = [int(tokens[i, 0])] + else: + token_ids = tokens[2 + SPECULATE_MAX_BSZ + i * MAX_DRAFT_TOKENS: 2 + SPECULATE_MAX_BSZ + i * MAX_DRAFT_TOKENS + accept_num[i, 0], 0].tolist() + + if any(token_id < 0 for token_id in token_ids): continue task = self.resource_manager.tasks_list[i] task_id = task["req_id"] - result = self._get_single_result(i, task_id, token_id, task) - - self.tokens_counter[task_id] += 1 - if token_id not in task["eos_token_ids"]: - self.all_tokens[i].append(token_id) - - self.number_of_output_tokens += 1 - if token_id in task["eos_token_ids"]: - self._recycle_resources(task_id, i, task) - model_server_logger.info("req_id: {0} finished".format(task_id)) - model_server_logger.info(f"{self.resource_manager.info()}") - exist_finished_task = True + result = self._get_single_result(i, task_id, token_ids, task) + + for token_id in token_ids: + self.tokens_counter[task_id] += 1 + if token_id not in task["eos_token_ids"]: + self.all_tokens[i].append(token_id) + + self.number_of_output_tokens += 1 + if token_id in task["eos_token_ids"]: + self._recycle_resources(task_id, i, task) + model_server_logger.info("req_id: {0} finished".format(task_id)) + model_server_logger.info(f"{self.resource_manager.info()}") + exist_finished_task = True + break batch_result.append(result) self.postprocess(batch_result, exist_finished_task) @@ -228,7 +252,10 @@ def process_sampling_results(self): while self._is_running: try: rank_id = 0 - get_output(self.output_tokens, rank_id, self._is_blocking) + if self.is_speculate_decoding: + speculate_get_output(self.output_tokens, rank_id, self._is_blocking) + else: + get_output(self.output_tokens, rank_id, self._is_blocking) if self.output_tokens[0, 0] == -2: continue diff --git a/llm/server/server/server/http_server/api.py b/llm/server/server/server/http_server/api.py index df9c066284f4..2e01ae039dba 100644 --- a/llm/server/server/server/http_server/api.py +++ 
b/llm/server/server/server/http_server/api.py @@ -31,6 +31,7 @@ class Req(BaseModel): req_id: str = Field(default_factory=lambda: str(uuid.uuid4())) input_ids: Optional[List[int]] = None text: Optional[str] = None + stop_sequences: Optional[List] = None messages: Optional[List] = None max_dec_len: Optional[int] = None seq_len: Optional[int] = None diff --git a/llm/server/server/server/triton_server.py b/llm/server/server/server/triton_server.py index 601a1b017907..d565b4ce872b 100644 --- a/llm/server/server/server/triton_server.py +++ b/llm/server/server/server/triton_server.py @@ -98,11 +98,35 @@ def _push_mode_sender_thread(self): except Exception as e: model_server_logger.error("Unexcepted error happend: {}, {}".format(e, str(traceback.format_exc()))) + def _cache_special_tokens(self, batch_result): + for i in range(len(batch_result)): + is_end = batch_result[i].get("is_end", 0) + token_ids = batch_result[i]["token_ids"] + if is_end != 1: + if batch_result[i]["req_id"] not in self.token_buffer: + self.token_buffer[batch_result[i]["req_id"]] = list() + self.score_buffer[batch_result[i]["req_id"]] = list() + self.token_buffer[batch_result[i]["req_id"]].extend(token_ids) + self.score_buffer[batch_result[i]["req_id"]].extend(batch_result[i].get("token_scores", [])) + batch_result[i]["token_ids"] = [] + if "token_scores" in batch_result[i]: + batch_result[i]["token_scores"] = [] + else: + if batch_result[i]["req_id"] in self.token_buffer: + batch_result[i]["token_ids"] = self.token_buffer[batch_result[i] + ["req_id"]] + batch_result[i]["token_ids"] + del self.token_buffer[batch_result[i]["req_id"]] + if "token_scores" in batch_result[i]: + batch_result[i]["token_scores"] = self.score_buffer[batch_result[i] + ["req_id"]] + batch_result[i]["token_scores"] + del self.score_buffer[batch_result[i]["req_id"]] + def postprocess(self, batch_result, exist_finished_task=False): """ single postprocess for triton """ try: + self._cache_special_tokens(batch_result) self.cached_generated_tokens.put(batch_result) except Exception as e: model_server_logger.info( @@ -168,7 +192,7 @@ def initialize(self, args): base_config = Config() self.cfg = TritonConfig(base_config) - self.cfg.print(file="log/deploy_init.info") + self.cfg.print(file="log/fastdeploy_init.info") # init engine self.token_processor = TritonTokenProcessor(self.cfg, self) @@ -177,7 +201,9 @@ def initialize(self, args): self.engine.start() model_server_logger.info("Create engine success") - self._initialize_push_mode() + # Master node only + if self.cfg.nnode == 1 or os.getenv('POD_0_IP',"127.0.0.1") == self.cfg.host_ip: + self._initialize_push_mode() model_server_logger.info("Init triton server success") diff --git a/llm/server/server/server/triton_server_helper.py b/llm/server/server/server/triton_server_helper.py index b299cd4204f8..9ca3a7e4ae83 100644 --- a/llm/server/server/server/triton_server_helper.py +++ b/llm/server/server/server/triton_server_helper.py @@ -72,7 +72,7 @@ def check_infer_engine_process(): return: status: bool, True if process is alive else False """ - mp_num = int(env_config.mp_num) + mp_num = int(env_config.mp_num_per_node) for i in range(mp_num): try: infer_live_flag_shm = shared_memory.SharedMemory(name=env_config.get_unique_name("shm_flag_infer_{}_live".format(i))) diff --git a/llm/utils/data.py b/llm/utils/data.py index db9d417743d0..dbecb49778e6 100644 --- a/llm/utils/data.py +++ b/llm/utils/data.py @@ -59,11 +59,13 @@ def get_convert_example(model): "gpt", "yuan", "jamba", + "deepseek_v2", + "deepseek_v3", ]: 
return convert_example_common else: raise ValueError( - f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm, bloom, llama, qwen, mixtral, gemma, qwen2, qwen2_moe, yuan, jamba", + f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm, bloom, llama, qwen, mixtral, gemma, qwen2, qwen2_moe, yuan, jamba,deepseek_v2, deepseek_v3", ) diff --git a/paddlenlp/datasets/dataset.py b/paddlenlp/datasets/dataset.py index cf810a5196fc..03e73035b5b2 100644 --- a/paddlenlp/datasets/dataset.py +++ b/paddlenlp/datasets/dataset.py @@ -20,6 +20,9 @@ from collections import namedtuple from itertools import islice +# Add this for extremely slow conection to hf sever even for local dataset. +os.environ["HF_UPDATE_DOWNLOAD_COUNTS"] = "False" + import datasets from multiprocess import Pool, RLock @@ -117,6 +120,7 @@ def load_from_hf(path, name=None, splits=None, **kwargs): hf_datasets = load_hf_dataset(path, name=name, **kwargs) else: hf_datasets = load_hf_dataset(path, name=name, split=splits, **kwargs) + except FileNotFoundError: raise FileNotFoundError("Couldn't find the dataset script for '" + path + "' on PaddleNLP or HuggingFace") else: diff --git a/paddlenlp/experimental/autonlp/text_classification.py b/paddlenlp/experimental/autonlp/text_classification.py index 5df473387918..25c2d39bea07 100644 --- a/paddlenlp/experimental/autonlp/text_classification.py +++ b/paddlenlp/experimental/autonlp/text_classification.py @@ -47,6 +47,7 @@ from ...utils.log import logger from .auto_trainer_base import AutoTrainerBase from .utils import UTCLoss +from .utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX class AutoTrainerForTextClassification(AutoTrainerBase): @@ -560,16 +561,16 @@ def export(self, export_path: str, trial_id: Optional[str] = None, compress: boo if os.path.exists(default_export_path): if "utc" in model_config["model_name_or_path"]: files = [ - "model.pdiparams", - "model.pdmodel", + f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}", + f"model{PADDLE_INFERENCE_MODEL_SUFFIX}", "tokenizer_config.json", "vocab.txt", "taskflow_config.json", ] else: files = [ - "model.pdiparams", - "model.pdmodel", + f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}", + f"model{PADDLE_INFERENCE_MODEL_SUFFIX}", "tokenizer_config.json", "vocab.txt", "taskflow_config.json", @@ -735,8 +736,8 @@ def _batch_generator_func(): executor=exe, batch_generator=_batch_generator_func, model_dir=export_path, - model_filename="model.pdmodel", - params_filename="model.pdiparams", + model_filename=f"model{PADDLE_INFERENCE_MODEL_SUFFIX}", + params_filename=f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}", batch_size=batch_size, batch_nums=batch_nums, scope=None, @@ -757,8 +758,8 @@ def _batch_generator_func(): post_training_quantization.quantize() post_training_quantization.save_quantized_model( save_model_path=compress_path, - model_filename="model.pdmodel", - params_filename="model.pdiparams", + model_filename=f"model{PADDLE_INFERENCE_MODEL_SUFFIX}", + params_filename=f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}", ) paddle.disable_static() diff --git a/paddlenlp/experimental/transformers/deepseek_v2/__init__.py b/paddlenlp/experimental/transformers/deepseek_v2/__init__.py new file mode 100644 index 000000000000..c2a7f656c636 --- /dev/null +++ b/paddlenlp/experimental/transformers/deepseek_v2/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .modeling import * diff --git a/paddlenlp/experimental/transformers/deepseek_v2/modeling.py b/paddlenlp/experimental/transformers/deepseek_v2/modeling.py new file mode 100644 index 000000000000..6d1364857ae5 --- /dev/null +++ b/paddlenlp/experimental/transformers/deepseek_v2/modeling.py @@ -0,0 +1,1240 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +from functools import partial +from typing import Tuple + +import numpy as np +import paddle +from paddle import nn +from paddle.distributed import fleet +from paddle.nn.quant import weight_quantize + +from paddlenlp.experimental.transformers.fused_transformer_layers import ( + FusedBlockMultiTransformer, + FusedBlockMultiTransformerWeightOnly, + FusedMultiTransformerConfig, + MLAConfig, + MoeConfig, + SpeculateConfig, +) +from paddlenlp.experimental.transformers.generation_utils import ( + GenerationBlockInferenceModel, +) +from paddlenlp.experimental.transformers.utils import infererence_model_from_pretrained +from paddlenlp.transformers import DeepseekV2Config, DeepseekV2PretrainedModel +from paddlenlp.transformers.deepseek_v2.modeling import ( + DeepseekV2LMHead, + yarn_find_correction_range, + yarn_get_mscale, + yarn_linear_ramp_mask, +) +from paddlenlp.transformers.model_outputs import ( + BaseModelOutputWithPastAndCrossAttentions, +) +from paddlenlp.transformers.model_utils import ( + dy2st_nocheck_guard_context, + register_base_model, +) +from paddlenlp.utils.log import logger + +__all__ = ["DeepseekV2ForCausalLMBlockInferenceModel"] + + +class DeepseekScalingRotaryEmbedding(nn.Layer): + """RotaryEmbedding extended with YaRN method. + + Credits to Peng et al. github.com/jquesnelle/yarn + """ + + def __init__( + self, + rotary_dim: int, + max_position_embeddings: int, + base: int, + scaling_factor: float, + *, + extrapolation_factor: float = 1, + attn_factor: float = 1, + beta_fast: int = 32, + beta_slow: int = 1, + mscale: float = 1, + mscale_all_dim: float = 0, + ) -> None: + super().__init__() + self._dtype = paddle.get_default_dtype() + + self.rotary_dim = rotary_dim + self.max_position_embeddings = max_position_embeddings + self.base = base + + self.scaling_factor = scaling_factor + self.extrapolation_factor = extrapolation_factor + self.attn_factor = attn_factor + self.beta_fast = beta_fast + self.beta_slow = beta_slow + # Get n-d magnitude scaling corrected for interpolation. 
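yarn_get_mscale, yarn_find_correction_range and yarn_linear_ramp_mask are imported from the dense DeepseekV2 modeling file rather than defined here. For orientation, a sketch of the magnitude-scale helper, assuming it matches the reference DeepSeek-V2 YaRN implementation:

import math


def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
    # Reference YaRN magnitude correction; assumed to match the imported helper.
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


# The statement that follows combines it as
#   yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim) * attn_factor
# so that query/key magnitudes stay calibrated after the YaRN context extension.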
+ self.mscale = float( + yarn_get_mscale(self.scaling_factor, float(mscale)) + / yarn_get_mscale(self.scaling_factor, float(mscale_all_dim)) + * attn_factor + ) + + cos_cache, sin_cache = self._compute_cos_sin_cache() + + self.cos_cache: paddle.Tensor + self.register_buffer("cos_cache", cos_cache, persistable=True) + self.sin_cache: paddle.Tensor + self.register_buffer("sin_cache", sin_cache, persistable=True) + + def _compute_inv_freq(self, scaling_factor: float) -> paddle.Tensor: + pos_freqs = self.base ** (paddle.arange(0, self.rotary_dim, 2, dtype=paddle.float32) / self.rotary_dim) + + inv_freq_extrapolation = 1.0 / pos_freqs + inv_freq_interpolation = 1.0 / (scaling_factor * pos_freqs) + + low, high = yarn_find_correction_range( + self.beta_fast, self.beta_slow, self.rotary_dim, self.base, self.max_position_embeddings + ) + # Get n-d rotational scaling corrected for extrapolation + inv_freq_mask = (1 - yarn_linear_ramp_mask(low, high, self.rotary_dim // 2)) * self.extrapolation_factor + inv_freq = inv_freq_interpolation * (1 - inv_freq_mask) + inv_freq_extrapolation * inv_freq_mask + return inv_freq + + def _compute_cos_sin_cache(self) -> paddle.Tensor: + inv_freq = self._compute_inv_freq(self.scaling_factor) + t = paddle.arange(self.max_position_embeddings * self.scaling_factor, dtype=paddle.float32) + + freqs = paddle.outer(t, inv_freq) + emb = paddle.concat((freqs, freqs), axis=-1) + cos = emb.cos() * self.mscale + sin = emb.sin() * self.mscale + + return cos.cast(self._dtype), sin.cast(self._dtype) + + def forward( + self, + position_ids: paddle.Tensor, + query: paddle.Tensor, + key: paddle.Tensor, + ) -> Tuple[paddle.Tensor, paddle.Tensor]: + cos = self.cos_cache[position_ids].unsqueeze(1) + sin = self.sin_cache[position_ids].unsqueeze(1) + + def rotate_half(x): + """Rotates half the hidden axiss of the input.""" + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] + return paddle.concat([-x2, x1], axis=-1) # shape is the same as x + + s, h, d = query.shape + query = query.reshape([s, h, d // 2, 2]).transpose([0, 1, 3, 2]).reshape([s, h, d]) + + s, h, d = key.shape + key = key.reshape([s, h, d // 2, 2]).transpose([0, 1, 3, 2]).reshape([s, h, d]) + + query = (query * cos) + (rotate_half(query) * sin) + key = (key * cos) + (rotate_half(key) * sin) + + return query, key + + +class DeepseekV2RMSNorm(nn.Layer): + def __init__(self, config: DeepseekV2Config): + super().__init__() + self.eps = config.rms_norm_eps + self.weight = paddle.create_parameter( + shape=[config.hidden_size], + dtype=paddle.get_default_dtype(), + default_initializer=nn.initializer.Constant(1.0), + ) + + def forward(self, x): + return paddle.incubate.nn.functional.fused_rms_norm(x, self.weight, None, self.eps, begin_norm_axis=1)[0] + + +@register_base_model +class DeepseekV2BlockInferenceModel(DeepseekV2PretrainedModel): + def __init__(self, config: DeepseekV2Config, base_model_prefix: str): + super(DeepseekV2PretrainedModel, self).__init__(config) + self.base_model_prefix = base_model_prefix + + self.config = config + + self.max_seq_len = config.max_seq_len + + self.vocab_size = config.vocab_size + self.hidden_size = config.hidden_size + self.intermediate_size = config.intermediate_size + self.num_attention_heads = config.num_attention_heads + self.num_key_value_heads = config.num_key_value_heads + self.num_layers = config.num_hidden_layers + self.rms_norm_eps = config.rms_norm_eps + self.quant_type = config.quant_type + self.rope_theta = config.rope_theta + self.return_full_hidden_states = 
config.get("return_full_hidden_states", False) + + self.use_weight_only = False + if config.quant_type == "weight_only_int8": + self.use_weight_only = True + self.quant_algo = "weight_only_int8" + elif config.quant_type == "weight_only_int4": + self.use_weight_only = True + self.quant_algo = "weight_only_int4" + + if self.use_weight_only: + assert ( + self.quant_type == "weight_only_int8" or self.quant_type == "weight_only_int4" + ), f"Expected quant_type equal to 'weight_only_int8' or 'weight_only_int4', but received {self.quant_type}" + + self.first_k_dense_replace = config.first_k_dense_replace + self.n_routed_experts = config.n_routed_experts + + if config.tensor_parallel_degree > config.n_routed_experts: + raise ValueError( + f"Tensor parallel size {config.tensor_parallel_degree} is greater than " + f"the number of experts {config.n_routed_experts}." + ) + + if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0: + self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding( + self.vocab_size, + self.hidden_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()), + ) + else: + self.embed_tokens = nn.Embedding( + self.vocab_size, + self.hidden_size, + ) + + self.norm = DeepseekV2RMSNorm(config) + + scaling_factor = config.rope_scaling.get("factor", 1) + original_max_position = config.rope_scaling.get("original_max_position_embeddings", 4096) + extra_kwargs = { + k: v + for k, v in config.rope_scaling.items() + if k in ("extrapolation_factor", "attn_factor", "beta_fast", "beta_slow", "mscale", "mscale_all_dim") + } + self.rotary_emb = DeepseekScalingRotaryEmbedding( + config.qk_rope_head_dim, + original_max_position, + config.rope_theta, + scaling_factor, + **extra_kwargs, + ) + + # get ring_id + ring_id = -1 + try: + hcg = fleet.get_hybrid_communicate_group() + model_parallel_group = hcg.get_model_parallel_group() + ring_id = model_parallel_group.id + except: + pass + + ln_scale_attrs = [ + paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ln_scale") for idx in range(self.num_layers) + ] + + q_a_proj_weight_attrs = None + q_a_layernorm_weight_attrs = None + q_b_proj_weight_attrs = None + q_proj_weight_attrs = None + + if self.config.q_lora_rank is not None: + q_a_proj_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.q_a_proj_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + q_a_layernorm_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.q_a_layernorm_weight", + initializer=paddle.nn.initializer.Constant(value=1.0), + ) + for idx in range(self.num_layers) + ] + q_b_proj_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.q_b_proj_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + else: + q_proj_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.q_proj_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + + kv_a_proj_with_mqa_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.kv_a_proj_with_mqa_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + kv_a_layernorm_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.kv_a_layernorm_weight", + initializer=paddle.nn.initializer.Constant(value=1.0), + ) + for idx 
in range(self.num_layers) + ] + kv_b_proj_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.kv_b_proj_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + + out_proj_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.out_proj_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + ffn_ln_scale_attrs = [ + paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ffn_ln_scale") for idx in range(self.num_layers) + ] + ffn1_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.ffn1_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + ffn2_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.ffn2_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + gate_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.gate_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + + e_score_correction_bias_attrs = None + if self.base_model_prefix.startswith("deepseek_v3"): + e_score_correction_bias_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.e_score_correction_bias", + initializer=paddle.nn.initializer.Constant(value=0), + ) + if idx >= self.config.first_k_dense_replace + else None + for idx in range(self.num_layers) + ] + + shared_expert_ffn1_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn1_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + shared_expert_ffn2_weight_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn2_weight", + initializer=paddle.nn.initializer.Constant(value=0), + ) + for idx in range(self.num_layers) + ] + + q_proj_weight_scale_attrs = None + q_a_proj_weight_scale_attrs = None + q_b_proj_weight_scale_attrs = None + kv_a_proj_with_mqa_weight_scale_attrs = None + kv_b_proj_weight_scale_attrs = None + + out_proj_weight_scale_attrs = None + ffn1_weight_scale_attrs = None + ffn2_weight_scale_attrs = None + shared_expert_ffn1_weight_scale_attrs = None + shared_expert_ffn2_weight_scale_attrs = None + + if self.use_weight_only: + if self.config.q_lora_rank is not None: + q_proj_weight_scale_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.q_a_proj_weight_scale", + ) + for idx in range(self.num_layers) + ] + q_b_proj_weight_scale_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.q_b_proj_weight_scale", + ) + for idx in range(self.num_layers) + ] + else: + q_proj_weight_scale_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.q_proj_weight_scale", + ) + for idx in range(self.num_layers) + ] + + kv_a_proj_with_mqa_weight_scale_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.kv_a_proj_with_mqa_weight_scale", + ) + for idx in range(self.num_layers) + ] + kv_b_proj_weight_scale_attrs = [ + paddle.ParamAttr( + name=f"fuse{self.base_model_prefix}.{idx}.kv_b_proj_weight_scale", + ) + for idx in range(self.num_layers) + ] + + out_proj_weight_scale_attrs = [ + paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.out_proj_weight_scale") + for idx in range(self.num_layers) + ] + ffn1_weight_scale_attrs = [ + 
paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ffn1_weight_scale") + for idx in range(self.num_layers) + ] + ffn2_weight_scale_attrs = [ + paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.ffn2_weight_scale") + for idx in range(self.num_layers) + ] + shared_expert_ffn1_weight_scale_attrs = [ + paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn1_weight_scale") + for idx in range(self.num_layers) + ] + shared_expert_ffn2_weight_scale_attrs = [ + paddle.ParamAttr(name=f"fuse{self.base_model_prefix}.{idx}.shared_expert_ffn2_weight_scale") + for idx in range(self.num_layers) + ] + + mla_config = MLAConfig( + q_lora_rank=self.config.q_lora_rank, + kv_lora_rank=self.config.kv_lora_rank, + qk_nope_head_dim=self.config.qk_nope_head_dim, + qk_rope_head_dim=self.config.qk_rope_head_dim, + v_head_dim=self.config.v_head_dim, + mscale=yarn_get_mscale(scaling_factor, float(config.rope_scaling.get("mscale_all_dim", 1.0))), + q_proj_weight_attrs=q_proj_weight_attrs, + q_proj_weight_scale_attrs=q_proj_weight_scale_attrs, + q_a_proj_weight_attrs=q_a_proj_weight_attrs, + q_a_proj_weight_scale_attrs=q_a_proj_weight_scale_attrs, + q_a_layernorm_weight_attrs=q_a_layernorm_weight_attrs, + q_b_proj_weight_attrs=q_b_proj_weight_attrs, + q_b_proj_weight_scale_attrs=q_b_proj_weight_scale_attrs, + kv_a_proj_with_mqa_weight_attrs=kv_a_proj_with_mqa_weight_attrs, + kv_a_proj_with_mqa_weight_scale_attrs=kv_a_proj_with_mqa_weight_scale_attrs, + kv_a_layernorm_weight_attrs=kv_a_layernorm_weight_attrs, + kv_b_proj_weight_attrs=kv_b_proj_weight_attrs, + kv_b_proj_weight_scale_attrs=kv_b_proj_weight_scale_attrs, + ) + + moe_config = MoeConfig( + num_experts=self.n_routed_experts, + top_k=self.config.num_experts_per_tok, + topk_group=self.config.topk_group, + norm_topk_prob=self.config.norm_topk_prob, + routed_scaling_factor=self.config.routed_scaling_factor, + num_expert_group=self.config.n_group, + topk_method=self.config.topk_method, + moe_intermediate_size=self.config.moe_intermediate_size, + first_k_dense_replace=self.first_k_dense_replace, + shared_expert_with_gate=False, + shared_expert_intermediate_size=self.config.moe_intermediate_size * self.config.n_shared_experts, + shared_expert_ffn1_weight_attrs=shared_expert_ffn1_weight_attrs, + shared_expert_ffn1_weight_scale_attrs=shared_expert_ffn1_weight_scale_attrs, + shared_expert_ffn2_weight_attrs=shared_expert_ffn2_weight_attrs, + shared_expert_ffn2_weight_scale_attrs=shared_expert_ffn2_weight_scale_attrs, + ) + + speculate_config = SpeculateConfig( + speculate_method=config.get("speculate_method", None), + speculate_max_draft_token_num=config.get("speculate_max_draft_token_num", 5), + return_full_hidden_states=config.get("return_full_hidden_states", False), + ) + + transformer_config = FusedMultiTransformerConfig( + embed_dim=self.hidden_size, + num_heads=self.num_attention_heads, + kv_num_heads=self.num_key_value_heads, + intermediate_size=self.intermediate_size, + quant_type=self.quant_type, + activation="swiglu", + num_layers=config.num_hidden_layers, + nranks=config.tensor_parallel_degree, + ring_id=ring_id, + ln_scale_attrs=ln_scale_attrs, + linear_weight_attrs=out_proj_weight_attrs, + linear_weight_scale_attrs=out_proj_weight_scale_attrs, + ffn_ln_scale_attrs=ffn_ln_scale_attrs, + gate_weight_attrs=gate_weight_attrs, + ffn1_weight_attrs=ffn1_weight_attrs, + ffn1_weight_scale_attrs=ffn1_weight_scale_attrs, + ffn2_weight_attrs=ffn2_weight_attrs, + ffn2_weight_scale_attrs=ffn2_weight_scale_attrs, + 
e_score_correction_bias_attrs=e_score_correction_bias_attrs, + epsilon=self.rms_norm_eps, + rope_theta=self.rope_theta, + rotary_emb=self.rotary_emb, + norm_type="rmsnorm", + rank_id=config.tensor_parallel_rank, + moe_config=moe_config, + mla_config=mla_config, + append_attn=config.append_attn, + speculate_config=speculate_config, + ) + + self.set_transformer_block(transformer_config) + + def get_input_embeddings(self): + return self.embed_tokens + + def set_input_embeddings(self, value): + self.embed_tokens = value + + @paddle.no_grad() + def set_state_dict(self, state_dict): + self.transformer_block.init_weight() + + dtype = paddle.get_default_dtype() + embed_tokens_weight = paddle.to_tensor(state_dict[f"{self.base_model_prefix}.embed_tokens.weight"]).cast( + self.embed_tokens.weight.dtype + ) + norm_weight = paddle.to_tensor(state_dict[f"{self.base_model_prefix}.norm.weight"]).cast( + self.norm.weight.dtype + ) + self.embed_tokens.weight.set_value(embed_tokens_weight) + self.norm.weight.set_value(norm_weight) + + if self.use_weight_only: + logger.info("weight only is enabled") + for idx in range(self.num_layers): + logger.info(f"set state for layer {idx}") + + ln_scale = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.input_layernorm.weight"] + ).cast(self.transformer_block.ln_scales[idx].dtype) + self.transformer_block.ln_scales[idx].set_value(ln_scale) + + if self.config.q_lora_rank is not None: + q_a_proj_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_a_proj.weight"] + ).cast(dtype) + q_a_layernorm_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_a_layernorm.weight"] + ).cast(self.transformer_block.q_a_layernorm_weights[idx].dtype) + q_b_proj_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_b_proj.weight"] + ).cast(dtype) + + if self.use_weight_only: + q_a_proj_quanted_weight, q_a_proj_weight_scale = weight_quantize( + q_a_proj_weight, algo=self.quant_algo + ) + self.transformer_block.q_a_proj_weights[idx].set_value(q_a_proj_quanted_weight) + self.transformer_block.q_a_proj_weights_scale[idx].set_value(q_a_proj_weight_scale) + + q_b_proj_quanted_weight, q_b_proj_weight_scale = weight_quantize( + q_b_proj_weight, algo=self.quant_algo + ) + self.transformer_block.q_b_proj_weights[idx].set_value(q_b_proj_quanted_weight) + self.transformer_block.q_a_layernorm_weights[idx].set_value(q_a_layernorm_weight) + self.transformer_block.q_b_proj_weights_scale[idx].set_value(q_b_proj_weight_scale) + else: + self.transformer_block.q_a_proj_weights[idx].set_value(q_a_proj_weight) + self.transformer_block.q_a_layernorm_weights[idx].set_value(q_a_layernorm_weight) + self.transformer_block.q_b_proj_weights[idx].set_value(q_b_proj_weight) + else: + q_proj_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.q_proj.weight"] + ).cast(dtype) + + if self.use_weight_only: + q_proj_quanted_weight, q_proj_weight_scale = weight_quantize(q_proj_weight, algo=self.quant_algo) + self.transformer_block.q_proj_weights[idx].set_value(q_proj_quanted_weight) + self.transformer_block.q_proj_weights_scale[idx].set_value(q_proj_weight_scale) + else: + self.transformer_block.q_proj_weights[idx].set_value(q_proj_weight) + + kv_a_proj_with_mqa_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.kv_a_proj_with_mqa.weight"] + ).cast(dtype) + kv_a_layernorm_weight = paddle.to_tensor( + 
state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.kv_a_layernorm.weight"] + ).cast(self.transformer_block.kv_a_layernorm_weights[idx].dtype) + kv_b_proj_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.kv_b_proj.weight"] + ).cast(dtype) + + if self.use_weight_only: + kv_a_proj_with_mqa_quanted_weight, kv_a_proj_with_mqa_weight_scale = weight_quantize( + kv_a_proj_with_mqa_weight, algo=self.quant_algo + ) + self.transformer_block.kv_a_proj_with_mqa_weights[idx].set_value(kv_a_proj_with_mqa_quanted_weight) + self.transformer_block.kv_a_proj_with_mqa_weights_scale[idx].set_value(kv_a_proj_with_mqa_weight_scale) + + kv_b_proj_quanted_weight, kv_b_proj_weight_scale = weight_quantize( + kv_b_proj_weight, algo=self.quant_algo + ) + self.transformer_block.kv_b_proj_weights[idx].set_value(kv_b_proj_quanted_weight) + self.transformer_block.kv_a_layernorm_weights[idx].set_value(kv_a_layernorm_weight) + self.transformer_block.kv_b_proj_weights_scale[idx].set_value(kv_b_proj_weight_scale) + else: + self.transformer_block.kv_a_proj_with_mqa_weights[idx].set_value(kv_a_proj_with_mqa_weight) + self.transformer_block.kv_a_layernorm_weights[idx].set_value(kv_a_layernorm_weight) + self.transformer_block.kv_b_proj_weights[idx].set_value(kv_b_proj_weight) + + linear_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.self_attn.o_proj.weight"] + ).cast(dtype) + + if self.use_weight_only: + linear_quanted_weight, linear_weight_scale = weight_quantize(linear_weight, algo=self.quant_algo) + self.transformer_block.linear_weights[idx].set_value(linear_quanted_weight) + self.transformer_block.linear_weights_scale[idx].set_value(linear_weight_scale) + else: + self.transformer_block.linear_weights[idx].set_value(linear_weight) + + ffn_ln_scale = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.post_attention_layernorm.weight"], + ).cast( + self.transformer_block.ffn_ln_scales[idx].dtype, + ) + self.transformer_block.ffn_ln_scales[idx].set_value(ffn_ln_scale) + if idx < self.first_k_dense_replace: + concated_ffn1_weight = np.concatenate( + [ + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.gate_proj.weight"], + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.up_proj.weight"], + ], + axis=-1, + ) + ffn1_weight_tensor = paddle.to_tensor(concated_ffn1_weight).cast(paddle.get_default_dtype()) + + if self.use_weight_only: + ffn1_quanted_weight_tensor, ffn1_weight_scale_tensor = weight_quantize( + ffn1_weight_tensor, algo=self.quant_algo + ) + self.transformer_block.ffn1_weights[idx].set_value(ffn1_quanted_weight_tensor) + self.transformer_block.ffn1_weights_scale[idx].set_value(ffn1_weight_scale_tensor) + else: + self.transformer_block.ffn1_weights[idx].set_value(ffn1_weight_tensor) + + ffn2_weight_tensor = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.down_proj.weight"] + ).cast(paddle.get_default_dtype()) + if self.use_weight_only: + ffn2_quanted_weight_tensor, ffn2_weight_scale_tensor = weight_quantize( + ffn2_weight_tensor, algo=self.quant_algo + ) + self.transformer_block.ffn2_weights[idx].set_value(ffn2_quanted_weight_tensor) + self.transformer_block.ffn2_weights_scale[idx].set_value(ffn2_weight_scale_tensor) + else: + self.transformer_block.ffn2_weights[idx].set_value(ffn2_weight_tensor) + else: + ffn1_weights = [] + ffn2_weights = [] + ffn1_scales = [] + ffn2_scales = [] + + for expert_idx in range(self.n_routed_experts): + concated_gate_up_weight = np.concatenate( + [ + 
state_dict[ + f"{self.base_model_prefix}.layers.{idx}.mlp.experts.{expert_idx}.gate_proj.weight" + ], + state_dict[ + f"{self.base_model_prefix}.layers.{idx}.mlp.experts.{expert_idx}.up_proj.weight" + ], + ], + axis=-1, + ) + ffn1_weight = paddle.to_tensor(concated_gate_up_weight).cast(dtype) + ffn2_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.experts.{expert_idx}.down_proj.weight"] + ).cast(dtype) + + if self.use_weight_only: + ffn1_quanted_weight, ffn1_weight_scale = weight_quantize(ffn1_weight, algo=self.quant_algo) + ffn2_quanted_weight, ffn2_weight_scale = weight_quantize(ffn2_weight, algo=self.quant_algo) + ffn1_weights.append(ffn1_quanted_weight.reshape([self.transformer_block.config.embed_dim, -1])) + ffn2_weights.append(ffn2_quanted_weight.reshape([-1, self.transformer_block.config.embed_dim])) + ffn1_scales.append(ffn1_weight_scale) + ffn2_scales.append(ffn2_weight_scale) + else: + ffn1_weights.append(ffn1_weight) + ffn2_weights.append(ffn2_weight) + + fused_moe_ffn1_weight = paddle.to_tensor(ffn1_weights) + fused_moe_ffn2_weight = paddle.to_tensor(ffn2_weights) + fused_moe_ffn1_weight_scale = paddle.to_tensor(ffn1_scales) + fused_moe_ffn2_weight_scale = paddle.to_tensor(ffn2_scales) + gate_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.gate.weight"] + ).cast("float32") + + if self.base_model_prefix.startswith("deepseek_v3"): + e_score_correction_bias = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.gate.e_score_correction_bias"] + ).cast("float32") + self.transformer_block.e_score_correction_biases[idx].set_value(e_score_correction_bias) + + self.transformer_block.ffn1_weights[idx].set_value(fused_moe_ffn1_weight) + self.transformer_block.ffn2_weights[idx].set_value(fused_moe_ffn2_weight) + self.transformer_block.gate_weights[idx].set_value(gate_weight) + + if self.use_weight_only: + self.transformer_block.ffn1_weights_scale[idx].set_value(fused_moe_ffn1_weight_scale) + self.transformer_block.ffn2_weights_scale[idx].set_value(fused_moe_ffn2_weight_scale) + + concated_gate_up_weight = np.concatenate( + [ + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.shared_experts.gate_proj.weight"], + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.shared_experts.up_proj.weight"], + ], + axis=-1, + ) + shared_expert_ffn1_weight = paddle.to_tensor(concated_gate_up_weight).cast(dtype) + shared_expert_ffn2_weight = paddle.to_tensor( + state_dict[f"{self.base_model_prefix}.layers.{idx}.mlp.shared_experts.down_proj.weight"] + ).cast(dtype) + + if self.use_weight_only: + shared_expert_ffn1_quanted_weight, shared_expert_ffn1_weight_scale = weight_quantize( + shared_expert_ffn1_weight, algo=self.quant_algo + ) + self.transformer_block.shared_expert_ffn1_weights[idx].set_value(shared_expert_ffn1_quanted_weight) + self.transformer_block.shared_expert_ffn1_weights_scale[idx].set_value( + shared_expert_ffn1_weight_scale + ) + + shared_expert_ffn2_quanted_weight, shared_expert_ffn2_weight_scale = weight_quantize( + shared_expert_ffn2_weight, algo=self.quant_algo + ) + self.transformer_block.shared_expert_ffn2_weights[idx].set_value(shared_expert_ffn2_quanted_weight) + self.transformer_block.shared_expert_ffn2_weights_scale[idx].set_value( + shared_expert_ffn2_weight_scale + ) + else: + self.transformer_block.shared_expert_ffn1_weights[idx].set_value(shared_expert_ffn1_weight) + self.transformer_block.shared_expert_ffn2_weights[idx].set_value(shared_expert_ffn2_weight) + + def 
set_transformer_block(self, transformer_config): + if self.use_weight_only: + self.transformer_block = FusedBlockMultiTransformerWeightOnly(transformer_config) + else: + self.transformer_block = FusedBlockMultiTransformer(transformer_config) + + def remove_padding(self, input_ids, seq_lens_this_time, draft_tokens=None, seq_lens_encoder=None): + cum_offsets_now = paddle.cumsum(self.max_seq_len - seq_lens_this_time) + token_num = paddle.sum(seq_lens_this_time) + from paddlenlp_ops import get_padding_offset_v2 + + ids_remove_padding, cum_offsets, padding_offset, cu_seqlens_q, cu_seqlens_k = get_padding_offset_v2( + input_ids, cum_offsets_now, token_num, seq_lens_this_time, draft_tokens, seq_lens_encoder + ) + return ids_remove_padding, padding_offset, cum_offsets, cu_seqlens_q, cu_seqlens_k + + def forward( + self, + input_ids=None, + attention_mask=None, + inputs_embeds=None, + caches=None, + pre_caches=None, + **kwargs, + ): + + seq_lens_this_time = kwargs.get("seq_lens_this_time", None) + draft_tokens = kwargs.get("draft_tokens", None) + seq_lens_encoder = kwargs.get("seq_lens_encoder", None) + + ids_remove_padding, padding_offset, cum_offsets, cu_seqlens_q, cu_seqlens_k = self.remove_padding( + input_ids, seq_lens_this_time, draft_tokens, seq_lens_encoder + ) + + kwargs["cu_seqlens_q"] = cu_seqlens_q + kwargs["cu_seqlens_k"] = cu_seqlens_k + kwargs["padding_offsets"] = padding_offset + kwargs["max_input_length"] = self.max_seq_len + + inputs_embeds = self.embed_tokens(ids_remove_padding) + + with dy2st_nocheck_guard_context(): + hidden_states, _ = self.transformer_block( + input_ids=input_ids, + src=inputs_embeds, + cum_offsets=cum_offsets, + attn_mask=attention_mask, + caches=caches, + pre_caches=pre_caches, + rotary_embs=None, + **kwargs, + ) + hidden_states = self.norm(hidden_states) + + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=None, + hidden_states=None, + attentions=None, + cum_offsets=cum_offsets, + ) + + +@register_base_model +class MTPDeepseekV2BlockInferenceModel(DeepseekV2BlockInferenceModel): + def __init__(self, config: DeepseekV2Config, base_model_prefix: str): + super().__init__(config, base_model_prefix) + from paddle.distributed.fleet.layers.mpu.mp_layers import ColumnParallelLinear + + self.enorm = DeepseekV2RMSNorm(config) + self.hnorm = DeepseekV2RMSNorm(config) + self.norm = DeepseekV2RMSNorm(config) + + if config.tensor_parallel_degree > 1: + self.eh_proj = ColumnParallelLinear( + self.hidden_size * 2, self.hidden_size, has_bias=True, gather_output=True, fuse_matmul_bias=True + ) + else: + self.eh_proj = nn.Linear(self.hidden_size * 2, self.hidden_size, bias_attr=True) + + def forward( + self, + input_ids=None, + attention_mask=None, + inputs_embeds=None, + caches=None, + pre_caches=None, + output_attentions=False, + output_hidden_states=None, + return_dict=False, + **kwargs, + ): + seq_lens_this_time = kwargs.get("seq_lens_this_time", None) + rope_emb = kwargs.get("rope_emb", None) + draft_tokens = kwargs.get("draft_tokens", None) + seq_lens_encoder = kwargs.get("seq_lens_encoder", None) + pre_hidden_states = kwargs.get("pre_hidden_states", None) + ids_remove_padding, padding_offset, cum_offsets, cu_seqlens_q, cu_seqlens_k = self.remove_padding( + input_ids, seq_lens_this_time, draft_tokens, seq_lens_encoder + ) + + kwargs["cu_seqlens_q"] = cu_seqlens_q + kwargs["cu_seqlens_k"] = cu_seqlens_k + kwargs["padding_offsets"] = padding_offset + kwargs["max_input_length"] = self.max_seq_len + + inputs_embeds = 
self.embed_tokens(ids_remove_padding) + inputs_embeds = paddle.concat([self.enorm(inputs_embeds), self.hnorm(pre_hidden_states)], axis=-1) + inputs_embeds = self.eh_proj(inputs_embeds) + + with dy2st_nocheck_guard_context(): + hidden_states, _ = self.transformer_block( + input_ids=input_ids, + src=inputs_embeds, + cum_offsets=cum_offsets, + attn_mask=attention_mask, + caches=caches, + pre_caches=pre_caches, + rotary_embs=rope_emb, + post_rebuild_padding=True, + **kwargs, + ) + hidden_states = self.norm(hidden_states) + + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=None, + hidden_states=None, + attentions=None, + ) + + +class DeepseekV2ForCausalLMBlockInferenceModel(GenerationBlockInferenceModel, DeepseekV2PretrainedModel): + """ + Dynamic Batching for DeepseekV2 Model with pretraining tasks on top. + """ + + _keys_to_ignore_on_load_missing = [r"lm_head.weight"] + + def __init__(self, config: DeepseekV2Config, base_model_prefix: str = "deepseek_v2"): + super().__init__(config) + self.base_model_prefix = base_model_prefix + + self.max_candidate_len = config.get("speculate_max_candidate_len", 5) + self.verify_window = config.get("speculate_verify_window", 2) + self.max_seq_len = config.max_seq_len + self.return_full_hidden_states = config.get("return_full_hidden_states", False) + + self.deepseek_v2 = DeepseekV2BlockInferenceModel(config, base_model_prefix) + if config.tie_word_embeddings: + self.lm_head = DeepseekV2LMHead( + config, embedding_weights=self.deepseek_v2.embed_tokens.weight, transpose_y=True + ) + self.tie_weights() + else: + self.lm_head = DeepseekV2LMHead(config) + + @classmethod + def _get_tensor_parallel_mappings(cls, config: DeepseekV2Config, is_split=True): + + logger.info("DeepseekV2 inference model _get_tensor_parallel_mappings") + + from paddlenlp.transformers.conversion_utils import split_or_merge_func + + fn = split_or_merge_func( + is_split=is_split, + tensor_parallel_degree=config.tensor_parallel_degree, + tensor_parallel_rank=config.tensor_parallel_rank, + num_attention_heads=config.num_attention_heads, + ) + + def get_tensor_parallel_split_mappings(num_layers): + final_actions = {} + + base_actions = { + "lm_head.weight": partial(fn, is_column=True), + "eh_proj.weight": partial(fn, is_column=True), + # Row Linear + "embed_tokens.weight": partial(fn, is_column=False), + "layers.0.self_attn.o_proj.weight": partial(fn, is_column=False), + } + + # Column Linear + base_actions["layers.0.self_attn.q_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.self_attn.q_b_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.self_attn.kv_b_proj.weight"] = partial(fn, is_column=True) + + base_actions["layers.0.mlp.gate_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.up_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.down_proj.weight"] = partial(fn, is_column=False) + + for expert_idx in range(config.n_routed_experts): + base_actions[f"layers.0.mlp.experts.{expert_idx}.up_proj.weight"] = partial(fn, is_column=True) + base_actions[f"layers.0.mlp.experts.{expert_idx}.gate_proj.weight"] = partial(fn, is_column=True) + base_actions[f"layers.0.mlp.experts.{expert_idx}.down_proj.weight"] = partial(fn, is_column=False) + base_actions["layers.0.mlp.shared_experts.up_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.shared_experts.gate_proj.weight"] = partial(fn, is_column=True) + 
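The is_column flags in these tensor-parallel mappings follow the usual Megatron-style convention. A schematic shard (NumPy-only, not PaddleNLP's actual split_or_merge_func) under Paddle's [in_features, out_features] weight layout:

import numpy as np


def shard_weight(weight: np.ndarray, tp_degree: int, tp_rank: int, is_column: bool) -> np.ndarray:
    """Illustrative tensor-parallel shard of a 2-D weight.

    Column-parallel layers (q/up/gate projections, lm_head, eh_proj) split the
    output dimension; row-parallel layers (o_proj, down_proj, embed_tokens)
    split the input/vocab dimension. Assumes the split axis divides evenly.
    """
    axis = 1 if is_column else 0
    return np.split(weight, tp_degree, axis=axis)[tp_rank]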
base_actions["layers.0.mlp.shared_experts.down_proj.weight"] = partial(fn, is_column=False) + + # MTP parts + base_actions["layers.61.embed_tokens.weight"] = partial(fn, is_column=False) + base_actions["layers.61.eh_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.61.shared_head.head.weight"] = partial(fn, is_column=True) + + for key, action in base_actions.items(): + if "layers.0." in key: + for i in range(num_layers): + final_actions[key.replace("layers.0.", f"layers.{i}.")] = action + final_actions[key] = action + + return final_actions + + mappings = get_tensor_parallel_split_mappings(config.num_hidden_layers) + + return mappings + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs): + return infererence_model_from_pretrained(cls, pretrained_model_name_or_path, args, kwargs) + + @classmethod + def get_cache_kvs_shape( + cls, config: DeepseekV2Config, max_batch_size: int = None, max_length: int = None + ) -> list[list[int]]: + """get cache_kvs tensor for DeepseekV2 model + + Args: + max_batch_size (int): the max batch size + max_length (int | None, optional): the max_length of cache_kvs. Defaults to None. + + Returns: + list[paddle.Tensor]: the list tensor shape for cache + """ + max_block_per_seq = (config.max_seq_len + config.block_size - 1) // config.block_size + if max_batch_size == -1: + max_block_nums = None + else: + max_block_nums = max_batch_size * max_block_per_seq + + cache_kvs = [] + for _ in range(config.num_hidden_layers): + cache_k_shape = [ + max_block_nums, + config.num_key_value_heads // max(config.tensor_parallel_degree, 1), + config.block_size, + config.qk_nope_head_dim + config.qk_rope_head_dim, + ] + cache_v_shape = [ + max_block_nums, + config.num_key_value_heads // max(config.tensor_parallel_degree, 1), + config.block_size, + config.v_head_dim, + ] + cache_kvs.append(cache_k_shape) + cache_kvs.append(cache_v_shape) + return cache_kvs + + def prepare_inputs_for_generation(self, **kwargs): + # only last token for inputs_ids if cache is defined in kwargs + input_ids = kwargs["input_ids"] + src_mask = kwargs.get("src_mask", None) + block_tables = kwargs.get("block_tables", None) + + pre_caches = kwargs.get("pre_caches", None) + caches = kwargs.get("caches", None) + + seq_lens_this_time = kwargs["seq_lens_this_time"] + seq_lens_encoder = kwargs["seq_lens_encoder"] + seq_lens_decoder = kwargs["seq_lens_decoder"] + k_quant_scales = kwargs.get("k_quant_scales", None) + v_quant_scales = kwargs.get("v_quant_scales", None) + k_dequant_scales = kwargs.get("k_dequant_scales", None) + v_dequant_scales = kwargs.get("v_dequant_scales", None) + + # speculative decoding related parameters + draft_tokens = kwargs.get("draft_tokens", None) + output_padding_offset = kwargs.get("output_padding_offset", None) + + model_inputs = { + "input_ids": input_ids, + "src_mask": src_mask, + "rope_emb": None, + "pre_caches": pre_caches, + "caches": caches, + "seq_lens_this_time": seq_lens_this_time, + "seq_lens_encoder": seq_lens_encoder, + "seq_lens_decoder": seq_lens_decoder, + "block_tables": block_tables, + "k_quant_scales": k_quant_scales, + "v_quant_scales": v_quant_scales, + "k_dequant_scales": k_dequant_scales, + "v_dequant_scales": v_dequant_scales, + "draft_tokens": draft_tokens, + "output_padding_offset": output_padding_offset, + } + return model_inputs + + def forward( + self, + input_ids, + src_mask=None, + pre_caches=None, + caches=None, + seq_lens_this_time=None, + seq_lens_encoder=None, + seq_lens_decoder=None, + 
rope_emb=None, + block_tables=None, + k_quant_scales=None, + v_quant_scales=None, + k_dequant_scales=None, + v_dequant_scales=None, + draft_tokens=None, + output_padding_offset=None, + ): + outputs = self.deepseek_v2( + input_ids, + src_mask=src_mask, + caches=caches, + rope_emb=None, + block_tables=block_tables, + pre_caches=pre_caches, + seq_lens_this_time=seq_lens_this_time, + seq_lens_encoder=seq_lens_encoder, + seq_lens_decoder=seq_lens_decoder, + k_quant_scales=k_quant_scales, + v_quant_scales=v_quant_scales, + k_dequant_scales=k_dequant_scales, + v_dequant_scales=v_dequant_scales, + draft_tokens=draft_tokens, + output_padding_offset=output_padding_offset, + ) + if self.return_full_hidden_states: + from paddlenlp_ops import rebuild_padding_v2 + + full_hidden_states = outputs[0] + cum_offsets = outputs[1] + hidden_states = rebuild_padding_v2( + full_hidden_states, + cum_offsets, + seq_lens_decoder, + seq_lens_encoder, + output_padding_offset, + self.max_seq_len, + ) + else: + hidden_states = outputs[0] + logits = self.lm_head( + hidden_states, + tensor_parallel_output=False, + ) + if self.return_full_hidden_states: + return logits, full_hidden_states + else: + return logits + + return logits + + @paddle.no_grad() + def set_state_dict(self, state_dict): + if "lm_head.weight" in state_dict: + self.lm_head.weight.set_value( + paddle.to_tensor(state_dict["lm_head.weight"]).cast(self.lm_head.weight.dtype) + ) + self.deepseek_v2.set_state_dict({k: state_dict[k] for k in state_dict.keys()}) + + +class MTPDeepseekV2ForCausalLMBlockInferenceModel(DeepseekV2ForCausalLMBlockInferenceModel): + def __init__(self, config, base_model_prefix): + super(DeepseekV2ForCausalLMBlockInferenceModel, self).__init__(config, base_model_prefix="deepseek_v3_mtp") + self.max_candidate_len = config.get("speculate_max_candidate_len", 5) + self.verify_window = config.get("speculate_verify_window", 2) + self.max_seq_len = config.max_seq_len + + self.mtp = MTPDeepseekV2BlockInferenceModel(config, base_model_prefix="deepseek_v3_mtp") + self.tensor_parallel_rank = config.tensor_parallel_rank + if config.tie_word_embeddings: + self.lm_head = DeepseekV2LMHead(config, embedding_weights=self.llama.embed_tokens.weight, transpose_y=True) + self.tie_weights() + else: + self.lm_head = DeepseekV2LMHead(config) + + def prepare_inputs_for_generation(self, **kwargs): + # only last token for inputs_ids if cache is defined in kwargs + input_ids = kwargs["input_ids"] + src_mask = kwargs.get("src_mask", None) + block_tables = kwargs.get("block_tables", None) + + pre_caches = kwargs.get("pre_caches", None) + caches = kwargs.get("caches", None) + + seq_lens_this_time = kwargs["seq_lens_this_time"] + seq_lens_encoder = kwargs["seq_lens_encoder"] + seq_lens_decoder = kwargs["seq_lens_decoder"] + k_quant_scales = kwargs.get("k_quant_scales", None) + v_quant_scales = kwargs.get("v_quant_scales", None) + k_dequant_scales = kwargs.get("k_dequant_scales", None) + v_dequant_scales = kwargs.get("v_dequant_scales", None) + + # speculative decoding related parameters + draft_tokens = kwargs.get("draft_tokens", None) + output_padding_offset = kwargs.get("output_padding_offset", None) + hidden_states = kwargs.get("hidden_states", None) + + model_inputs = { + "input_ids": input_ids, + "src_mask": src_mask, + "rope_emb": None, + "pre_caches": pre_caches, + "caches": caches, + "seq_lens_this_time": seq_lens_this_time, + "seq_lens_encoder": seq_lens_encoder, + "seq_lens_decoder": seq_lens_decoder, + "block_tables": block_tables, + "k_quant_scales": 
k_quant_scales, + "v_quant_scales": v_quant_scales, + "k_dequant_scales": k_dequant_scales, + "v_dequant_scales": v_dequant_scales, + "draft_tokens": draft_tokens, + "output_padding_offset": output_padding_offset, + "pre_hidden_states": hidden_states, + } + return model_inputs + + @paddle.no_grad() + def set_state_dict(self, state_dict): + if "lm_head.weight" in state_dict: + self.lm_head.weight.set_value( + paddle.to_tensor(state_dict["lm_head.weight"]).cast(self.lm_head.weight.dtype) + ) + + self.mtp.enorm.weight.set_value( + paddle.to_tensor(state_dict["deepseek_v3_mtp.enorm.weight"]).cast(self.lm_head.weight.dtype) + ) + self.mtp.hnorm.weight.set_value( + paddle.to_tensor(state_dict["deepseek_v3_mtp.hnorm.weight"]).cast(self.lm_head.weight.dtype) + ) + self.mtp.norm.weight.set_value( + paddle.to_tensor(state_dict["deepseek_v3_mtp.norm.weight"]).cast(self.lm_head.weight.dtype) + ) + self.mtp.eh_proj.weight.set_value( + paddle.to_tensor(state_dict["deepseek_v3_mtp.eh_proj.weight"]).cast(self.lm_head.weight.dtype) + ) + + self.mtp.set_state_dict({k: state_dict[k] for k in state_dict.keys()}) + + def forward( + self, + input_ids, + src_mask=None, + pre_caches=None, + caches=None, + seq_lens_this_time=None, + seq_lens_encoder=None, + seq_lens_decoder=None, + rope_emb=None, + block_tables=None, + k_quant_scales=None, + v_quant_scales=None, + k_dequant_scales=None, + v_dequant_scales=None, + draft_tokens=None, + output_padding_offset=None, + pre_hidden_states=None, + ): + outputs = self.mtp( + input_ids, + src_mask=src_mask, + caches=caches, + rope_emb=rope_emb, + block_tables=block_tables, + pre_caches=pre_caches, + seq_lens_this_time=seq_lens_this_time, + seq_lens_encoder=seq_lens_encoder, + seq_lens_decoder=seq_lens_decoder, + k_quant_scales=k_quant_scales, + v_quant_scales=v_quant_scales, + k_dequant_scales=k_dequant_scales, + v_dequant_scales=v_dequant_scales, + draft_tokens=draft_tokens, + output_padding_offset=output_padding_offset, + pre_hidden_states=pre_hidden_states, + ) + + hidden_states = outputs[0] + + logits = self.lm_head( + hidden_states, + tensor_parallel_output=False, + ) + + return logits, hidden_states diff --git a/paddlenlp/experimental/transformers/deepseek_v3/__init__.py b/paddlenlp/experimental/transformers/deepseek_v3/__init__.py new file mode 100644 index 000000000000..c2a7f656c636 --- /dev/null +++ b/paddlenlp/experimental/transformers/deepseek_v3/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .modeling import * diff --git a/paddlenlp/experimental/transformers/deepseek_v3/modeling.py b/paddlenlp/experimental/transformers/deepseek_v3/modeling.py new file mode 100644 index 000000000000..5a63a7a548ff --- /dev/null +++ b/paddlenlp/experimental/transformers/deepseek_v3/modeling.py @@ -0,0 +1,32 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +from paddlenlp.experimental.transformers.deepseek_v2.modeling import ( + DeepseekV2ForCausalLMBlockInferenceModel, + MTPDeepseekV2ForCausalLMBlockInferenceModel, +) +from paddlenlp.transformers import DeepseekV3Config + +__all__ = ["DeepseekV3ForCausalLMBlockInferenceModel"] + + +class DeepseekV3ForCausalLMBlockInferenceModel(DeepseekV2ForCausalLMBlockInferenceModel): + def __init__(self, config: DeepseekV3Config, base_model_prefix: str = "deepseek_v3"): + super().__init__(config, base_model_prefix) + + +class MTPDeepseekV3ForCausalLMBlockInferenceModel(MTPDeepseekV2ForCausalLMBlockInferenceModel): + def __init__(self, config: DeepseekV3Config, base_model_prefix: str = "deepseek_v3_mtp"): + super().__init__(config, base_model_prefix) diff --git a/paddlenlp/experimental/transformers/fused_transformer_layers.py b/paddlenlp/experimental/transformers/fused_transformer_layers.py index 00d80cad9b07..a9de2fea3469 100644 --- a/paddlenlp/experimental/transformers/fused_transformer_layers.py +++ b/paddlenlp/experimental/transformers/fused_transformer_layers.py @@ -76,6 +76,7 @@ def use_cutlass_fp8_gemm(): __all__ = [ "MoeConfig", + "MLAConfig", "FusedMultiTransformerConfig", "FusedMultiTransformerBase", "FusedMultiTransformerPostLayernorm", @@ -107,8 +108,16 @@ def _set_var_distributed(var): class MoeConfig: num_experts: int = 0 top_k: int = 0 + topk_method: Optional[str] = None + num_expert_group: int = 1 + topk_group: Optional[int] = None norm_topk_prob: bool = True moe_every2: bool = False + first_k_dense_replace: int = 0 + moe_intermediate_size: int = 0 + routed_scaling_factor: float = 1.0 + + shared_expert_with_gate: bool = True shared_expert_intermediate_size: int = 0 shared_expert_ffn1_weight_attrs: Optional[List[paddle.ParamAttr]] = None @@ -121,7 +130,11 @@ def has_moe(self) -> bool: return self.num_experts > 1 def use_moe(self, i: int) -> bool: - return self.has_moe() and (self.moe_every2 is False or (self.moe_every2 and i % 2 == 1)) + return ( + self.has_moe() + and (self.moe_every2 is False or (self.moe_every2 and i % 2 == 1)) + and i >= self.first_k_dense_replace + ) def has_shared_expert(self) -> bool: return self.has_moe() and self.shared_expert_intermediate_size > 0 @@ -141,6 +154,39 @@ class AvxConfig: class SpeculateConfig: speculate_max_draft_token_num: int = 5 speculate_method: str = None + return_full_hidden_states: bool = False + + +@dataclass +class MLAConfig: + q_lora_rank: int = None + kv_lora_rank: int = None + qk_nope_head_dim: int = None + qk_rope_head_dim: int = None + v_head_dim: int = None + + mscale: float = 1.0 + + q_proj_weight_attrs: Optional[List[paddle.ParamAttr]] = None + q_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None + + q_a_proj_weight_attrs: Optional[List[paddle.ParamAttr]] = None + q_a_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None + q_a_layernorm_weight_attrs: Optional[List[paddle.ParamAttr]] = None + q_b_proj_weight_attrs: 
Optional[List[paddle.ParamAttr]] = None + q_b_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None + kv_a_proj_with_mqa_weight_attrs: List[paddle.ParamAttr] = None + kv_a_proj_with_mqa_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None + kv_a_layernorm_weight_attrs: List[paddle.ParamAttr] = None + kv_b_proj_weight_attrs: List[paddle.ParamAttr] = None + kv_b_proj_weight_scale_attrs: Optional[List[paddle.ParamAttr]] = None + + def use_mla(self) -> bool: + return self.kv_lora_rank is not None + + @property + def qk_head_dim(self) -> int: + return self.qk_nope_head_dim + self.qk_rope_head_dim class FusedMultiTransformerConfig: @@ -148,13 +194,14 @@ def __init__( self, embed_dim, num_heads, - dim_feedforward, + intermediate_size, quant_type="", dropout_rate=0.0, activation="gelu", norm_type="layernorm", use_neox_rotary_style=False, rope_theta=10000.0, + rotary_emb=None, normalize_before=True, ln_scale_attrs=None, ln_bias_attrs=None, @@ -180,6 +227,7 @@ def __init__( ffn2_weight_attrs=None, ffn2_weight_scale_attrs=None, ffn2_bias_attrs=None, + e_score_correction_bias_attrs=None, qkv_out_scale_attrs=None, linear_out_scale_attrs=None, ffn1_out_scale_attrs=None, @@ -208,6 +256,7 @@ def __init__( moe_config=MoeConfig(), avx_config=AvxConfig(), speculate_config=SpeculateConfig(), + mla_config=MLAConfig(), ): self.embed_dim = embed_dim self.num_heads = num_heads @@ -215,11 +264,14 @@ def __init__( self.kv_num_heads = kv_num_heads else: self.kv_num_heads = num_heads - self.dim_feedforward = dim_feedforward + self.intermediate_size = intermediate_size self.dropout_rate = dropout_rate self.activation = activation self.norm_type = norm_type self.rope_theta = rope_theta + + self.rotary_emb = rotary_emb + self.use_neox_rotary_style = use_neox_rotary_style self.normalize_before = normalize_before self.ln_scale_attrs = ln_scale_attrs @@ -250,6 +302,8 @@ def __init__( self.ffn2_weight_scale_attrs = ffn2_weight_scale_attrs self.ffn2_bias_attrs = ffn2_bias_attrs + self.e_score_correction_bias_attrs = e_score_correction_bias_attrs + self.qkv_out_scale_attrs = qkv_out_scale_attrs self.linear_out_scale_attrs = linear_out_scale_attrs self.ffn1_out_scale_attrs = ffn1_out_scale_attrs @@ -286,6 +340,7 @@ def __init__( self.moe_config = moe_config self.avx_config = avx_config self.speculate_config = speculate_config + self.mla_config = mla_config class FusedMultiTransformerBase(Layer): @@ -298,8 +353,8 @@ def __init__(self, config: FusedMultiTransformerConfig): config.embed_dim ) assert config.num_heads > 0, "Expected nhead to be greater than 0, " "but received {}".format(config.num_heads) - assert config.dim_feedforward > 0, "Expected dim_feedforward to be greater than 0, but received {}".format( - config.dim_feedforward + assert config.intermediate_size > 0, "Expected intermediate_size to be greater than 0, but received {}".format( + config.intermediate_size ) # self.normalize_before = normalize_before @@ -332,27 +387,38 @@ def __init__(self, config: FusedMultiTransformerConfig): self.activation = config.activation self.embed_dim = config.embed_dim - self.head_dim = config.embed_dim // config.num_heads - assert self.head_dim * config.num_heads == config.embed_dim, "embed_dim must be divisible by num_heads" + if config.mla_config.use_mla(): + self.head_dim = config.mla_config.v_head_dim + else: + self.head_dim = config.embed_dim // config.num_heads + assert self.head_dim * config.num_heads == config.embed_dim, "embed_dim must be divisible by num_heads" # tensor model parallel if config.nranks > 1: 
assert config.ring_id != -1 assert config.num_heads % config.nranks == 0 - assert config.dim_feedforward % config.nranks == 0 + assert config.intermediate_size % config.nranks == 0 assert config.moe_config.shared_expert_intermediate_size % config.nranks == 0 + assert config.moe_config.moe_intermediate_size % config.nranks == 0 self.num_heads = config.num_heads // config.nranks self.kv_num_heads = config.kv_num_heads // config.nranks - dim_feedforward = config.dim_feedforward // config.nranks - self.dim_feedforward = dim_feedforward - shared_expert_intermediate_size = config.moe_config.shared_expert_intermediate_size // config.nranks - self.config.moe_config.shared_expert_intermediate_size = shared_expert_intermediate_size + self.intermediate_size = config.intermediate_size // config.nranks + self.config.moe_config.shared_expert_intermediate_size //= config.nranks + self.config.moe_config.moe_intermediate_size //= config.nranks self.num_layers = config.num_layers assert self.num_layers > 0 - if isinstance(config.qkv_weight_attrs, (list, tuple)): + if config.qkv_weight_attrs is not None and isinstance(config.qkv_weight_attrs, (list, tuple)): assert self.num_layers == len(config.qkv_weight_attrs) + if self.config.mla_config.use_mla(): + mscale = self.config.mla_config.mscale + self.softmax_scale = float(self.config.mla_config.qk_head_dim**-0.5) * mscale * mscale + else: + self.softmax_scale = float(self.head_dim**-0.5) + + self.position_ids: list[int] = [] + self.weight_dtype = self._dtype self.create_params_type = self.get_weight_create_dype() @@ -362,10 +428,12 @@ def __init__(self, config: FusedMultiTransformerConfig): self.ffn_ln_scales, self.ffn_ln_biases = [], [] self.ffn1_biases = [] self.ffn2_biases = [] - if self.config.moe_config.has_shared_expert(): - self.shared_expert_gate_weights = [] - self.shared_expert_ffn1_weights = [] - self.shared_expert_ffn2_weights = [] + self.e_score_correction_biases = [] + + self.shared_expert_gate_weights = [] + self.shared_expert_ffn1_weights = [] + self.shared_expert_ffn2_weights = [] + self.cache_k_scales, self.cache_v_scales = [], [] self.cache_k_out_scales, self.cache_v_out_scales = [], [] @@ -382,11 +450,7 @@ def __init__(self, config: FusedMultiTransformerConfig): ffn_ln_bias_attr = self.get_attr(config.ffn_ln_bias_attrs, i) ffn1_bias_attr = self.get_attr(config.ffn1_bias_attrs, i) ffn2_bias_attr = self.get_attr(config.ffn2_bias_attrs, i) - - if self.config.moe_config.use_shared_expert(i): - shared_expert_gate_weight_attr = self.get_attr(config.moe_config.shared_expert_gate_weight_attrs, i) - shared_expert_ffn1_weight_attr = self.get_attr(config.moe_config.shared_expert_ffn1_weight_attrs, i) - shared_expert_ffn2_weight_attr = self.get_attr(config.moe_config.shared_expert_ffn2_weight_attrs, i) + e_score_correction_bias_attr = self.get_attr(config.e_score_correction_bias_attrs, i) cache_k_scale_attr = self.get_attr(config.cache_k_scale_attrs, i) cache_v_scale_attr = self.get_attr(config.cache_v_scale_attrs, i) @@ -447,21 +511,34 @@ def __init__(self, config: FusedMultiTransformerConfig): if ffn1_bias_attr: if self.config.moe_config.use_moe(i): ffn1_bias = self.create_parameter( - shape=[self.config.moe_config.num_experts, self.dim_feedforward * 2] + shape=[self.config.moe_config.num_experts, self.intermediate_size * 2] if self.activation.endswith("glu") - else [self.config.moe_config.num_experts, self.dim_feedforward], + else [self.config.moe_config.num_experts, self.intermediate_size], attr=ffn1_bias_attr, dtype=self._dtype, is_bias=True, ) 
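# Illustrative sketch (not part of this diff): how the `softmax_scale` assigned earlier in this
# constructor differs between the MLA branch and the standard attention branch. With MLA the
# scale is based on the full query/key head dim (nope + rope parts) and folds in the `mscale`
# factor twice, which appears to mirror DeepSeek-V2's YaRN rope-scaling adjustment; the values
# below are placeholders, not taken from this patch.
qk_nope_head_dim, qk_rope_head_dim, head_dim, mscale = 128, 64, 128, 1.0
qk_head_dim = qk_nope_head_dim + qk_rope_head_dim                # MLAConfig.qk_head_dim property
mla_softmax_scale = float(qk_head_dim**-0.5) * mscale * mscale   # MLA branch
standard_softmax_scale = float(head_dim**-0.5)                   # non-MLA branch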
else: ffn1_bias = self.create_parameter( - shape=[dim_feedforward * 2] if self.activation.endswith("glu") else [dim_feedforward], + shape=[self.intermediate_size * 2] + if self.activation.endswith("glu") + else [self.intermediate_size], attr=ffn1_bias_attr, dtype=self._dtype, is_bias=True, ) + e_score_correction_bias = None + if e_score_correction_bias_attr: + if self.config.moe_config.use_moe(i): + if self.config.moe_config.topk_method == "noaux_tc": + e_score_correction_bias = self.create_parameter( + shape=[self.config.moe_config.num_experts], + attr=e_score_correction_bias_attr, + dtype="float32", + is_bias=True, + ) + ffn2_bias = None if ffn2_bias_attr: if self.config.moe_config.use_moe(i): @@ -479,26 +556,9 @@ def __init__(self, config: FusedMultiTransformerConfig): is_bias=True, ) - if self.config.moe_config.use_shared_expert(i): - shared_expert_ffn1_weight = self.create_parameter( - shape=self.shared_expert_ffn1_weight_shape, - attr=shared_expert_ffn1_weight_attr, - dtype=self.create_params_type, - ) - shared_expert_ffn2_weight = self.create_parameter( - shape=self.shared_expert_ffn2_weight_shape, - attr=shared_expert_ffn2_weight_attr, - dtype=self.create_params_type, - ) - shared_expert_gate_weight = self.create_parameter( - shape=self.shared_expert_gate_weight_shape, - attr=shared_expert_gate_weight_attr, - dtype=self._helper.get_default_dtype(), - ) - cache_scale_dtype = "float32" if self.config.append_attn: - cache_scale_dtype = paddle.get_default_dtype() + cache_scale_dtype = self._dtype cache_k_scale = None if cache_k_scale_attr: @@ -541,9 +601,6 @@ def __init__(self, config: FusedMultiTransformerConfig): # column parallel _set_var_distributed(qkv_bias) _set_var_distributed(ffn1_bias) - if self.config.moe_config.use_shared_expert(i): - _set_var_distributed(shared_expert_ffn1_weight) - _set_var_distributed(shared_expert_ffn2_weight) self.ln_scales.append(ln_scale) self.ln_biases.append(ln_bias) @@ -554,11 +611,7 @@ def __init__(self, config: FusedMultiTransformerConfig): self.ffn_ln_biases.append(ffn_ln_bias) self.ffn1_biases.append(ffn1_bias) self.ffn2_biases.append(ffn2_bias) - - if self.config.moe_config.use_shared_expert(i): - self.shared_expert_ffn1_weights.append(shared_expert_ffn1_weight) - self.shared_expert_ffn2_weights.append(shared_expert_ffn2_weight) - self.shared_expert_gate_weights.append(shared_expert_gate_weight) + self.e_score_correction_biases.append(e_score_correction_bias) self.cache_k_scales.append(cache_k_scale) self.cache_v_scales.append(cache_v_scale) @@ -574,11 +627,7 @@ def __init__(self, config: FusedMultiTransformerConfig): self._add_parameter(ffn_ln_bias) self._add_parameter(ffn1_bias) self._add_parameter(ffn2_bias) - - if self.config.moe_config.use_shared_expert(i): - self._add_parameter(shared_expert_ffn1_weight) - self._add_parameter(shared_expert_ffn2_weight) - self._add_parameter(shared_expert_gate_weight) + self._add_parameter(e_score_correction_bias) self._add_parameter(cache_k_scale) self._add_parameter(cache_v_scale) @@ -587,10 +636,6 @@ def __init__(self, config: FusedMultiTransformerConfig): self.dropout_rate = config.dropout_rate - from paddle.incubate.nn.functional import fused_linear - - self.linear = fused_linear - def init_weight(self): self.qkv_weights = [] self.linear_weights = [] @@ -598,19 +643,86 @@ def init_weight(self): self.ffn1_weights = [] self.ffn2_weights = [] + self.q_proj_weights = [] + self.q_a_proj_weights = [] + self.q_a_layernorm_weights = [] + self.q_b_proj_weights = [] + self.kv_a_proj_with_mqa_weights = [] + 
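# Illustrative sketch (not part of this diff): the weight lists initialized here hold the
# multi-head latent attention (MLA) projections created below. Queries go through a low-rank
# A/B pair (q_a_proj -> q_b_proj), while keys/values are reconstructed from a compressed latent
# plus a small shared rotary ("pe") slice (kv_a_proj_with_mqa -> kv_b_proj). A minimal dataflow
# sketch with made-up sizes; the RMSNorm steps and the rotary embedding that compute_qkv_linear
# applies are omitted for brevity.
import paddle

tokens, embed_dim, num_heads = 4, 1024, 8
q_lora_rank, kv_lora_rank = 256, 128
qk_nope, qk_rope, v_head = 64, 32, 64

x = paddle.randn([tokens, embed_dim])
w_q_a = paddle.randn([embed_dim, q_lora_rank])
w_q_b = paddle.randn([q_lora_rank, num_heads * (qk_nope + qk_rope)])
w_kv_a = paddle.randn([embed_dim, kv_lora_rank + qk_rope])
w_kv_b = paddle.randn([kv_lora_rank, num_heads * (qk_nope + v_head)])

query = paddle.matmul(paddle.matmul(x, w_q_a), w_q_b).reshape([tokens, num_heads, qk_nope + qk_rope])
compressed_kv, key_pe = paddle.matmul(x, w_kv_a).split([kv_lora_rank, qk_rope], axis=-1)
key_value = paddle.matmul(compressed_kv, w_kv_b).reshape([tokens, num_heads, qk_nope + v_head])
key_nope, value = key_value.split([qk_nope, v_head], axis=-1)    # per-head no-rope key and value
# key_pe is the shared rotary slice; it is broadcast across heads after RoPE in the real kernel.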
self.kv_a_layernorm_weights = [] + self.kv_b_proj_weights = [] + for i in range(self.num_layers): - qkv_weight_attr = self.get_attr(self.config.qkv_weight_attrs, i) linear_weight_attr = self.get_attr(self.config.linear_weight_attrs, i) gate_weight_attr = self.get_attr(self.config.gate_weight_attrs, i) ffn1_weight_attr = self.get_attr(self.config.ffn1_weight_attrs, i) ffn2_weight_attr = self.get_attr(self.config.ffn2_weight_attrs, i) - qkv_weight = self.create_parameter( - shape=self.qkv_weight_shape, - attr=qkv_weight_attr, - dtype=self.create_params_type, - is_bias=False, - ) + qkv_weight = None + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + q_proj_weight_attr = self.get_attr(self.config.mla_config.q_proj_weight_attrs, i) + q_proj_weight = self.create_parameter( + shape=self.q_proj_weight_shape, + attr=q_proj_weight_attr, + dtype=self.create_params_type, + is_bias=False, + ) + else: + q_a_proj_weight_attr = self.get_attr(self.config.mla_config.q_a_proj_weight_attrs, i) + q_a_layernorm_weight_attr = self.get_attr(self.config.mla_config.q_a_layernorm_weight_attrs, i) + q_b_proj_weight_attr = self.get_attr(self.config.mla_config.q_b_proj_weight_attrs, i) + q_a_proj_weight = self.create_parameter( + shape=self.q_a_proj_weight_shape, + attr=q_a_proj_weight_attr, + dtype=self.create_params_type, + is_bias=False, + ) + q_a_layernorm_weight = self.create_parameter( + shape=[self.config.mla_config.q_lora_rank], + attr=q_a_layernorm_weight_attr, + dtype=self._norm_weight_dtype, + is_bias=False, + ) + q_b_proj_weight = self.create_parameter( + shape=self.q_b_proj_weight_shape, + attr=q_b_proj_weight_attr, + dtype=self.create_params_type, + is_bias=False, + ) + + kv_a_proj_with_mqa_weight_attr = self.get_attr( + self.config.mla_config.kv_a_proj_with_mqa_weight_attrs, i + ) + kv_a_layernorm_weight_attr = self.get_attr(self.config.mla_config.kv_a_layernorm_weight_attrs, i) + kv_b_proj_weight_attr = self.get_attr(self.config.mla_config.kv_b_proj_weight_attrs, i) + + kv_a_proj_with_mqa_weight = self.create_parameter( + shape=self.kv_a_proj_with_mqa_weight_shape, + attr=kv_a_proj_with_mqa_weight_attr, + dtype=self.create_params_type, + is_bias=False, + ) + kv_a_layernorm_weight = self.create_parameter( + shape=[self.config.mla_config.kv_lora_rank], + attr=kv_a_layernorm_weight_attr, + dtype=self._norm_weight_dtype, + is_bias=False, + ) + kv_b_proj_weight = self.create_parameter( + shape=self.kv_b_proj_weight_shape, + attr=kv_b_proj_weight_attr, + dtype=self.create_params_type, + is_bias=False, + ) + else: + qkv_weight_attr = self.get_attr(self.config.qkv_weight_attrs, i) + qkv_weight = self.create_parameter( + shape=self.qkv_weight_shape, + attr=qkv_weight_attr, + dtype=self.create_params_type, + is_bias=False, + ) + linear_weight = self.create_parameter( shape=self.linear_weight_shape, attr=linear_weight_attr, @@ -619,7 +731,6 @@ def init_weight(self): ) gate_weight = None - if self.config.moe_config.use_moe(i): gate_weight = self.create_parameter( shape=[self.config.embed_dim, self.config.moe_config.num_experts], @@ -636,6 +747,12 @@ def init_weight(self): dtype=self.create_params_type, is_bias=False, ) + ffn2_weight = self.create_parameter( + shape=self.moe_ffn2_weight_shape, + attr=ffn2_weight_attr, + dtype=self.create_params_type, + is_bias=False, + ) else: ffn1_weight = self.create_parameter( shape=self.ffn1_weight_shape, @@ -643,20 +760,44 @@ def init_weight(self): dtype=self.create_params_type, is_bias=False, ) - if self.config.moe_config.use_moe(i): 
ffn2_weight = self.create_parameter( - shape=self.moe_ffn2_weight_shape, + shape=self.ffn2_weight_shape, attr=ffn2_weight_attr, dtype=self.create_params_type, is_bias=False, ) - else: - ffn2_weight = self.create_parameter( - shape=self.ffn2_weight_shape, - attr=ffn2_weight_attr, + + shared_expert_ffn1_weight = None + shared_expert_ffn2_weight = None + shared_expert_gate_weight = None + if self.config.moe_config.use_shared_expert(i): + if self.config.moe_config.shared_expert_with_gate: + shared_expert_gate_weight_attr = self.get_attr( + self.config.moe_config.shared_expert_gate_weight_attrs, i + ) + shared_expert_ffn1_weight_attr = self.get_attr( + self.config.moe_config.shared_expert_ffn1_weight_attrs, i + ) + shared_expert_ffn2_weight_attr = self.get_attr( + self.config.moe_config.shared_expert_ffn2_weight_attrs, i + ) + + shared_expert_ffn1_weight = self.create_parameter( + shape=self.shared_expert_ffn1_weight_shape, + attr=shared_expert_ffn1_weight_attr, dtype=self.create_params_type, - is_bias=False, ) + shared_expert_ffn2_weight = self.create_parameter( + shape=self.shared_expert_ffn2_weight_shape, + attr=shared_expert_ffn2_weight_attr, + dtype=self.create_params_type, + ) + if self.config.moe_config.shared_expert_with_gate: + shared_expert_gate_weight = self.create_parameter( + shape=self.shared_expert_gate_weight_shape, + attr=shared_expert_gate_weight_attr, + dtype=self._helper.get_default_dtype(), + ) # tensor model parallel if self.config.nranks > 1: @@ -667,16 +808,54 @@ def init_weight(self): _set_var_distributed(linear_weight) _set_var_distributed(ffn2_weight) - self.qkv_weights.append(qkv_weight) + if self.config.moe_config.use_shared_expert(i): + _set_var_distributed(shared_expert_ffn1_weight) + _set_var_distributed(shared_expert_ffn2_weight) + + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + self.q_proj_weights.append(q_proj_weight) + else: + self.q_a_proj_weights.append(q_a_proj_weight) + self.q_a_layernorm_weights.append(q_a_layernorm_weight) + self.q_b_proj_weights.append(q_b_proj_weight) + self.kv_a_proj_with_mqa_weights.append(kv_a_proj_with_mqa_weight) + self.kv_a_layernorm_weights.append(kv_a_layernorm_weight) + self.kv_b_proj_weights.append(kv_b_proj_weight) + else: + self.qkv_weights.append(qkv_weight) + self.linear_weights.append(linear_weight) - if gate_weight is not None: - self.gate_weights.append(gate_weight) + self.gate_weights.append(gate_weight) self.ffn1_weights.append(ffn1_weight) self.ffn2_weights.append(ffn2_weight) - self._add_parameter(qkv_weight) + self.shared_expert_ffn1_weights.append(shared_expert_ffn1_weight) + self.shared_expert_ffn2_weights.append(shared_expert_ffn2_weight) + self.shared_expert_gate_weights.append(shared_expert_gate_weight) + + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + self._add_parameter(q_proj_weight) + else: + self._add_parameter(q_a_proj_weight) + self._add_parameter(q_a_layernorm_weight) + self._add_parameter(q_b_proj_weight) + self._add_parameter(kv_a_proj_with_mqa_weight) + self._add_parameter(kv_a_layernorm_weight) + self._add_parameter(kv_b_proj_weight) + else: + self._add_parameter(qkv_weight) + + if self.config.moe_config.use_shared_expert(i): + self._add_parameter(shared_expert_ffn1_weight) + self._add_parameter(shared_expert_ffn2_weight) + if self.config.moe_config.shared_expert_with_gate: + self._add_parameter(shared_expert_gate_weight) + self._add_parameter(linear_weight) + if gate_weight is not None: 
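# Illustrative sketch (not part of this diff): the routing `gate_weight` registered here is
# consumed by compute_fused_moe further down in this file. For the grouped routing methods added
# in this patch ('group_limited_greedy' / 'noaux_tc'), experts are first scored per group, only
# the best `topk_group` groups remain eligible, and the final top-k runs over the masked scores.
# A condensed schematic with made-up sizes:
import paddle

tokens, num_experts, num_groups, topk_group, top_k = 2, 16, 4, 2, 4
gating_output = paddle.randn([tokens, num_experts])              # tmp_out @ gate_weight in the real code
scores = paddle.nn.functional.sigmoid(gating_output)             # 'noaux_tc' uses sigmoid scores
group_scores = scores.reshape([tokens, num_groups, -1]).topk(2, axis=-1)[0].sum(axis=-1)
group_idx = paddle.topk(group_scores, k=topk_group, axis=-1, sorted=False)[1]
group_mask = paddle.put_along_axis(paddle.zeros_like(group_scores, dtype="int64"), group_idx, 1, axis=1)
score_mask = group_mask.unsqueeze(-1).expand([tokens, num_groups, num_experts // num_groups]).reshape([tokens, -1])
masked_scores = scores * score_mask.astype("float32")            # experts outside the chosen groups score zero
top_k_weights, top_k_indices = paddle.topk(masked_scores, k=top_k, axis=-1)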
self._add_parameter(gate_weight) self._add_parameter(ffn1_weight) @@ -697,27 +876,55 @@ def _add_parameter(self, param): self._parameters[param.name] = param def init_weight_shape(self, config): - self.qkv_weight_shape = ( - [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim] - if config.trans_qkvw - else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim] - ) + + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + self.q_proj_weight_shape = [ + self.config.embed_dim, + self.num_heads * (self.config.mla_config.qk_head_dim), + ] + else: + self.q_a_proj_weight_shape = [self.config.embed_dim, self.config.mla_config.q_lora_rank] + self.q_b_proj_weight_shape = [ + self.config.mla_config.q_lora_rank, + self.num_heads * (self.config.mla_config.qk_head_dim), + ] + + self.kv_a_proj_with_mqa_weight_shape = [ + self.config.embed_dim, + self.config.mla_config.kv_lora_rank + self.config.mla_config.qk_rope_head_dim, + ] + self.kv_b_proj_weight_shape = [ + self.config.mla_config.kv_lora_rank, + self.num_heads * (self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim), + ] + else: + self.qkv_weight_shape = ( + [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim] + if config.trans_qkvw + else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim] + ) + self.linear_weight_shape = [self.num_heads * self.head_dim, self.embed_dim] self.ffn1_weight_shape = ( - [self.embed_dim, self.dim_feedforward * 2] + [self.embed_dim, self.intermediate_size * 2] if self.activation.endswith("glu") - else [self.embed_dim, self.dim_feedforward] + else [self.embed_dim, self.intermediate_size] ) - self.ffn2_weight_shape = [self.dim_feedforward, self.embed_dim] + self.ffn2_weight_shape = [self.intermediate_size, self.embed_dim] - if self.config.moe_config.has_moe() is True: + if self.config.moe_config.has_moe(): self.moe_ffn1_weight_shape = ( - [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward * 2] + [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size * 2] if self.activation.endswith("glu") - else [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward] + else [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size] ) - self.moe_ffn2_weight_shape = [self.config.moe_config.num_experts, self.dim_feedforward, self.embed_dim] + self.moe_ffn2_weight_shape = [ + self.config.moe_config.num_experts, + self.config.moe_config.moe_intermediate_size, + self.embed_dim, + ] if self.config.moe_config.has_shared_expert(): self.shared_expert_ffn1_weight_shape = [ @@ -728,10 +935,11 @@ def init_weight_shape(self, config): self.config.moe_config.shared_expert_intermediate_size, self.embed_dim, ] - self.shared_expert_gate_weight_shape = [ - self.embed_dim, - 1, - ] + if self.config.moe_config.shared_expert_with_gate: + self.shared_expert_gate_weight_shape = [ + self.embed_dim, + 1, + ] def skip_quant(self, layer_name, layer_idx): return False @@ -740,7 +948,7 @@ def get_weight_create_dype(self): return self._dtype def compute_layernorm_before_qkv(self, src, i): - if i == 0: + if i == 0 and not self.config.speculate_config.speculate_method == "eagle": ln_out = self.norm_func(src, self.ln_scales[i], self.ln_biases[i], self._epsilon, begin_norm_axis=1)[0] else: ln_out = src @@ -748,14 +956,65 @@ def compute_layernorm_before_qkv(self, src, i): return ln_out def 
compute_qkv_linear(self, ln_out, i): - if paddle.version.cuda() == "False" or float(paddle.version.cuda()) < 11.6: + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is not None: + query = paddle.matmul(ln_out, self.q_a_proj_weights[i]) + query = self.norm_func( + x=query, + norm_weight=self.q_a_layernorm_weights[i], + norm_bias=None, + epsilon=self._epsilon, + begin_norm_axis=1, + )[0] + query = paddle.matmul(query, self.q_b_proj_weights[i]) + else: + query = paddle.matmul(ln_out, self.q_proj_weights[i]) + + query = query.reshape([-1, self.num_heads, self.config.mla_config.qk_head_dim]) + query_nope, query_pe = query.split( + [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.qk_rope_head_dim], axis=-1 + ) + + compressed_kv = paddle.matmul(ln_out, self.kv_a_proj_with_mqa_weights[i]) + compressed_kv, key_pe = compressed_kv.split( + [self.config.mla_config.kv_lora_rank, self.config.mla_config.qk_rope_head_dim], axis=-1 + ) + key_pe = key_pe.reshape([-1, 1, self.config.mla_config.qk_rope_head_dim]) + compressed_kv = self.norm_func( + x=compressed_kv, + norm_weight=self.kv_a_layernorm_weights[i], + norm_bias=None, + epsilon=self._epsilon, + begin_norm_axis=1, + )[0] + key_value = paddle.matmul(compressed_kv, self.kv_b_proj_weights[i]) + key_value = key_value.reshape( + [-1, self.num_heads, self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim] + ) + key_nope, value = key_value.split( + [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.v_head_dim], axis=-1 + ) + query_pe, key_pe = self.config.rotary_emb(self.position_ids, query_pe, key_pe) + + query[..., self.config.mla_config.qk_nope_head_dim :] = query_pe + key = paddle.empty_like(query) + key[..., : self.config.mla_config.qk_nope_head_dim] = key_nope + key[..., self.config.mla_config.qk_nope_head_dim :] = key_pe + + qkv_out = paddle.concat( + [ + query.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]), + key.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]), + value.reshape([-1, self.num_heads * self.config.mla_config.v_head_dim]), + ], + axis=-1, + ) + else: qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True) if self.qkv_biases[i] is not None: qkv_out = paddle.add(qkv_out, self.qkv_biases[i]) - return qkv_out - else: - # This method requires CUDA version >= 11.6. 
- return self.linear(ln_out, self.qkv_weights[i], self.qkv_biases[i], transpose_weight=True) + + return qkv_out def compute_qkv(self, src, residual_input, i): ln_out = self.compute_layernorm_before_qkv(src, i) @@ -819,7 +1078,7 @@ def compute_fmha( seq_lens, seq_lens + pre_caches_length, mask=attn_mask, - scale=float(self.head_dim**-0.5), + scale=self.softmax_scale, ) return transpose_remove_padding(qktv_out, seq_lens, padding_offset) @@ -892,19 +1151,113 @@ def compute_ffn_layernorm(self, out_linear_out, residual_input, i): return tmp_out, residual_input def compute_fused_moe(self, tmp_out, i): - fused_moe_out = fused_moe( - tmp_out, - self.gate_weights[i], - self.ffn1_weights[i], - self.ffn2_weights[i], - self.ffn1_biases[i], - None, - self.ffn2_biases[i], - None, - "None", - self.config.moe_config.top_k, - self.config.moe_config.norm_topk_prob, - ) + e_score_correction_bias = self.e_score_correction_biases[i] + + def get_moe_scores( + gating_output: paddle.Tensor, + config: MoeConfig, + ) -> tuple[paddle.Tensor, paddle.Tensor]: + + num_token = gating_output.shape[0] + num_expert_group = config.num_expert_group + topk_group = config.topk_group + + # Compute softmax or sigmoid scores based on the topk_method + if config.topk_method == "greedy": + scores = paddle.nn.functional.softmax(gating_output, axis=-1) + return scores, scores + elif config.topk_method == "group_limited_greedy": + scores = paddle.nn.functional.softmax(gating_output, axis=-1) + scores_no_bias = scores + group_scores = scores.reshape([num_token, num_expert_group, -1]).max(axis=-1) # [n, num_expert_group] + elif config.topk_method == "noaux_tc": + if e_score_correction_bias is None: + raise ValueError("e_score_correction_bias must be provided for 'noaux_tc' method.") + scores = paddle.nn.functional.sigmoid(gating_output) + # keep the original (bias-free) scores + scores_no_bias = scores + scores = scores + e_score_correction_bias.unsqueeze(0) + group_scores = ( + scores.reshape([num_token, num_expert_group, -1]).topk(2, axis=-1)[0].sum(axis=-1) + ) # [n, num_expert_group] + else: + raise ValueError( + f"Unsupported topk_method: {config.topk_method}. Please choose 'greedy', 'group_limited_greedy' or 'noaux_tc'."
+ ) + + # Identify top-k groups + group_idx = paddle.topk(group_scores, k=topk_group, axis=-1, sorted=False)[1] # [n, topk_group] + + group_mask = paddle.zeros_like(group_scores, dtype="int64") # [n, num_expert_group] + group_mask = paddle.put_along_axis(group_mask, group_idx, 1, axis=1) + + # Apply group mask to the scores + score_mask = ( + group_mask.unsqueeze(-1) + .expand([num_token, num_expert_group, scores.shape[-1] // num_expert_group]) + .reshape([num_token, -1]) + .astype("float32") + ) # [n, e] + + # Scale the scores with the mask and scaling factor + scores = scores * score_mask + + # renormalization and the routed scaling factor are applied later, inside moe_reduce + return scores, scores_no_bias + + if self.config.moe_config.topk_method is not None: + from paddle.incubate.nn.functional import moe_dispatch, moe_ffn, moe_reduce + + gate_out = paddle.matmul(tmp_out.cast("float32"), self.gate_weights[i]) + # scores after applying the selected routing strategy + scores, scores_no_bias = get_moe_scores(gate_out, self.config.moe_config) + # the top-k selection itself happens inside moe_dispatch + ( + permute_input, + token_nums_per_expert, + permute_indices_per_token, + top_k_weights, + top_k_indices, + ) = moe_dispatch(tmp_out, scores, self.config.moe_config.top_k, False, topk_only_mode=True) + + ffn_out = moe_ffn( + permute_input, + token_nums_per_expert, + self.ffn1_weights[i], + self.ffn2_weights[i], + self.ffn1_biases[i], + self.ffn1_weights_scale[i] if hasattr(self, "ffn1_weights_scale") else None, + self.ffn2_weights_scale[i] if hasattr(self, "ffn2_weights_scale") else None, + self.quant_type if hasattr(self, "quant_type") else "None", + ) + + if e_score_correction_bias is not None: + top_k_weights = scores_no_bias.take_along_axis(top_k_indices, axis=1) + + # moe_reduce normalizes the top-k weights and applies routed_scaling_factor + fused_moe_out = moe_reduce( + ffn_out, + top_k_weights, + permute_indices_per_token, + top_k_indices, + self.ffn2_biases[i], + norm_topk_prob=self.config.moe_config.norm_topk_prob, + routed_scaling_factor=self.config.moe_config.routed_scaling_factor, + ) + else: + fused_moe_out = fused_moe( + tmp_out, + self.gate_weights[i], + self.ffn1_weights[i], + self.ffn2_weights[i], + self.ffn1_biases[i], + self.ffn1_weights_scale[i] if hasattr(self, "ffn1_weights_scale") else None, + self.ffn2_biases[i], + self.ffn2_weights_scale[i] if hasattr(self, "ffn2_weights_scale") else None, + self.quant_type if hasattr(self, "quant_type") else "None", + self.config.moe_config.top_k, + self.config.moe_config.norm_topk_prob, + ) return fused_moe_out def compute_activation(self, ffn1_out, i): @@ -917,7 +1270,6 @@ def compute_ffn2(self, ffn1_out, i): return paddle.matmul(ffn1_out, self.ffn2_weights[i]) def compute_bias_residual_layernorm(self, ffn2_out, residual_input, i, num_layers): - if i != num_layers - 1: norm_out = self.norm_func( ffn2_out, @@ -945,13 +1297,24 @@ def compute_shared_expert(self, tmp_out, i): ffn1_out = paddle.matmul(tmp_out, self.shared_expert_ffn1_weights[i]) ffn1_out = fused_bias_act(ffn1_out, None, act_method=self.activation) ffn2_out = paddle.matmul(ffn1_out, self.shared_expert_ffn2_weights[i]) - gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i]) - gate_out = paddle.nn.functional.sigmoid(gate_out) - shared_expert_output = gate_out * ffn2_out - return shared_expert_output + if self.config.moe_config.shared_expert_with_gate: + gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i]) + gate_out = paddle.nn.functional.sigmoid(gate_out) + return gate_out * ffn2_out + return ffn2_out def pre_process(self, **kwargs): - pass + if
self.config.mla_config.use_mla(): + seq_lens_encoder = kwargs.get("seq_lens_encoder", None) + seq_lens_decoder = kwargs.get("seq_lens_decoder", None) + seq_lens_this_time = kwargs.get("seq_lens_this_time", None) + position_ids_shape = paddle.sum(seq_lens_this_time) + self.position_ids = paddle.zeros(shape=position_ids_shape, dtype=seq_lens_encoder.dtype) + + from paddlenlp_ops import get_position_ids + + # In-place operations that compute the position_ids. + get_position_ids(seq_lens_encoder, seq_lens_decoder, seq_lens_this_time, self.position_ids) def post_process(self, **kwargs): time_step = kwargs.get("time_step", None) @@ -1022,9 +1385,9 @@ def forward( kwargs["cum_offsets"] = cum_offsets if caches is not None: - assert len(caches) == len(self.qkv_weights) or len(caches) == 2 * len(self.qkv_weights) + assert len(caches) == len(self.linear_weights) or len(caches) == 2 * len(self.linear_weights) - assert self.num_layers == len(self.qkv_weights) + assert self.num_layers == len(self.linear_weights) max_enc_len_this_time, max_dec_len_this_time = self.compute_max_len( kwargs.get("seq_lens_encoder", None), kwargs.get("seq_lens_decoder", None), cum_offsets @@ -1051,6 +1414,7 @@ def forward( kwargs.get("seq_lens_encoder", None), kwargs.get("seq_lens_decoder", None), max_enc_len_this_time, + max_dec_len_this_time, kwargs.get("seq_lens_this_time", None), kwargs.get("cum_offsets", None), self.num_heads // self.kv_num_heads, @@ -1175,13 +1539,17 @@ def __init__(self, config: FusedMultiTransformerConfig): self.ffn1_weights_scale = [] self.ffn2_weights_scale = [] - if self.config.moe_config.has_shared_expert(): - self.shared_expert_ffn1_weights_scale = [] - self.shared_expert_ffn2_weights_scale = [] + self.q_proj_weights_scale = [] + self.q_a_proj_weights_scale = [] + self.q_b_proj_weights_scale = [] + self.kv_a_proj_with_mqa_weights_scale = [] + self.kv_b_proj_weights_scale = [] + + self.shared_expert_ffn1_weights_scale = [] + self.shared_expert_ffn2_weights_scale = [] for i in range(self.num_layers): - qkv_weight_scale_attr = self.get_attr(config.qkv_weight_scale_attrs, i) linear_weight_scale_attr = self.get_attr(config.linear_weight_scale_attrs, i) ffn1_weight_scale_attr = self.get_attr(config.ffn1_weight_scale_attrs, i) ffn2_weight_scale_attr = self.get_attr(config.ffn2_weight_scale_attrs, i) @@ -1194,12 +1562,59 @@ def __init__(self, config: FusedMultiTransformerConfig): config.moe_config.shared_expert_ffn2_weight_scale_attrs, i ) - qkv_weight_scale = self.create_parameter( - shape=[(self.num_heads + 2 * self.kv_num_heads) * self.head_dim], - attr=qkv_weight_scale_attr, - dtype=self.weight_scale_dtype, - is_bias=False, - ) + qkv_weight_scale = None + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + q_proj_weight_scale_attr = self.get_attr(self.config.mla_config.q_proj_weight_scale_attrs, i) + q_proj_weight_scale = self.create_parameter( + shape=[self.num_heads * (self.config.mla_config.qk_head_dim)], + attr=q_proj_weight_scale_attr, + dtype=self.weight_scale_dtype, + is_bias=False, + ) + else: + q_a_proj_weight_scale_attr = self.get_attr(self.config.mla_config.q_a_proj_weight_scale_attrs, i) + q_b_proj_weight_scale_attr = self.get_attr(self.config.mla_config.q_b_proj_weight_scale_attrs, i) + q_a_proj_weight_scale = self.create_parameter( + shape=[self.config.mla_config.q_lora_rank], + attr=q_a_proj_weight_scale_attr, + dtype=self.weight_scale_dtype, + is_bias=False, + ) + q_b_proj_weight_scale = self.create_parameter( + shape=[self.num_heads * 
(self.config.mla_config.qk_head_dim)], + attr=q_b_proj_weight_scale_attr, + dtype=self.weight_scale_dtype, + is_bias=False, + ) + + kv_a_proj_with_mqa_weight_scale_attr = self.get_attr( + self.config.mla_config.kv_a_proj_with_mqa_weight_scale_attrs, i + ) + kv_b_proj_weight_scale_attr = self.get_attr(self.config.mla_config.kv_b_proj_weight_scale_attrs, i) + + kv_a_proj_with_mqa_weight_scale = self.create_parameter( + shape=[self.config.mla_config.kv_lora_rank + self.config.mla_config.qk_rope_head_dim], + attr=kv_a_proj_with_mqa_weight_scale_attr, + dtype=self.weight_scale_dtype, + is_bias=False, + ) + kv_b_proj_weight_scale = self.create_parameter( + shape=[ + self.num_heads * (self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim) + ], + attr=kv_b_proj_weight_scale_attr, + dtype=self.weight_scale_dtype, + is_bias=False, + ) + else: + qkv_weight_scale_attr = self.get_attr(config.qkv_weight_scale_attrs, i) + qkv_weight_scale = self.create_parameter( + shape=[(self.num_heads + 2 * self.kv_num_heads) * self.head_dim], + attr=qkv_weight_scale_attr, + dtype=self.weight_scale_dtype, + is_bias=False, + ) linear_weight_scale = self.create_parameter( shape=[self.embed_dim], @@ -1210,16 +1625,18 @@ def __init__(self, config: FusedMultiTransformerConfig): if self.config.moe_config.use_moe(i): ffn1_weight_scale = self.create_parameter( - shape=[self.config.moe_config.num_experts, self.dim_feedforward * 2] + shape=[self.config.moe_config.num_experts, self.config.moe_config.moe_intermediate_size * 2] if config.activation.endswith("glu") - else [self.config.moe_config.num_experts, self.dim_feedforward], + else [self.config.moe_config.num_experts, self.config.moe_config.moe_intermediate_size], attr=ffn1_weight_scale_attr, dtype=self.weight_scale_dtype, is_bias=False, ) else: ffn1_weight_scale = self.create_parameter( - shape=[self.dim_feedforward * 2] if config.activation.endswith("glu") else [self.dim_feedforward], + shape=[self.intermediate_size * 2] + if config.activation.endswith("glu") + else [self.intermediate_size], attr=ffn1_weight_scale_attr, dtype=self.weight_scale_dtype, is_bias=False, @@ -1240,6 +1657,8 @@ def __init__(self, config: FusedMultiTransformerConfig): is_bias=False, ) + shared_expert_ffn1_weight_scale = None + shared_expert_ffn2_weight_scale = None if self.config.moe_config.use_shared_expert(i): shared_expert_ffn1_weight_scale = self.create_parameter( shape=[self.config.moe_config.shared_expert_intermediate_size * 2], @@ -1254,16 +1673,35 @@ def __init__(self, config: FusedMultiTransformerConfig): is_bias=False, ) - self.qkv_weights_scale.append(qkv_weight_scale) + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + self.q_proj_weights_scale.append(q_proj_weight_scale) + else: + self.q_a_proj_weights_scale.append(q_a_proj_weight_scale) + self.q_b_proj_weights_scale.append(q_b_proj_weight_scale) + self.kv_a_proj_with_mqa_weights_scale.append(kv_a_proj_with_mqa_weight_scale) + self.kv_b_proj_weights_scale.append(kv_b_proj_weight_scale) + else: + self.qkv_weights_scale.append(qkv_weight_scale) + self.linear_weights_scale.append(linear_weight_scale) self.ffn1_weights_scale.append(ffn1_weight_scale) self.ffn2_weights_scale.append(ffn2_weight_scale) - if self.config.moe_config.use_shared_expert(i): - self.shared_expert_ffn1_weights_scale.append(shared_expert_ffn1_weight_scale) - self.shared_expert_ffn2_weights_scale.append(shared_expert_ffn2_weight_scale) + 
self.shared_expert_ffn1_weights_scale.append(shared_expert_ffn1_weight_scale) + self.shared_expert_ffn2_weights_scale.append(shared_expert_ffn2_weight_scale) + + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + self._add_parameter(q_proj_weight_scale) + else: + self._add_parameter(q_a_proj_weight_scale) + self._add_parameter(q_b_proj_weight_scale) + self._add_parameter(kv_a_proj_with_mqa_weight_scale) + self._add_parameter(kv_b_proj_weight_scale) + else: + self._add_parameter(qkv_weight_scale) - self._add_parameter(qkv_weight_scale) self._add_parameter(linear_weight_scale) self._add_parameter(ffn1_weight_scale) self._add_parameter(ffn2_weight_scale) @@ -1278,27 +1716,68 @@ def get_weight_create_dype(self): def init_weight_shape(self, config): super().init_weight_shape(config) + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + self.q_proj_weight_shape = [ + self.num_heads * (self.config.mla_config.qk_head_dim), + self.config.embed_dim, + ] + else: + self.q_a_proj_weight_shape = [self.config.mla_config.q_lora_rank, self.config.embed_dim] + self.q_b_proj_weight_shape = [ + self.num_heads * (self.config.mla_config.qk_head_dim), + self.config.mla_config.q_lora_rank, + ] + + self.kv_a_proj_with_mqa_weight_shape = [ + self.config.mla_config.kv_lora_rank + self.config.mla_config.qk_rope_head_dim, + self.config.embed_dim, + ] + self.kv_b_proj_weight_shape = [ + self.num_heads * (self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim), + self.config.mla_config.kv_lora_rank, + ] + else: + self.qkv_weight_shape = ( + [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim] + if config.trans_qkvw + else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim] + ) + self.linear_weight_shape = [self.embed_dim, self.num_heads * self.head_dim] self.ffn1_weight_shape = ( - [self.dim_feedforward * 2, self.embed_dim] + [self.intermediate_size * 2, self.embed_dim] if self.activation.endswith("glu") - else [self.dim_feedforward, self.embed_dim] + else [self.intermediate_size, self.embed_dim] ) - self.ffn2_weight_shape = [self.embed_dim, self.dim_feedforward] + self.ffn2_weight_shape = [self.embed_dim, self.intermediate_size] if config.quant_type == "weight_only_int4": - self.qkv_weight_shape[0] //= 2 + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is None: + self.q_proj_weight_shape[0] //= 2 + else: + self.q_a_proj_weight_shape[0] //= 2 + self.q_b_proj_weight_shape[0] //= 2 + self.kv_a_proj_with_mqa_weight_shape[0] //= 2 + self.kv_b_proj_weight_shape[0] //= 2 + else: + self.qkv_weight_shape[0] //= 2 self.linear_weight_shape[0] //= 2 self.ffn1_weight_shape[0] //= 2 self.ffn2_weight_shape[0] //= 2 - if self.config.moe_config.has_moe() is True: + if self.config.moe_config.has_moe(): self.moe_ffn1_weight_shape = ( - [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward * 2] + [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size * 2] if self.activation.endswith("glu") - else [self.config.moe_config.num_experts, self.embed_dim, self.dim_feedforward] + else [self.config.moe_config.num_experts, self.embed_dim, self.config.moe_config.moe_intermediate_size] ) - self.moe_ffn2_weight_shape = [self.config.moe_config.num_experts, self.dim_feedforward, self.embed_dim] + self.moe_ffn2_weight_shape = [ + self.config.moe_config.num_experts, + self.config.moe_config.moe_intermediate_size, + 
self.embed_dim, + ] if config.quant_type == "weight_only_int4": if config.moe_config.has_shared_expert(): @@ -1317,22 +1796,105 @@ def init_weight_shape(self, config): self.embed_dim, self.config.moe_config.shared_expert_intermediate_size, ] - self.shared_expert_gate_weight_shape = [ - self.embed_dim, - 1, - ] + if self.config.moe_config.shared_expert_with_gate: + self.shared_expert_gate_weight_shape = [ + self.embed_dim, + 1, + ] if config.quant_type == "weight_only_int4": self.shared_expert_ffn1_weight_shape[0] //= 2 self.shared_expert_ffn2_weight_shape[0] //= 2 def compute_qkv_linear(self, ln_out, i): - return weight_only_linear( - ln_out, - weight=self.qkv_weights[i], - bias=self.qkv_biases[i], - weight_scale=self.qkv_weights_scale[i], - weight_dtype=self.weight_dtype, - ) + if self.config.mla_config.use_mla(): + if self.config.mla_config.q_lora_rank is not None: + query = weight_only_linear( + ln_out, + weight=self.q_a_proj_weights[i], + weight_scale=self.q_a_proj_weights_scale[i], + weight_dtype=self.weight_dtype, + ) + query = self.norm_func( + x=query, + norm_weight=self.q_a_layernorm_weights[i], + norm_bias=None, + epsilon=self._epsilon, + begin_norm_axis=1, + )[0] + query = weight_only_linear( + query, + weight=self.q_b_proj_weights[i], + weight_scale=self.q_b_proj_weights_scale[i], + weight_dtype=self.weight_dtype, + ) + else: + query = weight_only_linear( + ln_out, + weight=self.q_proj_weights[i], + weight_scale=self.q_proj_weights_scale[i], + weight_dtype=self.weight_dtype, + ) + + query = query.reshape([-1, self.num_heads, self.config.mla_config.qk_head_dim]) + query_nope, query_pe = query.split( + [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.qk_rope_head_dim], axis=-1 + ) + + compressed_kv = weight_only_linear( + ln_out, + weight=self.kv_a_proj_with_mqa_weights[i], + weight_scale=self.kv_a_proj_with_mqa_weights_scale[i], + weight_dtype=self.weight_dtype, + ) + compressed_kv, key_pe = compressed_kv.split( + [self.config.mla_config.kv_lora_rank, self.config.mla_config.qk_rope_head_dim], axis=-1 + ) + key_pe = key_pe.reshape([-1, 1, self.config.mla_config.qk_rope_head_dim]) + compressed_kv = self.norm_func( + x=compressed_kv, + norm_weight=self.kv_a_layernorm_weights[i], + norm_bias=None, + epsilon=self._epsilon, + begin_norm_axis=1, + )[0] + key_value = weight_only_linear( + compressed_kv, + weight=self.kv_b_proj_weights[i], + weight_scale=self.kv_b_proj_weights_scale[i], + weight_dtype=self.weight_dtype, + ) + key_value = key_value.reshape( + [-1, self.num_heads, self.config.mla_config.qk_nope_head_dim + self.config.mla_config.v_head_dim] + ) + key_nope, value = key_value.split( + [self.config.mla_config.qk_nope_head_dim, self.config.mla_config.v_head_dim], axis=-1 + ) + + query_pe, key_pe = self.config.rotary_emb(self.position_ids, query_pe, key_pe) + + query[..., self.config.mla_config.qk_nope_head_dim :] = query_pe + key = paddle.empty_like(query) + key[..., : self.config.mla_config.qk_nope_head_dim] = key_nope + key[..., self.config.mla_config.qk_nope_head_dim :] = key_pe + + qkv_out = paddle.concat( + [ + query.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]), + key.reshape([-1, self.num_heads * self.config.mla_config.qk_head_dim]), + value.reshape([-1, self.num_heads * self.config.mla_config.v_head_dim]), + ], + axis=-1, + ) + else: + qkv_out = weight_only_linear( + ln_out, + weight=self.qkv_weights[i], + bias=self.qkv_biases[i], + weight_scale=self.qkv_weights_scale[i], + weight_dtype=self.weight_dtype, + ) + + return qkv_out def 
compute_out_linear(self, fmha_out, i): return weight_only_linear( @@ -1342,22 +1904,6 @@ def compute_out_linear(self, fmha_out, i): weight_dtype=self.weight_dtype, ) - def compute_fused_moe(self, tmp_out, i): - fused_moe_out = fused_moe( - tmp_out, - self.gate_weights[i], - self.ffn1_weights[i], - self.ffn2_weights[i], - self.ffn1_biases[i], - self.ffn1_weights_scale[i], - self.ffn2_biases[i], - self.ffn2_weights_scale[i], - self.quant_type, - self.config.moe_config.top_k, - self.config.moe_config.norm_topk_prob, - ) - return fused_moe_out - def compute_ffn1(self, tmp_out, i): return weight_only_linear( tmp_out, @@ -1381,21 +1927,18 @@ def compute_shared_expert(self, tmp_out, i): weight_scale=self.shared_expert_ffn1_weights_scale[i], weight_dtype=self.weight_dtype, ) - ffn1_out = fused_bias_act(ffn1_out, None, act_method=self.activation) - ffn2_out = weight_only_linear( ffn1_out, weight=self.shared_expert_ffn2_weights[i], weight_scale=self.shared_expert_ffn2_weights_scale[i], weight_dtype=self.weight_dtype, ) - - gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i]) - gate_out = paddle.nn.functional.sigmoid(gate_out) - - shared_expert_output = gate_out * ffn2_out - return shared_expert_output + if self.config.moe_config.shared_expert_with_gate: + gate_out = paddle.matmul(tmp_out, self.shared_expert_gate_weights[i]) + gate_out = paddle.nn.functional.sigmoid(gate_out) + return gate_out * ffn2_out + return ffn2_out class FusedMultiTransformerWeightOnlyPostLayernorm( @@ -1413,8 +1956,8 @@ def __init__(self, config: FusedMultiTransformerConfig): config.embed_dim ) assert config.num_heads > 0, "Expected nhead to be greater than 0, " "but received {}".format(config.num_heads) - assert config.dim_feedforward > 0, "Expected dim_feedforward to be greater than 0, but received {}".format( - config.dim_feedforward + assert config.intermediate_size > 0, "Expected intermediate_size to be greater than 0, but received {}".format( + config.intermediate_size ) self._dtype = "float32" self._epsilon = config.epsilon @@ -1430,9 +1973,9 @@ def __init__(self, config: FusedMultiTransformerConfig): assert self.head_dim * config.num_heads == config.embed_dim, "embed_dim must be divisible by num_heads" assert config.num_heads % config.nranks == 0 - assert config.dim_feedforward % config.nranks == 0 + assert config.intermediate_size % config.nranks == 0 - dim_feedforward = config.dim_feedforward + intermediate_size = config.intermediate_size self.num_heads = config.num_heads self.cache_dtype = self.config.avx_config.cache_dtype self.kv_num_heads = config.kv_num_heads @@ -1444,7 +1987,7 @@ def __init__(self, config: FusedMultiTransformerConfig): self.weight_dtype = self._dtype self.create_params_type = self._dtype self.activation = config.activation - self.intermediate_size = dim_feedforward + self.intermediate_size = intermediate_size self.max_positions = self.config.avx_config.max_position_embeddings self.max_pos_embed = self.config.avx_config.max_position_embeddings self.hiddensize = self.num_heads * self.head_dim @@ -1546,7 +2089,7 @@ def __init__(self, config: FusedMultiTransformerConfig): gate_bias = None if gate_bias_attr: gate_bias = self.create_parameter( - shape=[config.dim_feedforward], + shape=[config.intermediate_size], attr=gate_bias_attr, dtype=self._dtype, is_bias=True, @@ -1560,7 +2103,7 @@ def __init__(self, config: FusedMultiTransformerConfig): up_bias = None if up_bias_attr: up_bias = self.create_parameter( - shape=[config.dim_feedforward], + shape=[config.intermediate_size], 
attr=up_bias_attr, dtype=self._dtype, is_bias=True, @@ -1719,7 +2262,7 @@ def __init__(self, config: FusedMultiTransformerConfig): default_initializer=paddle.nn.initializer.Constant(-1), ) ffn1_out_scale = self.create_parameter( - shape=[self.dim_feedforward * 2] if self.activation.endswith("glu") else [self.dim_feedforward], + shape=[self.intermediate_size * 2] if self.activation.endswith("glu") else [self.intermediate_size], attr=ffn1_out_scale_attr, dtype="float32", is_bias=False, @@ -1748,13 +2291,13 @@ def __init__(self, config: FusedMultiTransformerConfig): ffn2_shift = None if ffn2_shift_attr: ffn2_shift = self.create_parameter( - shape=[self.dim_feedforward], attr=ffn2_shift_attr, dtype=self._dtype, is_bias=False + shape=[self.intermediate_size], attr=ffn2_shift_attr, dtype=self._dtype, is_bias=False ) ffn2_smooth = None if ffn2_smooth_attr: ffn2_smooth = self.create_parameter( - shape=[self.dim_feedforward], attr=ffn2_smooth_attr, dtype=self._dtype, is_bias=False + shape=[self.intermediate_size], attr=ffn2_smooth_attr, dtype=self._dtype, is_bias=False ) self.qkv_out_scales.append(qkv_out_scale) @@ -1857,8 +2400,7 @@ def init_weight(self): self.qkv_weights.append(qkv_weight) self.linear_weights.append(linear_weight) - if gate_weight is not None: - self.gate_weights.append(gate_weight) + self.gate_weights.append(gate_weight) self.ffn1_weights.append(ffn1_weight) self.ffn2_weights.append(ffn2_weight) @@ -1894,11 +2436,11 @@ def init_weight_shape(self, config): if not paddle.is_compiled_with_rocm(): self.linear_weight_shape = [self.embed_dim, self.num_heads * self.head_dim] self.ffn1_weight_shape = ( - [self.dim_feedforward * 2, self.embed_dim] + [self.intermediate_size * 2, self.embed_dim] if self.activation.endswith("glu") - else [self.dim_feedforward, self.embed_dim] + else [self.intermediate_size, self.embed_dim] ) - self.ffn2_weight_shape = [self.embed_dim, self.dim_feedforward] + self.ffn2_weight_shape = [self.embed_dim, self.intermediate_size] def compute_layernorm_before_qkv(self, src, i): if i == 0: @@ -1919,11 +2461,14 @@ def compute_layernorm_before_qkv(self, src, i): return ln_out def compute_qkv_linear(self, ln_out, i): - if paddle.is_compiled_with_rocm(): - qkv_out = paddle.matmul(ln_out, self.qkv_weights[i]) + if self.config.mla_config.use_mla(): + raise NotImplementedError("Not support MLA yet.") else: - qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True) - return qkv_out + if paddle.is_compiled_with_rocm(): + qkv_out = paddle.matmul(ln_out, self.qkv_weights[i]) + else: + qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True) + return qkv_out def compute_fmha( self, @@ -1980,7 +2525,7 @@ def compute_fmha( seq_lens, seq_lens + pre_caches_length, mask=attn_mask, - scale=float(self.head_dim**-0.5), + scale=self.softmax_scale, ) fmha_out = transpose_remove_padding(qktv_out, seq_lens, padding_offset) @@ -2188,8 +2733,9 @@ def compute_attn( "none", # cache_quant_type self.use_neox_rotary_style, kwargs.get("max_input_length", -1), - 0.0, - 0.0, + self.softmax_scale, # softmax_scale + 0.0, # quant_max_bound + 0.0, # quant_min_bound 0.0, # out_linear_in_scale self.config.speculate_config.speculate_max_draft_token_num, True, # causal @@ -2278,6 +2824,9 @@ def compute_attn( rope_theta=self.config.rope_theta, )[0] + if self.config.mla_config.use_mla(): + fmha_out = fmha_out.reshape([-1, self.num_heads * self.config.mla_config.v_head_dim]) + out_linear_out = self.compute_out_linear(fmha_out, i) return out_linear_out @@ -2289,16 +2838,19 @@ def 
post_process(self, **kwargs): seq_lens_decoder = kwargs.get("seq_lens_decoder", None) max_input_length = kwargs.get("max_input_length", -1) output_padding_offset = kwargs.get("output_padding_offset", None) # only used in speculative decoding - out = rebuild_padding_v2( - multi_block_output, - cum_offsets, - seq_lens_decoder, - seq_lens_encoder, - output_padding_offset, - max_input_length, - ) - return out + if self.config.speculate_config.return_full_hidden_states: + return multi_block_output + else: + out = rebuild_padding_v2( + multi_block_output, + cum_offsets, + seq_lens_decoder, + seq_lens_encoder, + output_padding_offset, + max_input_length, + ) + return out class FusedBlockMultiTransformerWeightOnly(FusedBlockMultiTransformer, FusedMultiTransformerWeightOnly): @@ -2382,6 +2934,7 @@ def compute_attn( cache_quant_type_str, self.use_neox_rotary_style, kwargs.get("max_input_length", -1), + self.softmax_scale, self.quant_max_bound, self.quant_min_bound, self.act_scales["out_linear_in_scale"][i], @@ -2468,7 +3021,7 @@ def __init__(self, config: FusedMultiTransformerConfig): ffn1_0_bias = None if ffn1_0_bias_attr: ffn1_0_bias = self.create_parameter( - shape=[self.dim_feedforward], + shape=[self.intermediate_size], attr=ffn1_0_bias_attr, dtype=self._dtype, is_bias=True, @@ -2477,7 +3030,7 @@ def __init__(self, config: FusedMultiTransformerConfig): ffn1_1_bias = None if ffn1_1_bias_attr: ffn1_1_bias = self.create_parameter( - shape=[self.dim_feedforward], + shape=[self.intermediate_size], attr=ffn1_1_bias_attr, dtype=self._dtype, is_bias=True, @@ -2575,9 +3128,9 @@ def init_weight_shape(self, config): else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim] ) self.linear_weight_shape = [self.num_heads * self.head_dim, self.embed_dim] - self.ffn1_0_weight_shape = [self.dim_feedforward, self.embed_dim] - self.ffn1_1_weight_shape = [self.dim_feedforward, self.embed_dim] - self.ffn2_weight_shape = [self.embed_dim, self.dim_feedforward] + self.ffn1_0_weight_shape = [self.intermediate_size, self.embed_dim] + self.ffn1_1_weight_shape = [self.intermediate_size, self.embed_dim] + self.ffn2_weight_shape = [self.embed_dim, self.intermediate_size] def get_weight_create_dype(self, layer_name=None, layer_idx=None): """ @@ -2610,25 +3163,25 @@ def compute_layernorm_before_qkv(self, src, i): return ln_out def compute_qkv_linear(self, ln_out, i): - """ - For fake parameter - """ - if paddle.is_compiled_with_rocm() or float(paddle.version.cuda()) < 11.6: - qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True) - if self.qkv_biases[i] is not None: - qkv_out = paddle.add(qkv_out, self.qkv_biases[i]) - return qkv_out + if self.config.mla_config.use_mla(): + raise NotImplementedError("Not support MLA yet.") else: - qkv_out = fp8_gemm_fused( - ln_out, - self.qkv_weights[i], - transpose_x=False, - transpose_y=True, - bias=self.qkv_biases[i], - scale=self.weight_scales["qkv_weight_scale"][i] / (self.act_scales["qkv_in_scale"][i] * 448 * 448), - output_dtype=self._dtype, - act="identity", - ) + if paddle.is_compiled_with_rocm() or float(paddle.version.cuda()) < 11.6: + qkv_out = paddle.matmul(ln_out, self.qkv_weights[i], False, True) + if self.qkv_biases[i] is not None: + qkv_out = paddle.add(qkv_out, self.qkv_biases[i]) + return qkv_out + else: + qkv_out = fp8_gemm_fused( + ln_out, + self.qkv_weights[i], + transpose_x=False, + transpose_y=True, + bias=self.qkv_biases[i], + scale=self.weight_scales["qkv_weight_scale"][i] / (self.act_scales["qkv_in_scale"][i] * 448 * 448), + 
output_dtype=self._dtype, + act="identity", + ) return qkv_out @@ -2741,6 +3294,7 @@ def compute_attn( cache_quant_type_str, self.use_neox_rotary_style, kwargs.get("max_input_length", -1), + self.softmax_scale, self.quant_max_bound, self.quant_min_bound, self.act_scales["out_linear_in_scale"][i], diff --git a/paddlenlp/experimental/transformers/generation_utils.py b/paddlenlp/experimental/transformers/generation_utils.py index e224f7b2a1c9..e2a75b911d4f 100644 --- a/paddlenlp/experimental/transformers/generation_utils.py +++ b/paddlenlp/experimental/transformers/generation_utils.py @@ -110,7 +110,7 @@ def to_static(self, output_path: str, config: dict): input_spec[16] = paddle.static.InputSpec(shape=[None, 2, 1], dtype="int64", name="tgt_pos") # tgt_pos elif self.config["model_type"] and "gpt" in self.config.model_type: input_spec[2] = paddle.static.InputSpec(shape=[None], dtype="int64", name="position_ids") # position_ids - model = paddle.jit.to_static(self.generate, input_spec=input_spec) + model = paddle.jit.to_static(self.generate, input_spec=input_spec, full_graph=True) paddle.jit.save( model, output_path, skip_prune_program=True ) # Note(Zhengzekang): If we prune program it may cause some inference error. @@ -539,7 +539,7 @@ def to_static(self, output_path: str, config: dict): ] input_spec.extend(speculate_spec) - model = paddle.jit.to_static(self.generate, input_spec=input_spec) + model = paddle.jit.to_static(self.generate, input_spec=input_spec, full_graph=True) paddle.jit.save( model, output_path, skip_prune_program=True ) # Note(Zhengzekang): If we prune program it may cause some inference error. @@ -646,7 +646,15 @@ def generate( model_kwargs["accept_num"] = accept_num model_kwargs["actual_draft_token_num"] = actual_draft_token_num - if self.config.decode_strategy == "speculate_decoding": + if self.config.decode_strategy == "draft_model_sample": + ret = self.draft_model_sample( + eos_token_id, + top_k=0, + top_p=top_p, + temperature=temperature, + **model_kwargs, + ) + elif self.config.decode_strategy == "speculate_decoding": ret = self.speculate_decoding( eos_token_id, top_k=0, @@ -727,23 +735,24 @@ def _post_process_( if self.config.tensor_parallel_degree > 1: paddle.distributed.broadcast(next_tokens, 0) - from paddlenlp_ops import update_inputs_v2 + with paddle.base.framework._stride_in_no_check_dy2st_diff(): + from paddlenlp_ops import update_inputs_v2 - update_inputs_v2( - model_kwargs["stop_flags"], - model_kwargs["step_idx"], - model_kwargs["not_need_stop"], - model_kwargs["seq_lens_this_time"], - model_kwargs["seq_lens_encoder"], - model_kwargs["seq_lens_decoder"], - model_kwargs["max_dec_len"], - model_kwargs["input_ids"], - model_kwargs["stop_nums"], - next_tokens, - model_kwargs["is_block_step"], - eos_token_id, - model_kwargs["next_tokens"], - ) + update_inputs_v2( + model_kwargs["stop_flags"], + model_kwargs["step_idx"], + model_kwargs["not_need_stop"], + model_kwargs["seq_lens_this_time"], + model_kwargs["seq_lens_encoder"], + model_kwargs["seq_lens_decoder"], + model_kwargs["max_dec_len"], + model_kwargs["input_ids"], + model_kwargs["stop_nums"], + next_tokens, + model_kwargs["is_block_step"], + eos_token_id, + model_kwargs["next_tokens"], + ) from paddlenlp_ops import save_output @@ -822,28 +831,26 @@ def _post_process_( probs = F.softmax(logits) from paddlenlp_ops import ( + speculate_clear_accept_nums, speculate_save_output, speculate_set_value_by_flags_and_idx, - speculate_verify_and_update, + speculate_update, + speculate_verify, top_p_candidates, ) 
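# The hunk below splits the fused speculate_verify_and_update op into separate
# speculate_verify / speculate_update steps so that the accepted tokens can be broadcast
# across tensor-parallel ranks in between. A rough, framework-free sketch of the accept rule
# behind "verify" (assumed names and a greedy top-1 rule; the real op checks top-p candidate
# sets within a verify window):
def verify_draft_tokens(draft_tokens, target_tokens):
    """target_tokens holds the target model's pick at every draft position plus one
    bonus position, i.e. len(target_tokens) == len(draft_tokens) + 1."""
    accepted = []
    for i, draft in enumerate(draft_tokens):
        if draft == target_tokens[i]:
            accepted.append(draft)             # draft token verified, keep it
        else:
            accepted.append(target_tokens[i])  # first mismatch: fall back to the target token
            return accepted
    accepted.append(target_tokens[-1])         # every draft accepted: take the bonus token
    return accepted
# e.g. verify_draft_tokens([5, 9, 2], [5, 9, 7, 3]) -> [5, 9, 7]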
verify_scores, verify_tokens, actual_candidate_len = top_p_candidates( probs, top_p, model_kwargs["output_padding_offset"], self.max_candidate_len, self.max_seq_len - ) # [token_num, max_candidate_len] + ) - # Speculate Verify And Update - speculate_verify_and_update( + speculate_verify( model_kwargs["accept_tokens"], model_kwargs["accept_num"], model_kwargs["step_idx"], + model_kwargs["stop_flags"], model_kwargs["seq_lens_encoder"], model_kwargs["seq_lens_decoder"], - model_kwargs["stop_flags"], - model_kwargs["not_need_stop"], - model_kwargs[ - "draft_tokens" - ], # Both input and output, need to write the last 1 token accepted to position 0. + model_kwargs["draft_tokens"], # 既是输入又是输出,需要把接收的最后1个token写入到第0个位置 model_kwargs["seq_lens_this_time"], verify_tokens, verify_scores, @@ -859,6 +866,25 @@ def _post_process_( True, # enable_topp ) + if self.config.tensor_parallel_degree > 1: + paddle.distributed.broadcast(model_kwargs["accept_tokens"], 0) + paddle.distributed.broadcast(model_kwargs["accept_num"], 0) + paddle.distributed.broadcast(model_kwargs["step_idx"], 0) + paddle.distributed.broadcast(model_kwargs["stop_flags"], 0) + + speculate_update( + model_kwargs["seq_lens_encoder"], + model_kwargs["seq_lens_decoder"], + model_kwargs["not_need_stop"], + model_kwargs["draft_tokens"], + model_kwargs["actual_draft_token_num"], + model_kwargs["accept_tokens"], + model_kwargs["accept_num"], + model_kwargs["stop_flags"], + model_kwargs["seq_lens_this_time"], + model_kwargs["is_block_step"], + ) + speculate_save_output( model_kwargs["accept_tokens"], model_kwargs["accept_num"], @@ -867,7 +893,7 @@ def _post_process_( ) # If seq_lens_decoder is 0 (means stop), accept_num should be set to 0 - model_kwargs["accept_num"][model_kwargs["seq_lens_decoder"] == 0] = 0 + speculate_clear_accept_nums(model_kwargs["accept_num"], model_kwargs["seq_lens_decoder"]) # Update pre_ids through accept tokens speculate_set_value_by_flags_and_idx( @@ -891,7 +917,38 @@ def _post_process_( # encoder outputs = _forward_(**model_kwargs) # [bs, 1, dim_embed] # first decoder - next_tokens = _post_process_( + _post_process_( + outputs[0] if isinstance(outputs, tuple) else outputs, + top_k, + top_p, + penalty_score, + frequency_score, + presence_score, + temperature, + model_kwargs, + ) + if self.return_full_hidden_states: + return outputs[1] + else: + return None + + def draft_model_sample( + self, + eos_token_id, + top_k, + top_p, + penalty_score, + frequency_score, + presence_score, + temperature=None, + min_tokens_to_keep=1, + **model_kwargs + ): + def _forward_(**args): + model_inputs = self.prepare_inputs_for_generation(**args) + return self(**model_inputs) + + def _post_process_( outputs, top_k, top_p, @@ -900,9 +957,49 @@ def _post_process_( presence_score, temperature, model_kwargs, + ): + logits = paddle.cast(outputs, paddle.float32) + + probs = F.softmax(logits) + + _, inter_next_tokens = paddle.tensor.top_p_sampling(probs, top_p, seed=-1) + + if self.config.tensor_parallel_degree > 1: + paddle.distributed.broadcast(inter_next_tokens, 0) + + from paddlenlp_ops import draft_model_update + + draft_model_update( + inter_next_tokens, + model_kwargs["draft_tokens"], + model_kwargs["pre_ids"], + model_kwargs["seq_lens_this_time"], + model_kwargs["seq_lens_encoder"], + model_kwargs["seq_lens_decoder"], + model_kwargs["step_idx"], + model_kwargs["output_cum_offsets"], + model_kwargs["stop_flags"], + model_kwargs["not_need_stop"], + model_kwargs["max_dec_len"], + eos_token_id, + model_kwargs["base_model_draft_tokens"], # Write 
generated tokens + self.max_seq_len, + model_kwargs["substep"], + ) + + output_padding_offset, output_cum_offsets = self.get_output_padding_offset( + model_kwargs["seq_lens_this_time"], model_kwargs["seq_lens_encoder"], model_kwargs["seq_lens_decoder"] ) + model_kwargs["output_padding_offset"] = output_padding_offset + model_kwargs["output_cum_offsets"] = output_cum_offsets - return next_tokens + outputs, eagle_hidden_states = _forward_(**model_kwargs) # [bs, 1, dim_embed] + # first decoder + _post_process_( + outputs, top_k, top_p, penalty_score, frequency_score, presence_score, temperature, model_kwargs + ) + + return eagle_hidden_states class GenerationAvxInferenceModel(GenerationMixin): @@ -937,7 +1034,7 @@ def to_static(self, output_path: str, config: dict): config.get("logits_processors", None), None, ] - model = paddle.jit.to_static(self.generate, input_spec=input_spec) + model = paddle.jit.to_static(self.generate, input_spec=input_spec, full_graph=True) paddle.jit.save( model, output_path, skip_prune_program=True ) # Note(Zhengzekang): If we prune program it may cause some inference error. diff --git a/paddlenlp/experimental/transformers/llama/modeling.py b/paddlenlp/experimental/transformers/llama/modeling.py index 59f47405910a..f35c0867ab3d 100644 --- a/paddlenlp/experimental/transformers/llama/modeling.py +++ b/paddlenlp/experimental/transformers/llama/modeling.py @@ -188,7 +188,7 @@ def __init__(self, config: LlamaConfig): embed_dim=self.hidden_size, num_heads=self.num_attention_heads, kv_num_heads=self.num_layers, - dim_feedforward=self.intermediate_size, + intermediate_size=self.intermediate_size, activation="silu", num_layers=self.num_layers, ln_scale_attrs=ln_scale_attrs, @@ -397,6 +397,7 @@ def __init__(self, config: LlamaConfig): self.epsilon = config.rms_norm_eps self.max_position_embeddings = config.max_position_embeddings self.quant_type = config.get("quant_type", "") + self.return_full_hidden_states = config.get("return_full_hidden_states", False) self.rope_theta = config.rope_theta self.use_neox = True @@ -609,12 +610,13 @@ def __init__(self, config: LlamaConfig): speculate_config = SpeculateConfig( speculate_method=config.get("speculate_method", None), speculate_max_draft_token_num=config.get("speculate_max_draft_token_num", 5), + return_full_hidden_states=config.get("return_full_hidden_states", False), ) transformer_config = FusedMultiTransformerConfig( embed_dim=self.hidden_size, num_heads=self.num_attention_heads, kv_num_heads=self.num_key_value_heads, - dim_feedforward=self.intermediate_size, + intermediate_size=self.intermediate_size, quant_type=self.quant_type, activation="swiglu", num_layers=config.num_hidden_layers, @@ -985,24 +987,26 @@ def set_quant_scale(self): self.transformer_block.cache_v_out_scales[i_layer].set_value(weight_scale) @paddle.no_grad() - def set_state_dict(self, state_dict): + def set_state_dict(self, state_dict, is_eagle=False): self.set_quant_scale() self.transformer_block.init_weight() split_fn = split_param_func() self.embed_tokens.weight.set_value( paddle.to_tensor(state_dict["llama.embed_tokens.weight"]).cast(self.embed_tokens.weight.dtype) ) - self.norm.weight.set_value(paddle.to_tensor(state_dict["llama.norm.weight"]).cast(self.norm.weight.dtype)) + if not is_eagle: + self.norm.weight.set_value(paddle.to_tensor(state_dict["llama.norm.weight"]).cast(self.norm.weight.dtype)) if self.use_weight_only: logger.info("weight only is enabled") for idx in range(self.config.num_hidden_layers): logger.info(f"set state for layer {idx}") - 
self.transformer_block.ln_scales[idx].set_value( - paddle.to_tensor(state_dict["llama.layers.{}.input_layernorm.weight".format(idx)]).cast( - self.transformer_block.ln_scales[idx].dtype + if not is_eagle: + self.transformer_block.ln_scales[idx].set_value( + paddle.to_tensor(state_dict["llama.layers.{}.input_layernorm.weight".format(idx)]).cast( + self.transformer_block.ln_scales[idx].dtype + ) ) - ) if "llama.layers.{}.self_attn.qkv_proj.weight".format(idx) in state_dict.keys(): concated_qkv_weight = paddle.to_tensor( np.concatenate( @@ -1443,6 +1447,75 @@ def forward( ) hidden_states = self.norm(hidden_states) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=None, + hidden_states=None, + attentions=None, + cum_offsets=cum_offsets, + ) + + +@register_base_model +class EagleForLlamaInferenceModel(LlamaBlockInferenceModel): + def __init__(self, config: LlamaConfig): + self.append_attn = config.append_attn + super().__init__(config) + self.max_seq_len = config.max_seq_len + self.block_size = config.block_size + from paddle.distributed.fleet.layers.mpu.mp_layers import ColumnParallelLinear + + if config.tensor_parallel_degree > 1: + self.fc = ColumnParallelLinear( + self.hidden_size * 2, self.hidden_size, has_bias=True, gather_output=False, fuse_matmul_bias=True + ) + else: + self.fc = nn.Linear(self.hidden_size * 2, self.hidden_size, bias_attr=True) + + def forward( + self, + input_ids=None, + attention_mask=None, + inputs_embeds=None, + caches=None, + pre_caches=None, + output_attentions=False, + output_hidden_states=None, + return_dict=False, + **kwargs, + ): + seq_lens_this_time = kwargs.get("seq_lens_this_time", None) + rope_emb = kwargs.get("rope_emb", None) + draft_tokens = kwargs.get("draft_tokens", None) + seq_lens_encoder = kwargs.get("seq_lens_encoder", None) + pre_hidden_states = kwargs.get("pre_hidden_states", None) + ids_remove_padding, padding_offset, cum_offsets, cu_seqlens_q, cu_seqlens_k = self.remove_padding( + input_ids, seq_lens_this_time, draft_tokens, seq_lens_encoder + ) + + kwargs["cu_seqlens_q"] = cu_seqlens_q + kwargs["cu_seqlens_k"] = cu_seqlens_k + kwargs["padding_offsets"] = padding_offset + kwargs["max_input_length"] = self.max_seq_len + + inputs_embeds = self.embed_tokens(ids_remove_padding) + inputs_embeds = paddle.concat([inputs_embeds, pre_hidden_states], axis=-1) + inputs_embeds = self.fc(inputs_embeds) + + with dy2st_nocheck_guard_context(): + hidden_states, _ = self.transformer_block( + input_ids=input_ids, + src=inputs_embeds, + cum_offsets=cum_offsets, + attn_mask=attention_mask, + caches=caches, + pre_caches=pre_caches, + rotary_embs=rope_emb, + post_rebuild_padding=True, + **kwargs, + ) + # hidden_states = self.norm(hidden_states) + return BaseModelOutputWithPastAndCrossAttentions( last_hidden_state=hidden_states, past_key_values=None, @@ -1730,6 +1803,7 @@ def __init__(self, config): self.max_candidate_len = config.get("speculate_max_candidate_len", 5) self.verify_window = config.get("speculate_verify_window", 2) self.max_seq_len = config.max_seq_len + self.return_full_hidden_states = config.get("return_full_hidden_states", False) self.llama = LlamaBlockInferenceModel(config) if config.tie_word_embeddings: @@ -1929,14 +2003,31 @@ def forward( draft_tokens=draft_tokens, output_padding_offset=output_padding_offset, ) - - hidden_states = outputs[0] + # hidden_states = outputs[0] + if self.return_full_hidden_states: + from paddlenlp_ops import rebuild_padding_v2 + + # full_hidden_states = outputs[1] 
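# The return_full_hidden_states branch below keeps the packed [total_tokens, hidden] states
# for the EAGLE/MTP draft head, while rebuild_padding_v2 recovers one row per sequence for
# the LM head. A rough NumPy sketch of that gather (assumed helper, not the fused kernel):
import numpy as np

def gather_last_hidden(packed_hidden, seq_lens):
    """packed_hidden: [sum(seq_lens), hidden]; returns the last position of each sequence."""
    ends = np.cumsum(seq_lens) - 1   # index of each sequence's final token in the packed layout
    return packed_hidden[ends]
# e.g. seq_lens = [3, 1, 2] -> rows 2, 3 and 5 feed the LM head; the full packed tensor is
# also returned so the draft model can condition on every accepted position.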
+ full_hidden_states = outputs[0] + cum_offsets = outputs[1] + hidden_states = rebuild_padding_v2( + full_hidden_states, + cum_offsets, + seq_lens_decoder, + seq_lens_encoder, + output_padding_offset, + self.max_seq_len, + ) + else: + hidden_states = outputs[0] logits = self.lm_head( hidden_states, tensor_parallel_output=False, ) - - return logits + if self.return_full_hidden_states: + return logits, full_hidden_states + else: + return logits @paddle.no_grad() def set_state_dict(self, state_dict): @@ -1947,6 +2038,121 @@ def set_state_dict(self, state_dict): self.llama.set_state_dict({k: state_dict[k] for k in state_dict.keys()}) +class EagleLlamaForCausalLMBlockInferenceModel(LlamaForCausalLMBlockInferenceModel): + def __init__(self, config): + super(LlamaForCausalLMBlockInferenceModel, self).__init__(config) + self.max_candidate_len = config.get("speculate_max_candidate_len", 5) + self.verify_window = config.get("speculate_verify_window", 2) + self.max_seq_len = config.max_seq_len + + self.eagle = EagleForLlamaInferenceModel(config) + if config.tie_word_embeddings: + self.lm_head = LlamaLMHead(config, embedding_weights=self.llama.embed_tokens.weight, transpose_y=True) + self.tie_weights() + else: + self.lm_head = LlamaLMHead(config) + + def prepare_inputs_for_generation(self, **kwargs): + # only last token for inputs_ids if cache is defined in kwargs + input_ids = kwargs["input_ids"] + src_mask = kwargs.get("src_mask", None) + block_tables = kwargs.get("block_tables", None) + + pre_caches = kwargs.get("pre_caches", None) + caches = kwargs.get("caches", None) + + rope_emb = kwargs["rope_emb"] + seq_lens_this_time = kwargs["seq_lens_this_time"] + seq_lens_encoder = kwargs["seq_lens_encoder"] + seq_lens_decoder = kwargs["seq_lens_decoder"] + k_quant_scales = kwargs.get("k_quant_scales", None) + v_quant_scales = kwargs.get("v_quant_scales", None) + k_dequant_scales = kwargs.get("k_dequant_scales", None) + v_dequant_scales = kwargs.get("v_dequant_scales", None) + + # speculative decoding related parameters + draft_tokens = kwargs.get("draft_tokens", None) + output_padding_offset = kwargs.get("output_padding_offset", None) + hidden_states = kwargs.get("hidden_states", None) + + model_inputs = { + "input_ids": input_ids, + "src_mask": src_mask, + "rope_emb": rope_emb, + "pre_caches": pre_caches, + "caches": caches, + "seq_lens_this_time": seq_lens_this_time, + "seq_lens_encoder": seq_lens_encoder, + "seq_lens_decoder": seq_lens_decoder, + "block_tables": block_tables, + "k_quant_scales": k_quant_scales, + "v_quant_scales": v_quant_scales, + "k_dequant_scales": k_dequant_scales, + "v_dequant_scales": v_dequant_scales, + "draft_tokens": draft_tokens, + "output_padding_offset": output_padding_offset, + "pre_hidden_states": hidden_states, + } + return model_inputs + + @paddle.no_grad() + def set_state_dict(self, state_dict): + if "lm_head.weight" in state_dict: + self.lm_head.weight.set_value( + paddle.to_tensor(state_dict["lm_head.weight"]).cast(self.lm_head.weight.dtype) + ) + self.eagle.fc.weight.set_value(paddle.to_tensor(state_dict["llama.fc.weight"]).cast(self.lm_head.weight.dtype)) + self.eagle.fc.bias.set_value(paddle.to_tensor(state_dict["llama.fc.bias"]).cast(self.lm_head.weight.dtype)) + self.eagle.set_state_dict({k: state_dict[k] for k in state_dict.keys()}, True) + + def forward( + self, + input_ids, + src_mask=None, + pre_caches=None, + caches=None, + seq_lens_this_time=None, + seq_lens_encoder=None, + seq_lens_decoder=None, + rope_emb=None, + block_tables=None, + k_quant_scales=None, 
+ v_quant_scales=None, + k_dequant_scales=None, + v_dequant_scales=None, + draft_tokens=None, + output_padding_offset=None, + pre_hidden_states=None, + ): + outputs = self.eagle( + input_ids, + src_mask=src_mask, + caches=caches, + rope_emb=rope_emb, + block_tables=block_tables, + pre_caches=pre_caches, + seq_lens_this_time=seq_lens_this_time, + seq_lens_encoder=seq_lens_encoder, + seq_lens_decoder=seq_lens_decoder, + k_quant_scales=k_quant_scales, + v_quant_scales=v_quant_scales, + k_dequant_scales=k_dequant_scales, + v_dequant_scales=v_dequant_scales, + draft_tokens=draft_tokens, + output_padding_offset=output_padding_offset, + pre_hidden_states=pre_hidden_states, + ) + + hidden_states = outputs[0] + + logits = self.lm_head( + hidden_states, + tensor_parallel_output=False, + ) + + return logits, hidden_states + + class LlamaForMiniGPT4InferenceModel(LlamaForCausalLMInferenceModel): """ This class is 99% like LlamaForCausalLMInferenceModel. diff --git a/paddlenlp/experimental/transformers/mixtral/modeling.py b/paddlenlp/experimental/transformers/mixtral/modeling.py index 27e638d9d9f1..73479fc243d1 100644 --- a/paddlenlp/experimental/transformers/mixtral/modeling.py +++ b/paddlenlp/experimental/transformers/mixtral/modeling.py @@ -289,13 +289,14 @@ def __init__(self, config: MixtralConfig): top_k=self.moe_topk, norm_topk_prob=True, moe_every2=self.moe_every2, + moe_intermediate_size=self.intermediate_size, ) transformer_config = FusedMultiTransformerConfig( embed_dim=self.hidden_size, num_heads=self.num_attention_heads, kv_num_heads=self.num_key_value_heads, - dim_feedforward=self.intermediate_size, + intermediate_size=self.intermediate_size, quant_type=self.quant_type, activation="swiglu", num_layers=config.num_hidden_layers, @@ -643,9 +644,9 @@ def set_state_dict(self, state_dict): ) ffn1_quanted_weight_list.append( ffn1_quanted_weight_list_i.reshape( - [self.transformer_block.embed_dim, self.transformer_block.dim_feedforward * 2] + [self.transformer_block.embed_dim, self.transformer_block.intermediate_size * 2] if self.quant_type == "weight_only_int8" - else [self.transformer_block.embed_dim, self.transformer_block.dim_feedforward] + else [self.transformer_block.embed_dim, self.transformer_block.intermediate_size] ) ) ffn1_quanted_weight_scale.append(ffn1_quanted_weight_scale_i) @@ -682,9 +683,9 @@ def set_state_dict(self, state_dict): ) ffn2_quanted_weight_list.append( ffn2_quanted_weight_list_i.reshape( - [self.transformer_block.dim_feedforward, self.transformer_block.embed_dim] + [self.transformer_block.intermediate_size, self.transformer_block.embed_dim] if self.quant_type == "weight_only_int8" - else [self.transformer_block.dim_feedforward, self.transformer_block.embed_dim // 2] + else [self.transformer_block.intermediate_size, self.transformer_block.embed_dim // 2] ) ) ffn2_quanted_weight_scale.append(ffn2_quanted_weight_scale_i) diff --git a/paddlenlp/experimental/transformers/proposers.py b/paddlenlp/experimental/transformers/proposers.py index f2a1d2b0a50f..6196d621ee9f 100644 --- a/paddlenlp/experimental/transformers/proposers.py +++ b/paddlenlp/experimental/transformers/proposers.py @@ -15,7 +15,17 @@ from abc import ABC, abstractmethod +import numpy as np import paddle +from paddlenlp_ops import ( + draft_model_postprocess, + draft_model_preprocess, + eagle_get_base_model_hidden_states, + eagle_get_self_hidden_states, +) + +from paddlenlp.transformers import AutoConfig, AutoInferenceModelForCausalLM +from paddlenlp.trl import llm_utils class Proposer(ABC): @@ -35,6 +45,20 @@ 
def run(self, model_inputs: dict[str, paddle.Tensor], **kargs): """ raise NotImplementedError() + @abstractmethod + def insert_query(self, **kwargs): + """ + Insert new query + """ + pass + + @abstractmethod + def postprocess(self, **kargs): + """ + Postprocessing finished query + """ + pass + class InferenceWithReferenceProposer(Proposer): """ @@ -84,6 +108,7 @@ def run(self, model_inputs: dict[str, paddle.Tensor], **kargs): seq_lens_this_time, seq_lens_encoder, seq_lens_decoder, + model_inputs["max_length"].cpu(), kargs["real_batch_size"], self.max_ngram_size, self.max_draft_token_num, @@ -92,3 +117,268 @@ def run(self, model_inputs: dict[str, paddle.Tensor], **kargs): model_inputs["draft_tokens"][:] = draft_tokens.cuda() model_inputs["seq_lens_encoder"][:] = seq_lens_encoder.cuda() kargs["seq_lens_this_time"][:] = seq_lens_this_time.cuda() + + def insert_query(self, **kwargs): + """ + Insert new query + """ + pass + + def postprocess(self, **kwargs): + """ + Postprocessing finished query + """ + + +class ModelProposer(Proposer): + """ + 用于类 Model 的 Proposer 基类 + 在输入输出中匹配符合的tokens作为 draft tokens + """ + + def __init__(self, args, **kwargs): + super().__init__() + self.args = self.build_args(args) + self.model_args = kwargs.get("model_args", None) + self.draft_type = args.speculate_method + self.dtype = self.args.dtype + assert self.draft_type in ( + "draft_model", + "eagle", + "mtp", + ), f"draft_type support [draft_model, eagle], but get {self.draft_type}" + + self.max_draft_tokens = self.args.speculate_max_draft_token_num + self.actual_draft_token_num = self.max_draft_tokens + self.batch_size = self.args.batch_size + self.init_predictor() + + def build_args(self, args): + from copy import deepcopy + + draft_model_args = deepcopy(args) + draft_model_args.quant_type = args.draft_model_quant_type + draft_model_args.model_name_or_path = args.draft_model_name_or_path + draft_model_args.decode_strategy = "draft_model_sample" + draft_model_args.mode = "dynamic" + draft_model_args.return_full_hidden_states = 0 + return draft_model_args + + def init_predictor(self): + """ + init_predictor + """ + + tensor_parallel_rank, tensor_parallel_degree = llm_utils.init_dist_env() + + self.config = AutoConfig.from_pretrained(self.args.draft_model_name_or_path) + self.model = AutoInferenceModelForCausalLM.from_pretrained( + self.args.model_name_or_path, + config=self.config, + predictor_args=self.args, + model_args=self.model_args, + dtype=self.args.dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + spec_model_type=self.draft_type, + ) + + # prepare model_inputs + self.model_inputs = {} + + self.cache_kvs_shape = self.model.get_cache_kvs_shape(self.model.config, self.args.batch_size) + cachekv_dtype = self.dtype if self.config.cachekv_int8_type is None else "uint8" + self.cache_kvs = [paddle.zeros(shape, dtype=cachekv_dtype) for shape in self.cache_kvs_shape] + + self.max_block_nums = self.cache_kvs_shape[0][0] + self.free_list = list(range(self.max_block_nums)) + self.pre_ids = paddle.to_tensor(np.zeros((self.batch_size, self.args.total_max_length)).astype("int64") - 1) + self.rope_theta = self.config.get("rope_theta", 10000.0) + self.rope_scaling = self.config.get("rope_scaling", None) + + self.head_dim = self.cache_kvs_shape[0][-1] + self.rope_emb = llm_utils.get_rotary_position_embedding( + paddle.arange(self.args.total_max_length).reshape((1, -1)), + self.head_dim, + self.rope_theta, + self.rope_scaling, + ) + + def run(self, share_inputs, **kwargs): 
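+        """Run one draft round: sync state from the base model, let the draft
+        model propose tokens, then write them back for verification."""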
+ self.run_preprocess(share_inputs) + self.run_infer(share_inputs, **kwargs) + self.run_postprocess(share_inputs) + + def insert_query(self, **kwargs): + real_bs = kwargs.get("real_bs") + seq_lens = kwargs.get("seq_lens") + base_model_inputs = kwargs.get("base_model_inputs") + + max_sec_len = self.args.total_max_length + self.model_inputs["block_tables"] = paddle.full_like( + base_model_inputs["block_tables"], fill_value=-1, dtype="int32" + ) + for i in range(real_bs): + real_len = seq_lens[i] + self.args.max_length + if real_len > max_sec_len: + self.free_list = list(range(self.max_block_nums)) + # self.used_list = [[] for _ in range(self.beam_batch_size)] + raise ValueError( + f"input_len({seq_lens[i]}) + \ +max_dec_len({self.args.max_length}) > max_seq_len({max_sec_len})" + ) + for j in range((real_len + self.args.block_size - 1) // self.args.block_size): + used_block_id = self.free_list.pop() + self.model_inputs["block_tables"][i, j] = used_block_id + + self.model_inputs["input_ids"] = paddle.clone(base_model_inputs["input_ids"]) + self.model_inputs["seq_lens_this_time"] = paddle.clone(base_model_inputs["seq_lens_this_time"]) + self.model_inputs["seq_lens_encoder"] = paddle.clone(base_model_inputs["seq_lens_encoder"]) + self.model_inputs["seq_lens_decoder"] = paddle.clone(base_model_inputs["seq_lens_decoder"]) + self.model_inputs["step_idx"] = paddle.clone(base_model_inputs["step_idx"]) + self.model_inputs["stop_flags"] = paddle.clone(base_model_inputs["stop_flags"]) + self.model_inputs["stop_nums"] = paddle.clone(base_model_inputs["stop_nums"]) + self.model_inputs["not_need_stop"] = paddle.to_tensor([False], dtype="bool", place="cpu") + self.model_inputs["pre_ids"] = self.pre_ids + self.model_inputs["rope_emb"] = self.rope_emb + self.model_inputs["cache_kvs"] = self.cache_kvs + self.model_inputs["top_p"] = base_model_inputs["top_p"] + self.model_inputs["temperature"] = base_model_inputs["temperature"] + self.model_inputs["eos_token_id"] = base_model_inputs["eos_token_id"] + self.model_inputs["penalty_score"] = base_model_inputs["penalty_score"] + self.model_inputs["frequency_score"] = base_model_inputs["frequency_score"] + self.model_inputs["presence_score"] = base_model_inputs["presence_score"] + self.model_inputs["max_length"] = base_model_inputs["max_length"] + self.model_inputs["min_length"] = base_model_inputs["min_length"] + self.model_inputs["bad_tokens"] = base_model_inputs["bad_tokens"] + self.model_inputs["next_tokens"] = paddle.full(shape=[self.batch_size, 1], fill_value=-1, dtype="int64") + self.model_inputs["base_model_draft_tokens"] = base_model_inputs["draft_tokens"] + self.model_inputs["draft_tokens"] = paddle.full(shape=[self.batch_size, 2], fill_value=-1, dtype="int64") + + self.first_token_record = paddle.full(shape=[self.batch_size, 1], fill_value=-1, dtype="int32") + self.model_inputs["substep"] = 0 + for i in range(real_bs): + self.model_inputs["pre_ids"][i, 0] = self.model_inputs["input_ids"][i, -1] + self.first_token_record[i : i + 1] = seq_lens[i] + + def run_preprocess(self, share_inputs): + """ + update draft model parameteds + """ + draft_model_preprocess( + self.model_inputs["draft_tokens"], + self.model_inputs["input_ids"], + self.model_inputs["stop_flags"], + self.model_inputs["seq_lens_this_time"], + self.model_inputs["seq_lens_encoder"], + self.model_inputs["seq_lens_decoder"], + self.model_inputs["step_idx"], + self.first_token_record, + self.model_inputs["not_need_stop"], + share_inputs["accept_tokens"], + share_inputs["accept_num"], + 
share_inputs["seq_lens_encoder"], + share_inputs["seq_lens_decoder"], + share_inputs["step_idx"], + share_inputs["stop_flags"], + share_inputs["draft_tokens"], + self.max_draft_tokens, + self.draft_type in ["eagle", "mtp"], + ) + + def run_infer(self, share_inputs, **kwargs): + """ + Should be implemented by subclasses. + """ + raise NotImplementedError("Subclasses mut implement this function") + + def run_postprocess(self, share_inputs): + """ + Update base model draft_tokens + """ + draft_model_postprocess( + share_inputs["draft_tokens"], + share_inputs["seq_lens_this_time"], + share_inputs["seq_lens_encoder"], + share_inputs["stop_flags"], + ) + + def postprocess(self, base_model_inputs): + for i in range(self.batch_size): + if not base_model_inputs["stop_flags"][i]: + break + self.pre_ids[:] = -1 + self.free_list = list(range(self.max_block_nums)) + # self.used_list = [[] for _ in range(self.beam_batch_size)] + + +# Not verified Now. Reserverd API. +class DraftModelProposer(ModelProposer): + """ + 用于 Draft Model 的 Proposer + 在输入输出中匹配符合的tokens作为 draft tokens + """ + + def insert_query(self, **kwargs): + super().insert_query(**kwargs) + real_bs = kwargs.get("real_bs") + self.model_inputs["seq_lens_encoder"] += 1 + self.model_inputs["seq_lens_this_time"] += 1 + for i in range(real_bs): + self.first_token_record[i : i + 1] += 1 + + def run_infer(self, share_inputs, **kwargs): + if self.model_inputs["not_need_stop"]: + with paddle.no_grad(): + self.model_inputs["substep"] = 0 + while self.model_inputs["substep"] < self.max_draft_tokens and self.model_inputs["not_need_stop"]: + self.model(**self.model_inputs) + self.model_inputs["substep"] += 1 + + +class EagleProposer(ModelProposer): + """ + 用于 Eagle 的 Proposer + 在输入输出中匹配符合的tokens作为 draft tokens + """ + + def insert_query(self, **kwargs): + super().insert_query(**kwargs) + base_model_inputs = kwargs.get("base_model_inputs") + + self.model_inputs["input_ids"][:, :-1] = base_model_inputs["input_ids"][:, 1:] + self.last_seq_lens_this_time = paddle.full_like( + base_model_inputs["seq_lens_this_time"], fill_value=-1, dtype="int32" + ) + + def run_infer(self, share_inputs, **kwargs): + base_model_full_hidden_states = kwargs.get("base_model_full_hidden_states", None) + if self.model_inputs["not_need_stop"]: + base_model_hidden_states = eagle_get_base_model_hidden_states( + base_model_full_hidden_states, + self.model_inputs["seq_lens_this_time"], + self.model_inputs["seq_lens_encoder"], + self.model_inputs["seq_lens_decoder"], + self.model_inputs["stop_flags"], + share_inputs["accept_num"], + share_inputs["seq_lens_this_time"], + share_inputs["seq_lens_encoder"], + self.actual_draft_token_num, + ) + self.model_inputs["hidden_states"] = base_model_hidden_states + + with paddle.no_grad(): + self.model_inputs["substep"] = 0 + while self.model_inputs["not_need_stop"] and self.model_inputs["substep"] < self.max_draft_tokens: + self.last_seq_lens_this_time[:] = self.model_inputs["seq_lens_this_time"][:] + output_hidden_states = self.model.generate(**self.model_inputs) + self.model_inputs["substep"] += 1 + if self.model_inputs["not_need_stop"] and self.model_inputs["substep"] < self.actual_draft_token_num: + self.model_inputs["hidden_states"] = eagle_get_self_hidden_states( + output_hidden_states, + self.last_seq_lens_this_time, + self.model_inputs["seq_lens_this_time"], + self.model_inputs["step_idx"], + ) + else: + self.model_inputs["hidden_states"] = None diff --git a/paddlenlp/experimental/transformers/qwen2/modeling.py 
b/paddlenlp/experimental/transformers/qwen2/modeling.py index 6098079d9084..a54b00f27dca 100644 --- a/paddlenlp/experimental/transformers/qwen2/modeling.py +++ b/paddlenlp/experimental/transformers/qwen2/modeling.py @@ -87,8 +87,10 @@ def forward(self, x): @register_base_model class Qwen2InferenceModel(Qwen2PretrainedModel): - def __init__(self, config: Qwen2Config): + def __init__(self, config: Qwen2Config, base_model_prefix: str): super(Qwen2PretrainedModel, self).__init__(config) + self.base_model_prefix = base_model_prefix + self.vocab_size = config.vocab_size self.hidden_size = config.hidden_size self.num_attention_heads = config.num_attention_heads @@ -306,7 +308,7 @@ def __init__(self, config: Qwen2Config): embed_dim=self.hidden_size, num_heads=self.num_attention_heads, kv_num_heads=self.num_key_value_heads, - dim_feedforward=self.intermediate_size, + intermediate_size=self.intermediate_size, quant_type=self.quant_type, activation="swiglu", num_layers=config.num_hidden_layers, @@ -771,7 +773,9 @@ def set_state_dict(self, state_dict): ffn1_weight.cast(self.transformer_block.ffn1_weights[idx].dtype) ) - ffn2_weight = paddle.to_tensor(state_dict[f"{model_prefix}.mlp.down_proj.weight"]) + ffn2_weight = paddle.to_tensor(state_dict[f"{model_prefix}.mlp.down_proj.weight"]).cast( + paddle.get_default_dtype() + ) if self.use_weight_only: ffn2_quanted_weight, ffn2_weight_scale = weight_quantize(ffn2_weight, algo=self.quant_algo) self.transformer_block.ffn2_weights[idx].set_value(ffn2_quanted_weight) @@ -1051,9 +1055,11 @@ def forward( class Qwen2ForCausalLMInferenceModel(GenerationInferenceModel, Qwen2PretrainedModel): - def __init__(self, config: Qwen2Config, **kwargs): - super(Qwen2ForCausalLMInferenceModel, self).__init__(config) - self.qwen2 = Qwen2InferenceModel(config) + def __init__(self, config: Qwen2Config, base_model_prefix: str = "qwen2"): + super().__init__(config) + self.base_model_prefix = base_model_prefix + + self.qwen2 = Qwen2InferenceModel(config, base_model_prefix) if config.tie_word_embeddings: self.lm_head = Qwen2LMHead(config, embedding_weights=self.qwen2.embed_tokens.weight, transpose_y=True) self.tie_weights() @@ -1214,9 +1220,9 @@ def set_state_dict(self, state_dict): @register_base_model class Qwen2BlockInferenceModel(Qwen2InferenceModel): - def __init__(self, config: Qwen2Config): + def __init__(self, config: Qwen2Config, base_model_prefix: str): self.append_attn = config.append_attn - super().__init__(config) + super().__init__(config, base_model_prefix) self.max_seq_len = config.max_seq_len self.block_size = config.block_size @@ -1309,13 +1315,15 @@ class Qwen2ForCausalLMBlockInferenceModel(GenerationBlockInferenceModel, Qwen2Pr _keys_to_ignore_on_load_missing = [r"lm_head.weight"] - def __init__(self, config): + def __init__(self, config: Qwen2Config, base_model_prefix: str = "qwen2"): super().__init__(config) + self.base_model_prefix = base_model_prefix + self.max_candidate_len = config.get("speculate_max_candidate_len", 5) self.verify_window = config.get("speculate_verify_window", 2) self.max_seq_len = config.max_seq_len - self.qwen2 = Qwen2BlockInferenceModel(config) + self.qwen2 = Qwen2BlockInferenceModel(config, base_model_prefix) if config.tie_word_embeddings: self.lm_head = Qwen2LMHead(config, embedding_weights=self.qwen2.embed_tokens.weight, transpose_y=True) self.tie_weights() @@ -1347,6 +1355,31 @@ def get_tensor_parallel_split_mappings(num_layers): "layers.0.mlp.down_proj.weight": partial(fn, is_column=False), } + if "a8w8" in config.quant_type: + if 
config.quantization_config.shift_smooth_all_linears: + base_actions["layers.0.self_attn.o_proj.shift_bias"] = partial(fn, is_column=True) + base_actions["layers.0.self_attn.o_proj.smooth_weight"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.down_proj.shift_bias"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.down_proj.smooth_weight"] = partial(fn, is_column=True) + + if config.quantization_config.shift: + if config.fuse_attention_qkv: + base_actions["layers.0.self_attn.qkv_proj.bias"] = partial(fn, is_column=True) + else: + base_actions["layers.0.self_attn.q_proj.bias"] = partial(fn, is_column=True) + # if we have enough num_key_value_heads to split, then split it. + if config.num_key_value_heads % config.tensor_parallel_degree == 0: + base_actions["layers.0.self_attn.k_proj.bias"] = partial(fn, is_column=True) + base_actions["layers.0.self_attn.v_proj.bias"] = partial(fn, is_column=True) + + if config.fuse_attention_ffn: + base_actions["layers.0.mlp.gate_up_fused_proj.bias"] = partial( + fn, is_column=True, is_naive_2fuse=True + ) + else: + base_actions["layers.0.mlp.gate_proj.bias"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.up_proj.bias"] = partial(fn, is_column=True) + # Column Linear if config.fuse_attention_qkv: base_actions["layers.0.self_attn.qkv_proj.weight"] = partial(fn, is_column=True) @@ -1520,6 +1553,5 @@ class Qwen2VLForConditionalGenerationBlockInferenceModel(Qwen2ForCausalLMBlockIn """ # NOTE: (changwenbin) This function corresponds to QWen2-VL's second part, only used for QWen2-VL. - def __init__(self, config): - super().__init__(config) - self.qwen2.base_model_prefix = "model" + def __init__(self, config: Qwen2Config): + super().__init__(config, base_model_prefix="model") diff --git a/paddlenlp/experimental/transformers/qwen2_moe/modeling.py b/paddlenlp/experimental/transformers/qwen2_moe/modeling.py index 1aa0969b4a11..9b1600fafd58 100644 --- a/paddlenlp/experimental/transformers/qwen2_moe/modeling.py +++ b/paddlenlp/experimental/transformers/qwen2_moe/modeling.py @@ -78,7 +78,6 @@ def __init__(self, config: Qwen2MoeConfig): self.num_key_value_heads = config.num_key_value_heads self.num_layers = config.num_hidden_layers self.rms_norm_eps = config.rms_norm_eps - self.max_position_embeddings = config.max_position_embeddings self.quant_type = config.quant_type self.rope_theta = config.rope_theta @@ -217,7 +216,7 @@ def __init__(self, config: Qwen2MoeConfig): num_experts=self.num_experts, top_k=self.moe_topk, norm_topk_prob=self.norm_topk_prob, - moe_every2=False, + moe_intermediate_size=self.moe_intermediate_size, shared_expert_intermediate_size=self.shared_expert_intermediate_size, shared_expert_ffn1_weight_attrs=shared_expert_ffn1_weight_attrs, shared_expert_ffn1_weight_scale_attrs=shared_expert_ffn1_weight_scale_attrs, @@ -230,7 +229,7 @@ def __init__(self, config: Qwen2MoeConfig): embed_dim=self.hidden_size, num_heads=self.num_attention_heads, kv_num_heads=self.num_key_value_heads, - dim_feedforward=self.moe_intermediate_size, + intermediate_size=self.moe_intermediate_size, quant_type=self.quant_type, activation="swiglu", num_layers=config.num_hidden_layers, diff --git a/paddlenlp/generation/utils.py b/paddlenlp/generation/utils.py index e7c3dd162643..7d6ee91d259f 100644 --- a/paddlenlp/generation/utils.py +++ b/paddlenlp/generation/utils.py @@ -742,7 +742,7 @@ def generate( # ['是的', '嗯嗯'] """ if generation_config is None: - if self.generation_config._from_model_config: + if self.generation_config is None or 
self.generation_config._from_model_config: new_generation_config = GenerationConfig.from_model_config(self.config) if new_generation_config != self.generation_config: logger.warning( @@ -1437,16 +1437,8 @@ def _post_process_( outputs, input_ids, cur_len_gpu, origin_len_gpu, scores, unfinished_flag, model_kwargs, pad_token_id ) - if hasattr(paddle.framework, "_no_check_dy2st_diff"): - # TODO(daisiming): _no_check_dy2st_diff is used to turn off the checking of behavior - # inconsistency between dynamic graph and static graph. _no_check_dy2st_diff should be - # removed after static graphs support inplace and stride. - with paddle.framework._no_check_dy2st_diff(): - paddle.increment(cur_len) - paddle.increment(cur_len_gpu) - else: - paddle.increment(cur_len) - paddle.increment(cur_len_gpu) + cur_len += 1 + cur_len_gpu += 1 attn_mask = model_kwargs["attention_mask"] # make the shape of attention_mask = (-1, -1, -1, -1) in dy2static. @@ -1454,38 +1446,19 @@ def _post_process_( model_kwargs["cache"] = outputs[1] if isinstance(outputs, tuple) else None max_new_tokens = paddle.full([1], max_new_tokens + cur_len - 1, dtype="int64") - if hasattr(paddle.framework, "_no_check_dy2st_diff"): - # TODO(daisiming): _no_check_dy2st_diff is used to turn off the checking of behavior - # inconsistency between dynamic graph and static graph. _no_check_dy2st_diff should be - # removed after static graphs support inplace and stride. - with paddle.framework._no_check_dy2st_diff(): - while cur_len < max_new_tokens and paddle.any(unfinished_flag): - input_ids, scores, unfinished_flag, model_kwargs = _post_process_( - _forward_(**model_kwargs), - input_ids, - cur_len_gpu, - origin_len_gpu, - scores, - unfinished_flag, - model_kwargs, - pad_token_id, - ) - paddle.increment(cur_len) - paddle.increment(cur_len_gpu) - else: - while cur_len < max_new_tokens and paddle.any(unfinished_flag): - input_ids, scores, unfinished_flag, model_kwargs = _post_process_( - _forward_(**model_kwargs), - input_ids, - cur_len_gpu, - origin_len_gpu, - scores, - unfinished_flag, - model_kwargs, - pad_token_id, - ) - paddle.increment(cur_len) - paddle.increment(cur_len_gpu) + while cur_len < max_new_tokens and paddle.any(unfinished_flag): + input_ids, scores, unfinished_flag, model_kwargs = _post_process_( + _forward_(**model_kwargs), + input_ids, + cur_len_gpu, + origin_len_gpu, + scores, + unfinished_flag, + model_kwargs, + pad_token_id, + ) + cur_len += 1 + cur_len_gpu += 1 return input_ids[:, origin_len:], scores diff --git a/paddlenlp/mergekit/merge_config.py b/paddlenlp/mergekit/merge_config.py index df044d826838..e440cec580f8 100644 --- a/paddlenlp/mergekit/merge_config.py +++ b/paddlenlp/mergekit/merge_config.py @@ -17,10 +17,7 @@ from dataclasses import asdict, dataclass, field from typing import List, Optional -import paddle - from paddlenlp.utils.env import MERGE_CONFIG_NAME -from paddlenlp.utils.log import logger @dataclass @@ -30,15 +27,16 @@ class MergeConfig: """ # Common parameters - device: str = field(default="cpu", metadata={"help": "Device to use for the merge.ex cpu、 gpu、low_gpu_mem"}) tensor_type: str = field( default="np", metadata={"help": "Tensor type to use for the merge. 
Choose np(CPU Only) or pd (CPU/GPU)"} ) n_process: int = field(default=1, metadata={"help": "Number of processes to use for the merge."}) - merge_preifx: str = field(default="model", metadata={"help": "Prefix name: model or master_weights"}) + merge_prefix: str = field(default="model", metadata={"help": "Prefix name: model or master_weights"}) merge_method: str = field(default="linear", metadata={"help": "The merge strategy."}) merge_type: str = field(default="linear", metadata={"help": "The type of merge process."}) sparsify_type: str = field(default=None, metadata={"help": "The type of sparsify process."}) + split_pieces: int = field(default=8, metadata={"help": "Split large tensor to multi-piece"}) + max_tensor_mem: float = field(default=0.5, metadata={"help": "Split tensor if exceed setting max_tensor_mem."}) # Model parameters model_path_list: Optional[List[str]] = field(default=None, metadata={"help": "Merge model name or path list"}) @@ -46,7 +44,11 @@ class MergeConfig: default=None, metadata={"help": "Merge model name or path string.(split by ',')"} ) base_model_path: str = field(default=None, metadata={"help": "Base model name or path."}) - output_path: str = field(default=None, metadata={"help": "Base model name or path."}) + output_path: str = field(default=None, metadata={"help": "Output model name or path."}) + lora_model_path: str = field(default=None, metadata={"help": "LoRA model name or path."}) + copy_file_list: Optional[List[str]] = field( + default=None, metadata={"help": "Copy file list from base model path or first model path."} + ) # merge parameters weight_list: Optional[List[float]] = field( default=None, metadata={"help": "Relative (or absolute if normalize=False) weighting of a given tensor"} @@ -73,35 +75,45 @@ def __post_init__(self): def config_check(self): if self.output_path is not None: os.makedirs(self.output_path, exist_ok=True) - if self.tensor_type not in ["np"]: - raise ValueError(f"Unsupported tensor type: {self.tensor_type}. Support 'np' only.") - if self.device != "cpu": - logger.warning(f"Currently only support cpu device, but got {self.device}. Setting `device` to `cpu`.") - self.device = "cpu" - self.tensor_type = "np" - - elif self.merge_method not in ["linear", "ties", "slerp", "della_linear", "della", "dare_linear", "dare_ties"]: - raise ValueError( - f"Unsupported merge strategy: {self.merge_method}. Please choose one from ['linear', 'slerp']." - ) - if self.model_path_str is not None: - self.model_path_list = self.model_path_str.split(",") - if self.model_path_list is not None: - if not isinstance(self.model_path_list, list) or len(self.model_path_list) < 2: - raise ValueError(f"Please specify the model_path_list at least two. But got {self.model_path_list}") - if self.weight_list is None: - self.weight_list = [1.0] * len(self.model_path_list) - self.normalize = True - if len(self.model_path_list) != len(self.weight_list): - raise ValueError("The length of model_path_list and weight_list must be the same.") - if self.reserve_p < 0 or self.reserve_p > 1: - raise ValueError("reserve_p must be between 0 and 1.") - if "della" in self.merge_method or self.sparsify_type == "magprune": - if self.reserve_p <= self.epsilon / 2 or self.reserve_p >= (1 - self.epsilon): + if self.tensor_type not in ["np", "pd"]: + raise ValueError(f"Unsupported tensor type: {self.tensor_type}. 
Support 'np' and 'pd' only.") + if self.lora_model_path is not None: + if self.base_model_path is None: + raise ValueError("Please specify the base_model_path when using LoRA merge.") + self.tensor_type = "pd" + + if self.lora_model_path is None: + if self.merge_method not in [ + "linear", + "ties", + "slerp", + "della_linear", + "della", + "dare_linear", + "dare_ties", + ]: raise ValueError( - f"Error: reserve_p +- epsilon/2 must be in the range (0, 1). reserve_p + epsilon/2 = {self.reserve_p + self.epsilon / 2 }, reserve_p - epsilon/2 = {self.reserve_p - self.epsilon / 2 }" + f"Unsupported merge strategy: {self.merge_method}. Please choose one from ['linear', 'slerp', 'ties', 'della_linear', 'della', ']." ) - paddle.set_device(self.device) + if self.model_path_str is not None: + self.model_path_list = self.model_path_str.split(",") + if self.model_path_list is not None: + if not isinstance(self.model_path_list, list) or len(self.model_path_list) < 2: + raise ValueError( + f"Please specify the model_path_list at least two. But got {self.model_path_list}" + ) + if self.weight_list is None: + self.weight_list = [1.0] * len(self.model_path_list) + self.normalize = True + if len(self.model_path_list) != len(self.weight_list): + raise ValueError("The length of model_path_list and weight_list must be the same.") + if self.reserve_p < 0 or self.reserve_p > 1: + raise ValueError("reserve_p must be between 0 and 1.") + if "della" in self.merge_method or self.sparsify_type == "magprune": + if self.reserve_p <= self.epsilon / 2 or self.reserve_p >= (1 - self.epsilon): + raise ValueError( + f"Error: reserve_p +- epsilon/2 must be in the range (0, 1). reserve_p + epsilon/2 = {self.reserve_p + self.epsilon / 2 }, reserve_p - epsilon/2 = {self.reserve_p - self.epsilon / 2 }" + ) @property def __dict__(self): diff --git a/paddlenlp/mergekit/merge_method.py b/paddlenlp/mergekit/merge_method.py index 042ac6ed30f3..737312a75be5 100644 --- a/paddlenlp/mergekit/merge_method.py +++ b/paddlenlp/mergekit/merge_method.py @@ -13,6 +13,7 @@ # limitations under the License. 
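# The paddle ("pd") branches added below mirror the existing NumPy merge algorithms. For
# reference, a compact NumPy sketch of the slerp path (threshold and eps values here are
# assumptions; the real ones come from MergeConfig):
import numpy as np

def slerp(t0, t1, alpha, dot_threshold=0.9995, eps=1e-8):
    """Spherical interpolation between two weight tensors of the same shape."""
    v0 = t0 / max(np.linalg.norm(t0), eps)       # normalized directions
    v1 = t1 / max(np.linalg.norm(t1), eps)
    dot = np.sum(v0 * v1)
    if abs(dot) > dot_threshold:                 # nearly colinear: plain lerp is stable
        return (1 - alpha) * t0 + alpha * t1
    theta_0 = np.arccos(dot)                     # angle between the two directions
    theta_t = theta_0 * alpha                    # angle at interpolation step alpha
    s0 = np.sin(theta_0 - theta_t) / np.sin(theta_0)
    s1 = np.sin(theta_t) / np.sin(theta_0)
    return s0 * t0 + s1 * t1                     # combine the original (unnormalized) tensors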
import numpy as np +import paddle class MergeMethod: @@ -46,8 +47,13 @@ def linear(self, tensor_list): if self.merge_config.tensor_type == "np": tensor_output = sum(weight * tensor for weight, tensor in zip(weight_list, tensor_list)) return tensor_output + elif self.merge_config.tensor_type == "pd": + tensor_output = paddle.zeros_like(tensor_list[0]) + for i, tensor in enumerate(tensor_list): + tensor_output += tensor * weight_list[i] + return tensor_output else: - raise NotImplementedError("Paddle Tensor is not supported yet.") + raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}") def slerp(self, tensor_list): """ @@ -85,9 +91,38 @@ def slerp(self, tensor_list): s0 = np.sin(theta_0 - theta_t) / sin_theta_0 s1 = sin_theta_t / sin_theta_0 + return s0 * t0_copy + s1 * t1_copy + elif self.merge_config.tensor_type == "pd": + t0, t1 = tensor_list + # Copy the tensors to reuse them later + t0_copy = t0.clone() + t1_copy = t1.clone() + + # Normalize the tensors to get the directions and angles + t0 = self.normalize(t0) + t1 = self.normalize(t1) + + # Dot product with the normalized tensors + dot = paddle.sum(t0 * t1) + # If absolute value of dot product is almost 1, vectors are ~colinear, so use lerp + if paddle.abs(dot) > self.merge_config.slerp_dot_threshold: + return (1 - self.merge_config.slerp_alpha) * t0_copy + self.merge_config.slerp_alpha * t1_copy + + # Calculate initial angle between t0 and t1 + theta_0 = paddle.acos(dot) + sin_theta_0 = paddle.sin(theta_0) + + # Angle at timestep t + theta_t = theta_0 * self.merge_config.slerp_alpha + sin_theta_t = paddle.sin(theta_t) + + # Finish the slerp algorithm + s0 = paddle.sin(theta_0 - theta_t) / sin_theta_0 + s1 = sin_theta_t / sin_theta_0 + return s0 * t0_copy + s1 * t1_copy else: - raise NotImplementedError("Paddle Tensor is not supported yet.") + raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}") def ties(self, tensor_list): if self.merge_config.tensor_type == "np": @@ -95,7 +130,6 @@ def ties(self, tensor_list): mask_dtype = tensor_list[0].dtype weight_list = self.merge_config.weight_list tensor_list = [weight * tensor for (weight, tensor) in zip(weight_list, tensor_list)] - # Elect majority sign sign_tensor_list = [np.sign(tensor).astype(mask_dtype) for tensor in tensor_list] if self.merge_config.ties_elect_type == "sum": @@ -117,14 +151,57 @@ def ties(self, tensor_list): divisor[np.abs(divisor) < 1e-8] = 1 merge_tensor /= divisor return merge_tensor + + elif self.merge_config.tensor_type == "pd": + mask_dtype = tensor_list[0].dtype + + # Elect majority sign + majority_sign = paddle.zeros_like(tensor_list[0]) + for i, tensor in enumerate(tensor_list): + if self.merge_config.ties_elect_type == "sum": + majority_sign += tensor * self.merge_config.weight_list[i] + elif self.merge_config.ties_elect_type == "count": + majority_sign += tensor.sign() + else: + raise NotImplementedError(f"ties_elect_type: {self.merge_config.ties_elect_type} is unknown.") + majority_sign = (majority_sign >= 0).astype(mask_dtype) * 2 - 1 + + # Merge + merge_tensor = paddle.zeros_like(tensor_list[0]) + if self.merge_config.normalize: + divisor = paddle.zeros_like(tensor_list[0]) + for i, tensor in enumerate(tensor_list): + if self.merge_config.normalize: + mask = (tensor.sign() == majority_sign).astype(mask_dtype) * self.merge_config.weight_list[i] + divisor += mask + merge_tensor += mask * tensor + else: + merge_tensor += ( + (tensor.sign() == majority_sign).astype(mask_dtype) * tensor * self.merge_config.weight_list[i] + 
) + + # Normalize + if self.merge_config.normalize: + divisor = paddle.where(paddle.abs(divisor) < 1e-8, paddle.ones_like(divisor), divisor) + merge_tensor /= divisor + + return merge_tensor else: - raise NotImplementedError("Paddle Tensor is not supported yet.") + raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}") def normalize(self, t): """ Normalize a vector by its L2 norm. """ - norm_t = np.linalg.norm(t) - if norm_t > self.merge_config.slerp_normalize_eps: - t = t / norm_t - return t + if self.merge_config.tensor_type == "np": + norm_t = np.linalg.norm(t) + if norm_t > self.merge_config.slerp_normalize_eps: + t = t / norm_t + return t + elif self.merge_config.tensor_type == "pd": + norm_t = paddle.norm(t, p=2) + if norm_t > self.merge_config.slerp_normalize_eps: + t = t / norm_t + return t + else: + raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}") diff --git a/paddlenlp/mergekit/merge_model.py b/paddlenlp/mergekit/merge_model.py index 9c3f6dc21958..03684a51cf89 100644 --- a/paddlenlp/mergekit/merge_model.py +++ b/paddlenlp/mergekit/merge_model.py @@ -11,28 +11,35 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import gc import json +import math import os +import shutil from multiprocessing import Process import numpy as np import paddle +import paddle.distributed as dist from safetensors import safe_open from safetensors.numpy import save_file +from paddlenlp.peft import LoRAConfig +from paddlenlp.utils import device_guard from paddlenlp.utils.env import ( + LORA_WEIGHTS_NAME, PADDLE_MASTER_WEIGHTS_NAME, PADDLE_WEIGHTS_NAME, SAFE_MASTER_WEIGHTS_INDEX_NAME, SAFE_MASTER_WEIGHTS_NAME, + SAFE_PEFT_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_NAME, ) +from paddlenlp.utils.log import logger from paddlenlp.utils.safetensors import fast_safe_open from .merge_method import MergeMethod -from .merge_utils import divide_positions +from .merge_utils import divide_lora_key_list, divide_positions from .sparsify_method import SparsifyMethod SPARSIFY_MERGE_MAPPING = { @@ -50,8 +57,13 @@ class MergeModel: def __init__(self, merge_config): self.reset_merge_model(merge_config=merge_config) self.numpy_dtype_map = {"float32": 4, "float16": 2, "uint16": 2} + self.is_peft = False def reset_merge_model(self, merge_config=None, merge_param_dict=None): + self.is_cpu = "cpu" in paddle.device.get_device() + if not self.is_cpu: + if dist.get_world_size() > 1 and not paddle.distributed.is_initialized(): + dist.init_parallel_env() if merge_config is not None: self.merge_config = merge_config elif merge_param_dict is not None: @@ -71,178 +83,296 @@ def reset_merge_model(self, merge_config=None, merge_param_dict=None): self.merge_method = MergeMethod(merge_config, sparsify_method) def merge_model(self): + if self.merge_config.lora_model_path is not None: + self.merge_lora_model() + else: + if self.merge_config.tensor_type == "np" and not self.is_cpu: + # Avoid memory allocated on GPU + with device_guard(): + self.mergekit() + else: + self.mergekit() + self.copy_file() + + def copy_file(self): + if self.merge_config.copy_file_list is not None: + if self.merge_config.base_model_path is not None: + src_path = self.merge_config.base_model_path + else: + src_path = self.merge_config.model_path_list[0] + for file in self.merge_config.copy_file_list: + src_file = os.path.join(src_path, file) + dst_file = 
os.path.join(self.merge_config.output_path, file) + if os.path.isfile(src_file): + shutil.copy2(src_file, dst_file) + else: + logger.warning(f"Copy failed: {file} not found in {src_path}") + + def mergekit(self): + # Check model file type file_type_list = [] for model_path in self.merge_config.model_path_list: file_type_list.append(self.check_model_path(model_path)) if self.merge_config.base_model_path is not None: file_type_list.append(self.check_model_path(self.merge_config.base_model_path)) + + # Merge model (distinguish between safetensors and pdparams) if all(file_type == "safetensors" or file_type == "safetensors_without_index" for file_type in file_type_list): self.merge_safetensor_model(file_type_list) else: self.merge_mix_model(file_type_list) def merge_mix_model(self, file_type_list): + # Load model state dict state_dict_list = [] for i, model_path in enumerate(self.merge_config.model_path_list): state_dict_list.append(self.get_model_state_dict(model_path, file_type_list[i])) if self.merge_config.base_model_path is not None: state_dict_list.append(self.get_model_state_dict(self.merge_config.base_model_path, file_type_list[-1])) + if not all(state_dict_list[0].keys() == state_dict.keys() for state_dict in state_dict_list): raise ValueError("State dict keys mismatch. Please make sure you load the correct weight file") - if self.merge_config.base_model_path is not None: - base_state_dict = state_dict_list.pop() - base_file_type = file_type_list.pop() + + # Merge state dict merge_state_dict = {} - total_size = 0 - weight_map = {} - for key in state_dict_list[0].keys(): - is_bf16 = False - tensor_list = [] - for state_dict, file_type in zip(state_dict_list, file_type_list): - if file_type == "pdparams": - if str(state_dict[key].dtype) == "paddle.bfloat16": - is_bf16 = True - state_dict[key] = state_dict[key].astype("float32").numpy() + index = {"metadata": {"total_size": 0}, "weight_map": {}} + + key_list = list(state_dict_list[file_type_list.index("pdparams")].keys()) + model_num = len(state_dict_list) + rank = dist.get_rank() + positions = divide_positions(len(key_list), dist.get_world_size()) + local_keys = key_list[positions[rank] : positions[rank + 1]] + for ii in range(len(positions) - 1): + shard_file = f"{self.merge_config.merge_prefix}-{ii+1:05d}-of-{dist.get_world_size():05d}.safetensors" + for key in key_list[positions[ii] : positions[ii + 1]]: + index["weight_map"][key] = shard_file + index["metadata"]["total_size"] += int( + np.prod(state_dict_list[0][key].shape) * self.numpy_dtype_map[str(state_dict_list[0][key].dtype)] + ) + for key in local_keys: + # Tensor preprocess + is_bf16 = str(state_dict_list[0][key].dtype) == "uint16" + tensor_list = [state_dict_list[i].pop(key) for i in range(model_num)] + tensor_mem = int(np.prod(tensor_list[0].shape) * self.numpy_dtype_map[str(tensor_list[0].dtype)]) / ( + 1024**3 + ) + if self.merge_config.tensor_type == "pd" and tensor_mem > self.merge_config.max_tensor_mem: + tensor_split_list = [ + np.array_split(tensor, self.merge_config.split_pieces, axis=0) for tensor in tensor_list + ] + merge_split = [] + for sp in range(self.merge_config.split_pieces): + tensor_list = [tensor_split[sp] for tensor_split in tensor_split_list] + if is_bf16: + tensor_list = [ + paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list + ] else: - state_dict[key] = state_dict[key].numpy() - elif str(state_dict[key].dtype) == "uint16": - is_bf16 = True - state_dict[key] = paddle.to_tensor(state_dict[key], 
dtype="bfloat16").astype("float32").numpy() - tensor_list.append(state_dict[key]) - if self.merge_config.base_model_path is not None: - if base_file_type == "pdparams": - if str(base_state_dict[key].dtype) == "paddle.bfloat16": - base_state_dict[key] = base_state_dict[key].astype("float32").numpy() + tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list] + if self.merge_config.base_model_path is not None: + base_tensor = tensor_list.pop() + tensor_list = [tensor - base_tensor for tensor in tensor_list] + merge_tensor = self.merge_method.merge(tensor_list) + if self.merge_config.base_model_path is not None: + merge_tensor += base_tensor + if is_bf16: + merge_split.append(merge_tensor.astype("bfloat16").numpy()) else: - base_state_dict[key] = base_state_dict[key].numpy() - elif str(base_state_dict[key].dtype) == "uint16": - base_state_dict[key] = ( - paddle.to_tensor(base_state_dict[key], dtype="bfloat16").astype("float32").numpy() - ) - tensor_list = [tensor - base_state_dict[key] for tensor in tensor_list] - merge_state_dict[key] = self.merge_method.merge(tensor_list) - if self.merge_config.base_model_path is not None: - merge_state_dict[key] += base_state_dict[key] - # dtype==bfloat16: numpy(float32) -> paddle(float32) -> paddle(bfloat16) -> numpy(uint16) - if is_bf16: - merge_state_dict[key] = ( - paddle.to_tensor(merge_state_dict[key], dtype="float32").astype("bfloat16").numpy() - ) - total_size += np.prod(merge_state_dict[key].shape) * self.numpy_dtype_map[str(merge_state_dict[key].dtype)] - weight_map[key] = f"{self.merge_config.merge_preifx}-00001-of-00001.safetensors" - # save safetensor file + merge_split.append(merge_tensor.numpy()) + merge_state_dict[key] = np.concatenate(merge_split, axis=0) + else: + if self.merge_config.tensor_type == "pd": + if is_bf16: + tensor_list = [ + paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list + ] + else: + tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list] + elif self.merge_config.tensor_type == "np" and is_bf16: + tensor_list = [ + paddle.Tensor(tensor, zero_copy=True).astype("float32").numpy() for tensor in tensor_list + ] + + if self.merge_config.base_model_path is not None: + base_tensor = tensor_list.pop() + tensor_list = [tensor - base_tensor for tensor in tensor_list] + merge_tensor = self.merge_method.merge(tensor_list) + if self.merge_config.base_model_path is not None: + merge_tensor += base_tensor + if self.merge_config.tensor_type == "pd": + if is_bf16: + merge_state_dict[key] = merge_tensor.astype("bfloat16").numpy() + else: + merge_state_dict[key] = merge_tensor.numpy() + elif self.merge_config.tensor_type == "np" and is_bf16: + # dtype==bfloat16: numpy(float32) -> paddle(float32) -> paddle(bfloat16) -> numpy(uint16) + merge_state_dict[key] = paddle.Tensor(merge_tensor, zero_copy=True).astype("bfloat16").numpy() + + # Save safetensor file save_file( merge_state_dict, os.path.join( - self.merge_config.output_path, f"{self.merge_config.merge_preifx}-00001-of-00001.safetensors" + self.merge_config.output_path, + f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors", ), metadata={"format": "np"}, ) - # save safe index file - index = {"metadata": {"total_size": int(total_size)}, "weight_map": weight_map} - save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name()) - with open(save_index_file, "w", encoding="utf-8") as f: - content = json.dumps(index, indent=2) + "\n" - f.write(content) - # 
save merge config file - self.merge_config.save_pretrained(self.merge_config.output_path) - del state_dict_list - del merge_state_dict - if self.merge_config.base_model_path is not None: - del base_state_dict - gc.collect() + # Save index file & merge config file + if paddle.distributed.get_rank() == 0: + save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name()) + with open(save_index_file, "w", encoding="utf-8") as f: + f.write(json.dumps(index, indent=2) + "\n") + self.merge_config.save_pretrained(self.merge_config.output_path) - def get_model_state_dict(self, model_path, file_type): + def get_model_state_dict(self, model_path, file_type, key_list=None, file=None): if file_type == "safetensors": state_dict = {} with open(os.path.join(model_path, self.safe_index_name()), "r", encoding="utf-8") as f: index = json.load(f) - for key in index["weight_map"].keys(): - with fast_safe_open( - os.path.join(model_path, index["weight_map"][key]), - framework="np", - ) as f: - state_dict[key] = f.get_tensor(key) + if file is not None: + with fast_safe_open(os.path.join(model_path, file), framework="np") as f: + for k in f.keys(): + state_dict[k] = f.get_tensor(k) + elif key_list is None: + files = set(index["weight_map"].values()) + for file in files: + with fast_safe_open(os.path.join(model_path, file), framework="np") as f: + for k in f.keys(): + state_dict[k] = f.get_tensor(k) + else: + file_map = {} + for key in key_list: + if index["weight_map"][key] not in file_map: + file_map[index["weight_map"][key]] = [key] + else: + file_map[index["weight_map"][key]].append(key) + for file in file_map.keys(): + with fast_safe_open(os.path.join(model_path, file), framework="np") as f: + for k in file_map[file]: + state_dict[k] = f.get_tensor(k) elif file_type == "safetensors_without_index": state_dict = {} with fast_safe_open(os.path.join(model_path, self.safe_weight_name()), framework="numpy") as f: - for k in f.keys(): + tgt_key_list = f.keys() if key_list is None else key_list + for k in tgt_key_list: state_dict[k] = f.get_tensor(k) elif file_type == "pdparams": - state_dict = paddle.load(os.path.join(model_path, self.weight_name())) + state_dict = np.load(os.path.join(model_path, self.weight_name()), allow_pickle=True) + if "StructuredToParameterName@@" in state_dict.keys(): + state_dict.pop("StructuredToParameterName@@") + elif file_type == "lora_pdparams": + state_dict = np.load(os.path.join(model_path, LORA_WEIGHTS_NAME), allow_pickle=True) + elif file_type == "lora_safetensors": + state_dict = {} + with open(os.path.join(model_path, SAFE_PEFT_WEIGHTS_INDEX_NAME), "r", encoding="utf-8") as f: + index = json.load(f) + files = set(index["weight_map"].values()) + for file in files: + with fast_safe_open(os.path.join(model_path, file), framework="np") as f: + for k in f.keys(): + state_dict[k] = f.get_tensor(k) else: raise ValueError(f"Unsupported file_type: {file_type}") return state_dict - def create_safetensor_index(self, model_path): - weight_map = {} - total_size = 0 - - with safe_open(os.path.join(model_path, self.safe_weight_name()), framework="numpy") as f: - for key in f.keys(): - tensor = f.get_tensor(key) - total_size += np.prod(tensor.shape) * self.numpy_dtype_map[str(tensor.dtype)] - weight_map[key] = self.safe_weight_name() - index = {"metadata": {"total_size": total_size}, "weight_map": weight_map} + def get_safetensor_index(self, model_path, file_type): + if file_type == "safetensors": + with open(os.path.join(model_path, self.safe_index_name()), "r", 
encoding="utf-8") as f: + index = json.load(f) + elif file_type == "safetensors_without_index": + weight_map = {} + total_size = 0 + with safe_open(os.path.join(model_path, self.safe_weight_name()), framework="numpy") as f: + for key in f.keys(): + tensor = f.get_tensor(key) + total_size += int(np.prod(tensor.shape) * self.numpy_dtype_map[str(tensor.dtype)]) + weight_map[key] = self.safe_weight_name() + index = {"metadata": {"total_size": total_size}, "weight_map": weight_map} return index def merge_safetensor_model(self, file_type_list): - # load index + # Load index index_list = [] - model_path_list = self.merge_config.model_path_list + model_path_list = self.merge_config.model_path_list.copy() if self.merge_config.base_model_path is not None: model_path_list += [self.merge_config.base_model_path] for model_path, file_type in zip(model_path_list, file_type_list): - if file_type == "safetensors": - with open(os.path.join(model_path, self.safe_index_name()), "r", encoding="utf-8") as f: - index_list.append(json.load(f)) - else: - index = self.create_safetensor_index(model_path) - index_list.append(index) - # check index + index_list.append(self.get_safetensor_index(model_path, file_type)) + + # Check index if not all(index_list[0]["metadata"]["total_size"] == index["metadata"]["total_size"] for index in index_list): raise ValueError("Weights total_size mismatch. Please make sure you load the correct weight file") if not all(index_list[0]["weight_map"].keys() == index["weight_map"].keys() for index in index_list): raise ValueError("Weights weight_map mismatch. Please make sure you load the correct weight file") - # init new index + # Initialize new index index = {} index["metadata"] = index_list[0]["metadata"] index["metadata"]["total_size"] = int(index["metadata"]["total_size"]) index["weight_map"] = {} - - # Multi-process update + num = self.merge_config.n_process if self.is_cpu else dist.get_world_size() key_list = list(index_list[0]["weight_map"].keys()) - positions = divide_positions(len(key_list), self.merge_config.n_process) - threads = [] - if self.merge_config.tensor_type == "np": - target = self.shard_merge_np + positions = divide_positions(len(key_list), num) + if not self.is_cpu: + rank = dist.get_rank() + file_list = sorted(list(set(index_list[0]["weight_map"].values()))) + if file_type_list[0] == "safetensors" and len(file_list) >= num: + positions = divide_positions(len(file_list), num) + index["weight_map"] = index_list[0]["weight_map"] + file_map = {} + for key in key_list: + if index["weight_map"][key] not in file_map: + file_map[index["weight_map"][key]] = [key] + else: + file_map[index["weight_map"][key]].append(key) + for shard_file in file_list[positions[rank] : positions[rank + 1]]: + if self.merge_config.tensor_type == "np": + self.shard_merge_np(file_map[shard_file], index_list, shard_file) + else: + self.shard_merge_pd(file_map[shard_file], index_list, shard_file) + else: + local_keys = key_list[positions[rank] : positions[rank + 1]] + shard_file = ( + f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors" + ) + if self.merge_config.tensor_type == "np": + self.shard_merge_np(local_keys, index_list, shard_file) + else: + self.shard_merge_pd(local_keys, index_list, shard_file) + + for i in range(len(positions) - 1): + shard_file = ( + f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{dist.get_world_size():05d}.safetensors" + ) + for k in key_list[positions[i] : positions[i + 1]]: + index["weight_map"][k] = shard_file else: - target 
= self.shard_merge_pd - for i in range(len(positions) - 1): - shard_file = f"{self.merge_config.merge_preifx}-{i+1:05d}-of-{self.merge_config.n_process:05d}.safetensors" - t = Process( - target=target, - args=( - key_list[positions[i] : positions[i + 1]], # key_list - index_list, # index_list - shard_file, # shard_file name - ), - ) - threads.append(t) - for k in key_list[positions[i] : positions[i + 1]]: - index["weight_map"][k] = shard_file - for t in threads: - t.start() - for t in threads: - t.join() - - # save safe index file - save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name()) - with open(save_index_file, "w", encoding="utf-8") as f: - content = json.dumps(index, indent=2) + "\n" - f.write(content) - self.merge_config.save_pretrained(self.merge_config.output_path) + threads = [] + for i in range(len(positions) - 1): + shard_file = ( + f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{self.merge_config.n_process:05d}.safetensors" + ) + t = Process( + target=self.shard_merge_np if self.merge_config.tensor_type == "np" else self.shard_merge_pd, + args=( + key_list[positions[i] : positions[i + 1]], # key_list + index_list, # index_list + shard_file, # shard_file name + ), + ) + threads.append(t) + for k in key_list[positions[i] : positions[i + 1]]: + index["weight_map"][k] = shard_file + + for t in threads: + t.start() + for t in threads: + t.join() + # Save safe index file + if paddle.distributed.get_rank() == 0: + save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name()) + with open(save_index_file, "w", encoding="utf-8") as f: + f.write(json.dumps(index, indent=2) + "\n") def shard_merge_np( self, @@ -253,14 +383,13 @@ def shard_merge_np( merge_state_dict = {} for k in key_list: tensor_list = [] - for i, model_path in enumerate(self.merge_config.model_path_list): with fast_safe_open(os.path.join(model_path, index_list[i]["weight_map"][k]), framework="np") as w: tensor = w.get_tensor(k) dtype = tensor.dtype # dtype==bfloat16: numpy(uint16) -> paddle(bfloat16) -> paddle(float32) -> numpy(float32) if tensor.dtype == np.uint16: - tensor = paddle.to_tensor(tensor, dtype="bfloat16").astype("float32").numpy() + tensor = paddle.Tensor(tensor, zero_copy=True).astype("float32").numpy() tensor_list.append(tensor) if self.merge_config.base_model_path is not None: with fast_safe_open( @@ -269,25 +398,19 @@ def shard_merge_np( ) as w: base_tensor = w.get_tensor(k) if base_tensor.dtype == np.uint16: - base_tensor = paddle.to_tensor(base_tensor, dtype="bfloat16").astype("float32").numpy() + base_tensor = paddle.Tensor(base_tensor, zero_copy=True).astype("float32").numpy() tensor_list = [tensor - base_tensor for tensor in tensor_list] merge_state_dict[k] = self.merge_method.merge(tensor_list) if self.merge_config.base_model_path is not None: merge_state_dict[k] += base_tensor # dtype==bfloat16: numpy(float32) -> paddle(float32) -> paddle(bfloat16) -> numpy(uint16) if dtype == np.uint16: - merge_state_dict[k] = paddle.to_tensor(merge_state_dict[k], dtype="float32").astype("bfloat16").numpy() - del tensor_list - if self.merge_config.base_model_path is not None: - - del base_tensor + merge_state_dict[k] = paddle.Tensor(merge_state_dict[k], zero_copy=True).astype("bfloat16").numpy() save_file( merge_state_dict, os.path.join(self.merge_config.output_path, shard_file), metadata={"format": "np"}, ) - del merge_state_dict - gc.collect() def shard_merge_pd( self, @@ -295,18 +418,69 @@ def shard_merge_pd( index_list, shard_file, ): - raise 
NotImplementedError("Not support paddle tensors.") + merge_state_dict = {} + for k in key_list: + tensor_list = [] + for i, model_path in enumerate(self.merge_config.model_path_list): + with fast_safe_open(os.path.join(model_path, index_list[i]["weight_map"][k]), framework="np") as w: + tensor_list.append(w.get_tensor(k)) + if self.merge_config.base_model_path is not None: + with fast_safe_open( + os.path.join(self.merge_config.base_model_path, index_list[-1]["weight_map"][k]), + framework="np", + ) as w: + tensor_list.append(w.get_tensor(k)) + is_bf16 = str(tensor_list[0].dtype) == "uint16" + tensor_mem = int(np.prod(tensor_list[0].shape) * self.numpy_dtype_map[str(tensor_list[0].dtype)]) / ( + 1024**3 + ) + if tensor_mem > self.merge_config.max_tensor_mem: + tensor_split_list = [ + np.array_split(tensor, self.merge_config.split_pieces, axis=0) for tensor in tensor_list + ] + merge_split = [] + for sp in range(self.merge_config.split_pieces): + tensor_list = [tensor_split[sp] for tensor_split in tensor_split_list] + if is_bf16: + tensor_list = [ + paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list + ] + else: + tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list] + if self.merge_config.base_model_path is not None: + base_tensor = tensor_list.pop() + tensor_list = [tensor - base_tensor for tensor in tensor_list] + merge_tensor = self.merge_method.merge(tensor_list) + if self.merge_config.base_model_path is not None: + merge_tensor += base_tensor + if is_bf16: + merge_split.append(merge_tensor.astype("bfloat16").numpy()) + else: + merge_split.append(merge_tensor.numpy()) + merge_state_dict[k] = np.concatenate(merge_split, axis=0) + else: + if is_bf16: + tensor_list = [paddle.Tensor(tensor, zero_copy=True).astype("float32") for tensor in tensor_list] + else: + tensor_list = [paddle.Tensor(tensor, zero_copy=True) for tensor in tensor_list] + if self.merge_config.base_model_path is not None: + base_tensor = tensor_list.pop() + tensor_list = [tensor - base_tensor for tensor in tensor_list] + merge_tensor = self.merge_method.merge(tensor_list) + if self.merge_config.base_model_path is not None: + merge_tensor += base_tensor + if is_bf16: + merge_state_dict[k] = merge_tensor.astype("bfloat16").numpy() + else: + merge_state_dict[k] = merge_tensor.numpy() + save_file( + merge_state_dict, + os.path.join(self.merge_config.output_path, shard_file), + metadata={"format": "np"}, + ) - def check_model_path(self, model_path): + def check_model_path(self, model_path, lora_merge=False): if os.path.exists(os.path.join(model_path, self.safe_index_name())): - with open(os.path.join(model_path, self.safe_index_name()), "r", encoding="utf-8") as f: - index = json.load(f) - safe_file_list = list(set(index["weight_map"][k] for k in index["weight_map"])) - for i in range(len(safe_file_list)): - if os.path.exists(os.path.join(model_path, safe_file_list[i])): - continue - else: - ValueError(f"Not found {os.path.join(model_path, safe_file_list[i])}.") file_type = "safetensors" elif os.path.exists(os.path.join(model_path, self.safe_weight_name())): file_type = "safetensors_without_index" @@ -314,24 +488,230 @@ def check_model_path(self, model_path): file_type = "pdparams" else: raise ValueError( - f"Please check path {model_path} is correct. Support safetensors and pdparams in complete parameter format (not TP or PP format) only." + f"Please check path {model_path} is correct. 
Support safetensors and pdparams only in complete parameter format (not TP or PP format) only." + ) + return file_type + + def check_lora_model_path(self, model_path): + if os.path.exists(os.path.join(model_path, SAFE_PEFT_WEIGHTS_INDEX_NAME)): + file_type = "lora_safetensors" + elif os.path.exists(os.path.join(model_path, LORA_WEIGHTS_NAME)): + file_type = "lora_pdparams" + else: + raise ValueError( + f"Please check lora path {model_path} is correct. Support safetensors and pdparams only in complete parameter format (not TP or PP format) only." ) return file_type def weight_name(self): - if self.merge_config.merge_preifx == "model": + if self.merge_config.merge_prefix == "model": return PADDLE_WEIGHTS_NAME else: return PADDLE_MASTER_WEIGHTS_NAME def safe_weight_name(self): - if self.merge_config.merge_preifx == "model": + if self.merge_config.merge_prefix == "model": return SAFE_WEIGHTS_NAME else: return SAFE_MASTER_WEIGHTS_NAME def safe_index_name(self): - if self.merge_config.merge_preifx == "model": + if self.merge_config.merge_prefix == "model": return SAFE_WEIGHTS_INDEX_NAME else: return SAFE_MASTER_WEIGHTS_INDEX_NAME + + def merge_lora_model(self): + # Check model file type + file_type_list = [] + file_type_list.append(self.check_lora_model_path(self.merge_config.lora_model_path)) + file_type_list.append(self.check_model_path(self.merge_config.base_model_path)) + # Merge model (distinguish between safetensors and pdparams) + if "safetensors" in file_type_list[-1]: + self.merge_safetensor_lora_model(file_type_list) + else: + self.merge_pdparams_lora_model(file_type_list) + + def shard_lora_merge(self, base_index, shard_file, lora_config, file_type_list, key_list=None, file=None): + merge_state_dict = {} + base_state_dict = self.get_model_state_dict( + self.merge_config.base_model_path, file_type_list[1], key_list=key_list, file=file + ) + lora_state_dict = self.get_model_state_dict(self.merge_config.lora_model_path, file_type_list[0]) + if not lora_config.rslora: + scaling = lora_config.lora_alpha / lora_config.r + else: + scaling = lora_config.lora_alpha / math.sqrt(lora_config.r) + + model_key_list = list(base_state_dict.keys()) + for k in model_key_list: + if lora_state_dict is not None and k in lora_state_dict.keys(): + tensor = lora_state_dict.pop(k) + else: + tensor = base_state_dict.pop(k) + if "weight" in k: + lora_A_key, lora_B_key = k.replace("weight", "lora_A"), k.replace("weight", "lora_B") + lora_A_tensor = None + if lora_state_dict is not None and lora_A_key in lora_state_dict.keys(): + lora_A_tensor, lora_B_tensor = lora_state_dict.pop(lora_A_key), lora_state_dict.pop(lora_B_key) + is_bf16 = tensor.dtype == np.uint16 + tensor = paddle.Tensor(tensor, zero_copy=True) + lora_A_tensor = paddle.Tensor(lora_A_tensor, zero_copy=True) + lora_B_tensor = paddle.Tensor(lora_B_tensor, zero_copy=True) + if self.is_cpu and is_bf16: + tensor = tensor.astype("float32") + lora_A_tensor = lora_A_tensor.astype("float32") + lora_B_tensor = lora_B_tensor.astype("float32") + tensor += lora_A_tensor @ lora_B_tensor * scaling + tensor = tensor.astype("bfloat16").numpy() + else: + tensor += lora_A_tensor @ lora_B_tensor * scaling + tensor = tensor.numpy() + merge_state_dict[k] = tensor + save_file( + merge_state_dict, + os.path.join(self.merge_config.output_path, shard_file), + metadata={"format": "np"}, + ) + + def merge_safetensor_lora_model(self, file_type_list): + # Load index + base_index = self.get_safetensor_index(self.merge_config.base_model_path, file_type_list[-1]) + lora_config = 
LoRAConfig.from_pretrained(self.merge_config.lora_model_path) + + # Initialize new index + index = {} + index["metadata"] = base_index["metadata"] + index["metadata"]["total_size"] = int(index["metadata"]["total_size"]) + index["weight_map"] = {} + + # LoRA Merge + key_list = list(base_index["weight_map"].keys()) + if not self.is_cpu: + rank = dist.get_rank() + file_list = sorted(list(set(base_index["weight_map"].values()))) + if file_type_list[-1] == "safetensors" and len(file_list) >= dist.get_world_size(): + positions = divide_positions(len(file_list), dist.get_world_size()) + for shard_file in file_list[positions[rank] : positions[rank + 1]]: + self.shard_lora_merge(base_index, shard_file, lora_config, file_type_list, file=shard_file) + index["weight_map"] = base_index["weight_map"] + else: + divided_key_list = divide_lora_key_list(key_list, dist.get_world_size(), lora_config) + local_keys = divided_key_list[rank] + shard_file = ( + f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors" + ) + self.shard_lora_merge(base_index, shard_file, lora_config, file_type_list, key_list=local_keys) + for i in range(len(divided_key_list)): + shard_file = ( + f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{dist.get_world_size():05d}.safetensors" + ) + for k in divided_key_list[i]: + index["weight_map"][k] = shard_file + else: + divided_key_list = divide_lora_key_list(key_list, self.merge_config.n_process, lora_config) + threads = [] + for i in range(len(divided_key_list)): + shard_file = ( + f"{self.merge_config.merge_prefix}-{i+1:05d}-of-{self.merge_config.n_process:05d}.safetensors" + ) + t = Process( + target=self.shard_lora_merge, + args=( + base_index, # base index + shard_file, # shard_file name + lora_config, + file_type_list, + divided_key_list[i], # key_list + ), + ) + threads.append(t) + for k in divided_key_list[i]: + index["weight_map"][k] = shard_file + + for t in threads: + t.start() + for t in threads: + t.join() + + # Save safe index file + if paddle.distributed.get_rank() == 0: + save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name()) + with open(save_index_file, "w", encoding="utf-8") as f: + f.write(json.dumps(index, indent=2) + "\n") + self.merge_config.save_pretrained(self.merge_config.output_path) + + def merge_pdparams_lora_model(self, file_type_list): + # Load & check state dict + lora_state_dict = self.get_model_state_dict(self.merge_config.lora_model_path, file_type_list[0]) + base_state_dict = self.get_model_state_dict(self.merge_config.base_model_path, file_type_list[1]) + for key in lora_state_dict.keys(): + if "lora_A" in key: + if key.replace("lora_A", "lora_B") not in lora_state_dict.keys(): + raise ValueError(f"{key} is not paired with {key.replace('lora_A', 'lora_B')}") + if key.replace("lora_A", "weight") not in base_state_dict.keys(): + raise ValueError(f'{key.replace("lora_A", "weight")} does not exist in base model.') + + # Load lora config + lora_config = LoRAConfig.from_pretrained(self.merge_config.lora_model_path) + if not lora_config.rslora: + scaling = lora_config.lora_alpha / lora_config.r + else: + scaling = lora_config.lora_alpha / math.sqrt(lora_config.r) + + # Create index + merge_state_dict = {} + index = {"metadata": {"total_size": 0}, "weight_map": {}} + key_list = list(base_state_dict.keys()) + positions = divide_positions(len(key_list), dist.get_world_size()) + for ii in range(len(positions) - 1): + shard_file = 
f"{self.merge_config.merge_prefix}-{ii+1:05d}-of-{dist.get_world_size():05d}.safetensors" + for key in key_list[positions[ii] : positions[ii + 1]]: + index["weight_map"][key] = shard_file + index["metadata"]["total_size"] += int( + np.prod(base_state_dict[key].shape) * self.numpy_dtype_map[str(base_state_dict[key].dtype)] + ) + + # Merge state dict + rank = dist.get_rank() + local_keys = key_list[positions[rank] : positions[rank + 1]] + for k in local_keys: + if k in lora_state_dict.keys(): + tensor = lora_state_dict[k] + else: + tensor = base_state_dict[k] + if "weight" in k: + lora_A_key, lora_B_key = k.replace("weight", "lora_A"), k.replace("weight", "lora_B") + if lora_A_key in lora_state_dict.keys(): + lora_A_tensor = lora_state_dict[lora_A_key] + lora_B_tensor = lora_state_dict[lora_B_key] + is_bf16 = tensor.dtype == np.uint16 + tensor = paddle.Tensor(tensor, zero_copy=True) + lora_A_tensor = paddle.Tensor(lora_A_tensor, zero_copy=True) + lora_B_tensor = paddle.Tensor(lora_B_tensor, zero_copy=True) + if self.is_cpu and is_bf16: + tensor = tensor.astype("float32") + lora_A_tensor = lora_A_tensor.astype("float32") + lora_B_tensor = lora_B_tensor.astype("float32") + tensor += lora_A_tensor @ lora_B_tensor * scaling + tensor = tensor.astype("bfloat16") + else: + tensor += lora_A_tensor @ lora_B_tensor * scaling + tensor = tensor.numpy() + merge_state_dict[k] = tensor + + # Save safetensor file + save_file( + merge_state_dict, + os.path.join( + self.merge_config.output_path, + f"{self.merge_config.merge_prefix}-{rank+1:05d}-of-{dist.get_world_size():05d}.safetensors", + ), + metadata={"format": "np"}, + ) + # Save index file & merge config file + if paddle.distributed.get_rank() == 0: + save_index_file = os.path.join(self.merge_config.output_path, self.safe_index_name()) + with open(save_index_file, "w", encoding="utf-8") as f: + f.write(json.dumps(index, indent=2) + "\n") + self.merge_config.save_pretrained(self.merge_config.output_path) diff --git a/paddlenlp/mergekit/merge_utils.py b/paddlenlp/mergekit/merge_utils.py index c96a9ad2fe76..5e0fcf80741b 100644 --- a/paddlenlp/mergekit/merge_utils.py +++ b/paddlenlp/mergekit/merge_utils.py @@ -11,6 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
+import re def divide_positions(m, n): @@ -29,3 +30,44 @@ def divide_positions(m, n): positions.append(positions[-1] + base_value) positions.append(m) return positions + + +def divide_lora_key_list(key_list, n, lora_config): + lora_key = [] + other_key = [] + for module_name in key_list: + if ( + any(re.fullmatch(target_module, module_name) for target_module in lora_config.target_modules) + and "weight" in module_name + ): + lora_key.append(module_name) + else: + other_key.append(module_name) + lora_positions = divide_positions(len(lora_key), n) + other_positions = divide_positions(len(other_key), n) + divided_key_list = [] + for i in range(len(lora_positions) - 1): + divided_key = ( + lora_key[lora_positions[i] : lora_positions[i + 1]] + + other_key[other_positions[i] : other_positions[i + 1]] + ) + divided_key_list.append(divided_key) + return divided_key_list + + +def divide_safetensor_key_list(weight_map, n): + file_map = {} + for key in weight_map: + if weight_map[key] in file_map: + file_map[weight_map[key]].append(key) + else: + file_map[weight_map[key]] = [key] + file_list = list(file_map.keys()) + p = divide_positions(len(file_list), n) + key_list = [] + positions = [0] + for i in range(n): + for file in file_list[p[i] : p[i + 1]]: + key_list += file_map[file] + positions.append(len(key_list)) + return key_list, positions diff --git a/paddlenlp/mergekit/sparsify_method.py b/paddlenlp/mergekit/sparsify_method.py index 97180d79d08e..c13e6b97b8d6 100644 --- a/paddlenlp/mergekit/sparsify_method.py +++ b/paddlenlp/mergekit/sparsify_method.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. import numpy as np +import paddle class SparsifyMethod: @@ -32,17 +33,20 @@ def sparsify(self, tensor): def dare(self, tensor): if self.merge_config.tensor_type == "np": - mask = np.random.binomial(1, self.merge_config.reserve_p, size=tensor.shape).astype(tensor.dtype) - tensor *= mask + tensor *= (np.random.rand(*tensor.shape) < self.merge_config.reserve_p).astype(tensor.dtype) if self.merge_config.rescale: tensor /= self.merge_config.reserve_p return tensor + elif self.merge_config.tensor_type == "pd": + mode = "upscale_in_train" if self.merge_config.rescale else "downscale_in_infer" + tensor = paddle.nn.functional.dropout(tensor, p=1 - self.merge_config.reserve_p, mode=mode, training=True) + return tensor else: - raise NotImplementedError("Paddle Tensor is not supported yet.") + raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}") def magprune(self, tensor): if self.merge_config.tensor_type == "np": - if np.all(tensor == 0): + if not np.any(tensor != 0): return tensor drop_p = 1 - self.merge_config.reserve_p # 1: ranking(descending) @@ -63,13 +67,30 @@ def magprune(self, tensor): if self.merge_config.rescale: tensor /= 1 - probs return tensor + elif self.merge_config.tensor_type == "pd": + if not paddle.any(tensor != 0): + return tensor + drop_p = 1 - self.merge_config.reserve_p + abs_tensor = paddle.abs(tensor) + sorted_indices = paddle.argsort(-abs_tensor.flatten()) + + probs = paddle.zeros_like(sorted_indices, dtype="float32") + probs = paddle.scatter(probs, sorted_indices, paddle.arange(tensor.numel(), dtype="float32")) + probs = probs.reshape(tensor.shape) + probs = probs * self.merge_config.epsilon / tensor.numel() + p_min = drop_p - self.merge_config.epsilon / 2 + probs += p_min + mask = paddle.bernoulli(1 - probs).astype(tensor.dtype) + tensor *= mask + if self.merge_config.rescale: + tensor /= 1 - 
probs + return tensor else: - raise NotImplementedError("Paddle Tensor is not supported yet.") + raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}") def trim(self, tensor): if self.merge_config.tensor_type == "np": shape = tensor.shape - org_sum = np.sum(np.abs(tensor)) tensor = tensor.flatten() abs_tensor = np.abs(tensor) threshold = np.quantile(abs_tensor, 1 - self.merge_config.reserve_p) @@ -83,5 +104,15 @@ def trim(self, tensor): else: tensor[abs_tensor < threshold] = 0 return tensor.reshape(shape) + elif self.merge_config.tensor_type == "pd": + abs_tensor = paddle.abs(tensor) + threshold = paddle.quantile(abs_tensor, 1 - self.merge_config.reserve_p) + tensor = paddle.where(abs_tensor < threshold, paddle.zeros_like(tensor), tensor) + if self.merge_config.rescale: + org_sum = paddle.sum(abs_tensor) + new_sum = paddle.sum(paddle.abs(tensor)) + if org_sum >= 1e-8 and new_sum >= 1e-8: + tensor *= org_sum / new_sum + return tensor else: - raise NotImplementedError("Paddle Tensor is not supported yet.") + raise ValueError(f"Unkonwn tensor type {self.merge_config.tensor_type}") diff --git a/paddlenlp/peft/lora/lora_model.py b/paddlenlp/peft/lora/lora_model.py index 46f0d19a19f1..3d1270d807ef 100644 --- a/paddlenlp/peft/lora/lora_model.py +++ b/paddlenlp/peft/lora/lora_model.py @@ -46,7 +46,7 @@ from ...utils.env import LORA_WEIGHTS_NAME, SAFE_PEFT_WEIGHTS_INDEX_NAME from ...utils.log import logger from ...utils.tools import get_env_device -from .lora_config import LoRAConfig +from .lora_config import LoRAConfig, LoRAAutoConfig def get_lora_layers(): @@ -327,12 +327,18 @@ def set_state_dict(self, state_dict): model_state_dict = self.model.state_dict() if self.lora_config.loraga: - def process_split_and_assign(name, concat_tensor, axis, init_dict, state_dict): + def process_split_and_assign(name, concat_tensor, init_dict, state_dict): + if "lora_A" in name: + axis = 1 + else: + axis = 0 if isinstance(concat_tensor, np.ndarray): final_lora, init_lora = np.split(concat_tensor, 2, axis=axis) init_lora = paddle.to_tensor(init_lora) else: final_lora, init_lora = paddle.split(concat_tensor, 2, axis=axis) + if "lora_B" in name: + init_lora *= -1 init_dict[name] = init_lora state_dict[name] = final_lora return init_lora @@ -341,13 +347,13 @@ def process_split_and_assign(name, concat_tensor, axis, init_dict, state_dict): if "lora_A" in name: concat_lora_A = state_dict[name] init_loraA = process_split_and_assign( - name, concat_lora_A, axis=1, init_dict=self.loraga_init_dict, state_dict=state_dict + name, concat_lora_A, init_dict=self.loraga_init_dict, state_dict=state_dict ) loraB_name = name.replace("lora_A", "lora_B") concat_lora_B = state_dict[loraB_name] init_loraB = process_split_and_assign( - loraB_name, concat_lora_B, axis=0, init_dict=self.loraga_init_dict, state_dict=state_dict + loraB_name, concat_lora_B, init_dict=self.loraga_init_dict, state_dict=state_dict ) base_name = name.replace("lora_A", "weight") @@ -437,7 +443,10 @@ def save_pretrained(self, save_directory: str, merge_tensor_parallel: bool = Fal ), f"Saving directory ({save_directory}) should be a directory, not a file" os.makedirs(save_directory, exist_ok=True) - lora_config_to_save = LoRAConfig(**self.lora_config.to_dict()) + if isinstance(self.lora_config, LoRAConfig): + lora_config_to_save = LoRAConfig(**self.lora_config.to_dict()) + else: + lora_config_to_save = LoRAAutoConfig(**self.lora_config.to_dict()) trainable_state_dict = self.get_trainable_state_dict(concat_init_lora=lora_config_to_save.loraga) @@ 
-690,7 +699,7 @@ def get_trainable_state_dict(self, concat_init_lora=False): if "lora_A" in name: trainable_state_dict[name] = paddle.concat([weight, self.loraga_init_dict[name]], axis=1) else: - trainable_state_dict[name] = paddle.concat([weight, self.loraga_init_dict[name]], axis=0) + trainable_state_dict[name] = paddle.concat([weight, -self.loraga_init_dict[name]], axis=0) else: trainable_state_dict[name] = weight diff --git a/paddlenlp/peft/lora/loraga_utils.py b/paddlenlp/peft/lora/loraga_utils.py index 72b4baac1de2..c7079b6bf456 100644 --- a/paddlenlp/peft/lora/loraga_utils.py +++ b/paddlenlp/peft/lora/loraga_utils.py @@ -16,6 +16,13 @@ import paddle.distributed as dist from paddle.distributed import fleet +try: + from paddle.distributed.fleet.utils.sequence_parallel_utils import ( + register_sequence_parallel_allreduce_hooks, + ) +except: + pass + from paddlenlp.peft import LoRAModel from paddlenlp.peft.lora.lora_layers import ( ColumnParallelLoRALinear, @@ -83,6 +90,11 @@ def estimate_gradient(self, model: PretrainedModel): def _wrap_model(self, model): """Wrap Model without optimizer, support dp, tp and sharding""" + if self.args.tensor_parallel_degree > 1 and self.args.sequence_parallel: + register_sequence_parallel_allreduce_hooks( + model, self.args.gradient_accumulation_steps, self.args.fuse_sequence_parallel_allreduce + ) + in_pipeline_parallel_mode = self.args.pipeline_parallel_degree > 1 in_sharding_parallel_mode = self.sharding is not None in_tensor_parallel_mode = self.args.tensor_parallel_degree > 1 diff --git a/paddlenlp/quantization/quantization_config.py b/paddlenlp/quantization/quantization_config.py index dda0ef0edcf3..a202484f0039 100644 --- a/paddlenlp/quantization/quantization_config.py +++ b/paddlenlp/quantization/quantization_config.py @@ -52,6 +52,7 @@ def __init__( weight_double_quant_block_size=256, weight_quant_method="abs_max_channel_wise", act_quant_method="abs_max", + **kwargs, ): if weight_quantize_algo is not None and weight_quantize_algo not in [ "weight_only_int8", diff --git a/paddlenlp/quantization/quantization_utils.py b/paddlenlp/quantization/quantization_utils.py index fe46efd2a2fa..a12bebd89d36 100644 --- a/paddlenlp/quantization/quantization_utils.py +++ b/paddlenlp/quantization/quantization_utils.py @@ -23,6 +23,7 @@ from paddle.nn.quant import weight_quantize from ..utils.log import logger +from ..utils.memory_utils import empty_device_cache from .quantization_linear import ( ColumnParallelQuantizationLinear, QuantizationLinear, @@ -150,7 +151,7 @@ def convert_to_quantize_state_dict_without_check(state_dict, quantization_linear state_dict.update(qlora_state_dict) del target_weight gc.collect() - paddle.device.cuda.empty_cache() + empty_device_cache() return state_dict diff --git a/paddlenlp/server/predictor.py b/paddlenlp/server/predictor.py index 45d803e4b13b..226c15b54fba 100644 --- a/paddlenlp/server/predictor.py +++ b/paddlenlp/server/predictor.py @@ -21,6 +21,7 @@ import paddle +from ..utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX from ..utils.log import logger @@ -40,13 +41,15 @@ def __init__(self, model_path, precision, device): def _get_default_static_model_path(self): # The model path had the static_model_path - static_model_path = os.path.join(self._model_path, self._default_static_model_path, "inference.pdmodel") + static_model_path = os.path.join( + self._model_path, self._default_static_model_path, f"inference{PADDLE_INFERENCE_MODEL_SUFFIX}" + ) if os.path.exists(static_model_path): return 
os.path.join(self._model_path, self._default_static_model_path, "inference") for file_name in os.listdir(self._model_path): # FIXME(wawltor) The path maybe not correct - if file_name.count(".pdmodel"): - return os.path.join(self._model_path, file_name[:-8]) + if file_name.count(PADDLE_INFERENCE_MODEL_SUFFIX): + return os.path.join(self._model_path, file_name[: -len(PADDLE_INFERENCE_MODEL_SUFFIX)]) return None def _is_int8_model(self, model_path): @@ -110,7 +113,10 @@ def _prepare_paddle_mode(self, static_model_path): """ Construct the input data and predictor in the PaddlePaddele static mode. """ - self._config = paddle.inference.Config(static_model_path + ".pdmodel", static_model_path + ".pdiparams") + self._config = paddle.inference.Config( + static_model_path + PADDLE_INFERENCE_MODEL_SUFFIX, + static_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX, + ) self._config.disable_glog_info() if paddle.get_device() == "cpu": self._config.disable_gpu() @@ -146,7 +152,7 @@ def _prepare_onnx_mode(self, static_model_path): os.mkdir(onnx_dir) float_onnx_file = os.path.join(onnx_dir, "model.onnx") if not os.path.exists(float_onnx_file): - model_path = static_model_path + ".pdmodel" + model_path = static_model_path + PADDLE_INFERENCE_MODEL_SUFFIX params_file = static_model_path + ".pdiparams" onnx_model = paddle2onnx.command.c_paddle_to_onnx( model_file=model_path, params_file=params_file, opset_version=13, enable_onnx_checker=True diff --git a/paddlenlp/taskflow/information_extraction.py b/paddlenlp/taskflow/information_extraction.py index fac8d7231395..19459187a3b8 100644 --- a/paddlenlp/taskflow/information_extraction.py +++ b/paddlenlp/taskflow/information_extraction.py @@ -12,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
- import base64 import json import os @@ -25,7 +24,14 @@ from ..datasets import load_dataset from ..layers import GlobalPointerForEntityExtraction, GPLinkerForRelationExtraction -from ..transformers import UIE, UIEM, UIEX, AutoModel, AutoTokenizer +from ..transformers import ( + UIE, + UIEM, + UIEX, + AutoModel, + AutoModelForCausalLM, + AutoTokenizer, +) from ..utils.doc_parser import DocParser from ..utils.env import CONFIG_NAME, LEGACY_CONFIG_NAME from ..utils.ie_utils import map_offset, pad_image_data @@ -115,6 +121,300 @@ def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length return max_length +LLM_IE_PROMPT = """你是一个阅读理解专家,请提取所给句子与问题,提取实体。请注意,如果存在实体,则一定在原句中逐字出现,请输出对应实体的原文,不要进行额外修改;如果无法提取,请输出“无相应实体”。 +**句子开始** +{sentence} +**句子结束** +**问题开始** +{prompt} +**问题结束** +**回答开始** +""" + + +class UIELLMTask(Task): + def __init__(self, task, model, schema, **kwargs): + super().__init__(task=task, model=model, **kwargs) + self._dtype = kwargs.get("dtype", "float16") + self.kwargs["generation_task"] = task + self._tgt_length = kwargs.get("tgt_length", 50) + # Token max length + self._max_seq_length = kwargs.get("max_seq_length", 512) + self._top_k = kwargs.get("top_k", 1) + self._top_p = kwargs.get("top_p", 1.0) + self._temperature = kwargs.get("temperature", 1.0) + self._decode_strategy = kwargs.get("decode_strategy", "greedy_search") + self._num_return_sequences = kwargs.get("num_return_sequences", 1) + self._prompt = LLM_IE_PROMPT + + self._construct_tokenizer(model) + self.set_schema(schema) + self._construct_model(model) + self._construct_input_spec() + + if not schema: + logger.warning( + "The schema has not been set yet, please set a schema via set_schema(). " + "More details about the setting of schema please refer to https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/taskflow_text.md" + ) + self._schema_tree = None + else: + self.set_schema(schema) + + self._is_en = False + + def _construct_model(self, model): + """ + Construct the inference model for the predictor. + """ + model_instance = AutoModelForCausalLM.from_pretrained(model, dtype=self._infer_precision) + self._model = model_instance + self._model.eval() + + def _construct_tokenizer(self, model): + """ + Construct the tokenizer for the predictor. + """ + self._tokenizer = AutoTokenizer.from_pretrained(model) + + def _batchify(self, data, batch_size): + """ + Generate input batches. + """ + # Separates data into some batches. + one_batch = [] + for example in data: + one_batch.append(example) + if len(one_batch) == batch_size: + yield one_batch + one_batch = [] + if one_batch: + yield one_batch + + def _preprocess(self, inputs, padding=True, add_special_tokens=True): + """ + Transform the raw text to the model inputs, two steps involved: + 1) Transform the raw text to token ids. + 2) Generate the other model inputs from the raw text and token ids. + """ + inputs = self._check_input_text(inputs) + return inputs + + def _run_model(self, inputs): + """ + Run the task model from the outputs of the `_tokenize` function. + """ + results = self._multi_stage_predict(inputs) + return results + + def _postprocess(self, inputs): + """ + The model output is tag ids, this function will convert the model output to raw text. + """ + return inputs + + def _construct_input_spec(self): + """ + Construct the input spec for the convert dygraph model to static model. 
+ """ + if paddle.get_device().split(":", 1)[0] == "npu": + input_spec_dtype = "int32" + else: + input_spec_dtype = "int64" + self._input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="attention_mask"), + ] + + def _single_stage_predict(self, inputs): + inputs = [self._prompt.format(sentence=dic["text"], prompt=dic["prompt"]) for dic in inputs] + batch_size = self.kwargs["batch_size"] if "batch_size" in self.kwargs else 1 + batches = self._batchify(inputs, batch_size) + examples = [] + for input_text in batches: + if self._tokenizer.chat_template is not None: + input_text = [input_text] if isinstance(input_text, str) else input_text + input_text = [self._tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in input_text] + tokenized_output = self._tokenizer( + input_text, + return_tensors="pd", + return_position_ids=True, + padding_side="left", + padding=True, + max_new_tokens=self._max_seq_length, + truncation=True, + truncation_side="left", + add_special_tokens=self._tokenizer.chat_template is None, + ) + examples.append(tokenized_output) + + outputs = {} + outputs["text"] = inputs + outputs["data_loader"] = examples + + batch_size = self.kwargs["batch_size"] if "batch_size" in self.kwargs else 1 + results = [] + for batch_inputs in outputs["data_loader"]: + result = self._model.generate( + **batch_inputs, + decode_strategy=self._decode_strategy, + top_k=self._top_k, + top_p=self._top_p, + temperature=self._temperature, + max_new_tokens=self._tgt_length, + bos_token_id=self._tokenizer.bos_token_id, + eos_token_id=self._tokenizer.eos_token_id, + pad_token_id=self._tokenizer.pad_token_id, + num_return_sequences=self._num_return_sequences, + use_cache=True, + ) + results.extend(result[0]) + out_list = [] + for x in results: + res = self._tokenizer.decode(x.numpy().tolist(), skip_special_tokens=True) + res = res.strip("\n") + end_idx = res.find("\n**回答结束**") + if end_idx != -1: + res = res[:end_idx] + out_list.append([{"text": res}]) + + return out_list + + def _multi_stage_predict(self, data): + """ + Traversal the schema tree and do multi-stage prediction. 
+ + Args: + data (list): a list of strings + + Returns: + list: a list of predictions, where the list's length + equals to the length of `data` + """ + results = [{} for _ in range(len(data))] + # Input check to early return + if len(data) < 1 or self._schema_tree is None: + return results + + # Copy to stay `self._schema_tree` unchanged + schema_list = self._schema_tree.children[:] + while len(schema_list) > 0: + node = schema_list.pop(0) + examples = [] + input_map = {} + cnt = 0 + idx = 0 + if not node.prefix: + for one_data in data: + examples.append( + { + "text": one_data, + "prompt": dbc2sbc(node.name), + } + ) + input_map[cnt] = [idx] + idx += 1 + cnt += 1 + else: + for pre, one_data in zip(node.prefix, data): + if len(pre) == 0: + input_map[cnt] = [] + else: + for p in pre: + prompt = p + node.name + examples.append( + { + "text": one_data, + "prompt": dbc2sbc(prompt), + } + ) + input_map[cnt] = [i + idx for i in range(len(pre))] + idx += len(pre) + cnt += 1 + if len(examples) == 0: + result_list = [] + else: + result_list = self._single_stage_predict(examples) + + if not node.parent_relations: + relations = [[] for i in range(len(data))] + for k, v in input_map.items(): + for idx in v: + if len(result_list[idx]) == 0: + continue + if node.name not in results[k].keys(): + results[k][node.name] = result_list[idx] + else: + results[k][node.name].extend(result_list[idx]) + if node.name in results[k].keys(): + relations[k].extend(results[k][node.name]) + else: + relations = node.parent_relations + for k, v in input_map.items(): + for i in range(len(v)): + if len(result_list[v[i]]) == 0: + continue + if "relations" not in relations[k][i].keys(): + relations[k][i]["relations"] = {node.name: result_list[v[i]]} + elif node.name not in relations[k][i]["relations"].keys(): + relations[k][i]["relations"][node.name] = result_list[v[i]] + else: + relations[k][i]["relations"][node.name].extend(result_list[v[i]]) + new_relations = [[] for i in range(len(data))] + for i in range(len(relations)): + for j in range(len(relations[i])): + if "relations" in relations[i][j].keys() and node.name in relations[i][j]["relations"].keys(): + for k in range(len(relations[i][j]["relations"][node.name])): + new_relations[i].append(relations[i][j]["relations"][node.name][k]) + relations = new_relations + + prefix = [[] for _ in range(len(data))] + for k, v in input_map.items(): + for idx in v: + for i in range(len(result_list[idx])): + if self._is_en: + prefix[k].append(" of " + result_list[idx][i]["text"]) + else: + prefix[k].append(result_list[idx][i]["text"] + "的") + + for child in node.children: + child.prefix = prefix + child.parent_relations = relations + schema_list.append(child) + return results + + def set_schema(self, schema): + if isinstance(schema, dict) or isinstance(schema, str): + schema = [schema] + self._schema_tree = self._build_tree(schema) + + @classmethod + def _build_tree(cls, schema, name="root"): + """ + Build the schema tree. 
+ """ + schema_tree = SchemaTree(name) + for s in schema: + if isinstance(s, str): + schema_tree.add_child(SchemaTree(s)) + elif isinstance(s, dict): + for k, v in s.items(): + if isinstance(v, str): + child = [v] + elif isinstance(v, list): + child = v + else: + raise TypeError( + "Invalid schema, value for each key:value pairs should be list or string" + "but {} received".format(type(v)) + ) + schema_tree.add_child(cls._build_tree(child, name=k)) + else: + raise TypeError("Invalid schema, element should be string or dict, " "but {} received".format(type(s))) + return schema_tree + + class UIETask(Task): """ Universal Information Extraction Task. @@ -510,7 +810,6 @@ def __init__(self, task, model, schema=None, **kwargs): self._check_task_files() with open(os.path.join(self._task_path, CONFIG_NAME)) as f: self._init_class = json.load(f)["architectures"].pop() - self._is_en = True if model in ["uie-base-en"] or self._schema_lang == "en" else False if self._init_class in ["UIEX"]: @@ -583,7 +882,9 @@ def _construct_model(self, model): Construct the inference model for the predictor. """ model_instance = MODEL_MAP[self._init_class].from_pretrained( - self._task_path, from_hf_hub=self.from_hf_hub, convert_from_torch=self._convert_from_torch + self._task_path, + from_hf_hub=self.from_hf_hub, + convert_from_torch=self._convert_from_torch, ) self._model = model_instance self._model.eval() @@ -621,16 +922,20 @@ def _check_input_text(self, inputs): if "doc" in example.keys(): if not self._parser_map[self._ocr_lang_choice]: self._parser_map[self._ocr_lang_choice] = DocParser( - ocr_lang=self._ocr_lang, layout_analysis=self._layout_analysis + ocr_lang=self._ocr_lang, + layout_analysis=self._layout_analysis, ) if "layout" in example.keys(): data = self._parser_map[self._ocr_lang_choice].parse( - {"doc": example["doc"]}, do_ocr=False, expand_to_a4_size=self._expand_to_a4_size + {"doc": example["doc"]}, + do_ocr=False, + expand_to_a4_size=self._expand_to_a4_size, ) data["layout"] = example["layout"] else: data = self._parser_map[self._ocr_lang_choice].parse( - {"doc": example["doc"]}, expand_to_a4_size=self._expand_to_a4_size + {"doc": example["doc"]}, + expand_to_a4_size=self._expand_to_a4_size, ) elif "text" in example.keys(): if not isinstance(example["text"], str): @@ -658,14 +963,16 @@ def _check_input_text(self, inputs): def _single_stage_predict(self, inputs): input_texts = [d["text"] for d in inputs] prompts = [d["prompt"] for d in inputs] - # max predict length should exclude the length of prompt and summary tokens max_predict_len = self._max_seq_len - len(max(prompts)) - self._summary_token_num if self._init_class in ["UIEX"]: bbox_list = [d["bbox"] for d in inputs] short_input_texts, short_bbox_list, input_mapping = self._auto_splitter( - input_texts, max_predict_len, bbox_list=bbox_list, split_sentence=self._split_sentence + input_texts, + max_predict_len, + bbox_list=bbox_list, + split_sentence=self._split_sentence, ) else: short_input_texts, input_mapping = self._auto_splitter( @@ -761,7 +1068,14 @@ def _process_bbox(tokens, bbox_lines, offset_mapping, offset_bias): return bbox_list def _encode_doc( - tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len + tokenizer, + offset_mapping, + last_offset, + prompt, + this_text_line, + inputs_ids, + q_sep_index, + max_seq_len, ): if len(offset_mapping) == 0: content_encoded_inputs = tokenizer( @@ -795,7 +1109,10 @@ def _encode_doc( last_offset = offset_mapping[-1][-1] else: content_encoded_inputs = 
tokenizer( - text=this_text_line, max_seq_len=max_seq_len, return_dict=False, return_offsets_mapping=True + text=this_text_line, + max_seq_len=max_seq_len, + return_dict=False, + return_offsets_mapping=True, ) inputs_ids += content_encoded_inputs["input_ids"][1:-1] sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]] @@ -842,7 +1159,7 @@ def _encode_doc( bbox_list = [[0, 0, 0, 0] for x in range(len(inputs_ids))] token_type_ids = [ - 1 if token_index <= q_sep_index or token_index > c_sep_index else 0 + (1 if token_index <= q_sep_index or token_index > c_sep_index else 0) for token_index in range(self._max_seq_len) ] padded_image = np.zeros([3, 224, 224]) @@ -930,7 +1247,13 @@ def _encode_doc( padded_image, offset_mapping, ] - input_list = [inputs_ids, token_type_ids, position_ids, attention_mask, bbox_list] + input_list = [ + inputs_ids, + token_type_ids, + position_ids, + attention_mask, + bbox_list, + ] return_list = [np.array(x, dtype="int64") for x in input_list] return_list.append(np.array(padded_image, dtype="float32")) return_list.append(np.array(offset_mapping, dtype="int64")) @@ -946,14 +1269,25 @@ def _encode_doc( batch_sampler = paddle.io.BatchSampler(dataset=infer_ds, batch_size=self._batch_size, shuffle=False) infer_data_loader = paddle.io.DataLoader( - dataset=infer_ds, batch_sampler=batch_sampler, num_workers=self._num_workers, return_list=True + dataset=infer_ds, + batch_sampler=batch_sampler, + num_workers=self._num_workers, + return_list=True, ) sentence_ids = [] probs = [] for batch in infer_data_loader: if self._init_class in ["UIEX"]: - input_ids, token_type_ids, pos_ids, att_mask, bbox, image, offset_maps = batch + ( + input_ids, + token_type_ids, + pos_ids, + att_mask, + bbox, + image, + offset_maps, + ) = batch elif self._init_class in ["UIEM"]: input_ids, pos_ids, offset_maps = batch else: @@ -1033,7 +1367,10 @@ def _auto_joiner(self, short_results, short_inputs, input_mapping): if len(short_results[v]) == 0: continue if short_results[v][0]["text"] not in cls_options.keys(): - cls_options[short_results[v][0]["text"]] = [1, short_results[v][0]["probability"]] + cls_options[short_results[v][0]["text"]] = [ + 1, + short_results[v][0]["probability"], + ] else: cls_options[short_results[v][0]["text"]][0] += 1 cls_options[short_results[v][0]["text"]][1] += short_results[v][0]["probability"] @@ -1087,7 +1424,14 @@ def _parse_inputs(self, inputs): box = self._parser_map[self._ocr_lang_choice]._normalize_box(box, [img_w, img_h], [1000, 1000]) text += segment[1] bbox.extend([box] * len(segment[1])) - _inputs.append({"text": text, "bbox": bbox, "image": d["image"], "layout": d["layout"]}) + _inputs.append( + { + "text": text, + "bbox": bbox, + "image": d["image"], + "layout": d["layout"], + } + ) else: _inputs.append({"text": d["text"], "bbox": None, "image": None}) else: @@ -1162,7 +1506,6 @@ def _multi_stage_predict(self, data): result_list = [] else: result_list = self._single_stage_predict(examples) - if not node.parent_relations: relations = [[] for i in range(len(data))] for k, v in input_map.items(): @@ -1249,7 +1592,12 @@ def _add_bbox(result, char_boxes): if len(segment) == 2 or (len(segment) == 3 and segment[2] != "table"): char_w = (sbox[2] - sbox[0]) * 1.0 / text_len for i in range(text_len): - cbox = [sbox[0] + i * char_w, sbox[1], sbox[0] + (i + 1) * char_w, sbox[3]] + cbox = [ + sbox[0] + i * char_w, + sbox[1], + sbox[0] + (i + 1) * char_w, + sbox[3], + ] char_boxes.append((segment[1][i], cbox)) else: cell_bbox = [(segment[1][i], 
sbox) for i in range(text_len)] @@ -1281,7 +1629,12 @@ def _convert_ids_to_results(self, examples, sentence_ids, probs): result = {"text": prompt[start:end], "probability": prob[i]} result_list.append(result) else: - result = {"text": text[start:end], "start": start, "end": end, "probability": prob[i]} + result = { + "text": text[start:end], + "start": start, + "end": end, + "probability": prob[i], + } result_list.append(result) results.append(result_list) return results @@ -1507,7 +1860,10 @@ def _postprocess_opinion_extraction(self, inputs): for rel in all_rel_preds[i]: r = aspect_maps[(rel["aspect"], rel["aspect_start_index"])] r["relations"] = {} - sentiment = {"probability": rel["probability"], "text": rel["sentiment"]} + sentiment = { + "probability": rel["probability"], + "text": rel["sentiment"], + } opinion = { "text": rel["opinion"], "start": rel["opinion_start_index"], diff --git a/paddlenlp/taskflow/multimodal_feature_extraction.py b/paddlenlp/taskflow/multimodal_feature_extraction.py index 3e6050081643..7671ba3c3991 100644 --- a/paddlenlp/taskflow/multimodal_feature_extraction.py +++ b/paddlenlp/taskflow/multimodal_feature_extraction.py @@ -19,6 +19,7 @@ from PIL import Image from ..transformers import AutoModel, AutoProcessor +from ..utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX from ..utils.log import logger from .task import Task from .utils import dygraph_mode_guard, static_mode_guard @@ -411,9 +412,9 @@ def _get_inference_model(self): self.inference_image_model_path = os.path.join(_base_path, "static", "get_image_features") self.inference_text_model_path = os.path.join(_base_path, "static", "get_text_features") if ( - not os.path.exists(self.inference_image_model_path + ".pdiparams") + not os.path.exists(self.inference_image_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX) or self._param_updated - or not os.path.exists(self.inference_text_model_path + ".pdiparams") + or not os.path.exists(self.inference_text_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX) ): with dygraph_mode_guard(): self._construct_model(self.model) @@ -422,8 +423,8 @@ def _get_inference_model(self): if self._predictor_type == "paddle-inference": # Get text inference model self.inference_model_path = self.inference_text_model_path - self._static_model_file = self.inference_model_path + ".pdmodel" - self._static_params_file = self.inference_model_path + ".pdiparams" + self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX + self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX self._config = paddle.inference.Config(self._static_model_file, self._static_params_file) self._prepare_static_mode() @@ -435,8 +436,8 @@ def _get_inference_model(self): # Get image inference model self.inference_model_path = self.inference_image_model_path - self._static_model_file = self.inference_model_path + ".pdmodel" - self._static_params_file = self.inference_model_path + ".pdiparams" + self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX + self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX self._config = paddle.inference.Config(self._static_model_file, self._static_params_file) self._prepare_static_mode() @@ -449,15 +450,15 @@ def _get_inference_model(self): # Get text onnx model self.export_type = "text" self.inference_model_path = self.inference_text_model_path - self._static_model_file = self.inference_model_path + ".pdmodel" - self._static_params_file = 
self.inference_model_path + ".pdiparams" + self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX + self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX self._prepare_onnx_mode() self.predictor_map["text"] = self.predictor # Get image onnx model self.export_type = "image" self.inference_model_path = self.inference_image_model_path - self._static_model_file = self.inference_model_path + ".pdmodel" - self._static_params_file = self.inference_model_path + ".pdiparams" + self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX + self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX self._prepare_onnx_mode() self.predictor_map["image"] = self.predictor diff --git a/paddlenlp/taskflow/task.py b/paddlenlp/taskflow/task.py index 22b178b61d35..f7cdd87cec74 100644 --- a/paddlenlp/taskflow/task.py +++ b/paddlenlp/taskflow/task.py @@ -20,9 +20,14 @@ from multiprocessing import cpu_count import paddle +from paddle.base.framework import use_pir_api from paddle.dataset.common import md5file -from ..utils.env import PPNLP_HOME +from ..utils.env import ( + PADDLE_INFERENCE_MODEL_SUFFIX, + PADDLE_INFERENCE_WEIGHTS_SUFFIX, + PPNLP_HOME, +) from ..utils.log import logger from .utils import cut_chinese_sent, download_check, download_file, dygraph_mode_guard @@ -54,7 +59,15 @@ def __init__(self, model, task, priority_path=None, **kwargs): self._param_updated = False self._num_threads = self.kwargs["num_threads"] if "num_threads" in self.kwargs else math.ceil(cpu_count() / 2) - self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32" + if ( + self.task == "paddlenlp/PP-UIE-0.5B" + or self.task == "paddlenlp/PP-UIE-1.5B" + or self.task == "paddlenlp/PP-UIE-7B" + or self.task == "paddlenlp/PP-UIE-14B" + ): + self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "float16" + else: + self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32" # Default to use Paddle Inference self._predictor_type = "paddle-inference" # The root directory for storing Taskflow related files, default to ~/.paddlenlp. 
@@ -118,12 +131,12 @@ def _construct_input_spec(self): def _get_static_model_name(self): names = [] for file_name in os.listdir(self._task_path): - if ".pdmodel" in file_name: - names.append(file_name[:-8]) + if PADDLE_INFERENCE_MODEL_SUFFIX in file_name: + names.append(file_name[: -len(PADDLE_INFERENCE_MODEL_SUFFIX)]) if len(names) == 0: - raise IOError(f"{self._task_path} should include '.pdmodel' file.") + raise IOError(f"{self._task_path} should include '{PADDLE_INFERENCE_MODEL_SUFFIX}' file.") if len(names) > 1: - logger.warning(f"{self._task_path} includes more than one '.pdmodel' file.") + logger.warning(f"{self._task_path} includes more than one '{PADDLE_INFERENCE_MODEL_SUFFIX}' file.") return names[0] def _check_task_files(self): @@ -212,18 +225,25 @@ def _prepare_static_mode(self): # TODO(linjieccc): enable after fixed self._config.delete_pass("embedding_eltwise_layernorm_fuse_pass") self._config.delete_pass("fused_multi_transformer_encoder_pass") + self._config.delete_pass("fused_rotary_position_embedding_pass") + + self._config.switch_ir_optim(True) + self._config.enable_new_executor() + self._config.set_cpu_math_library_num_threads(self._num_threads) self._config.switch_use_feed_fetch_ops(False) self._config.disable_glog_info() self._config.enable_memory_optim() - # TODO(linjieccc): some temporary settings and will be remove in future # after fixed - if self.task in ["document_intelligence", "knowledge_mining", "zero_shot_text_classification"]: + if self.task in [ + "document_intelligence", + "knowledge_mining", + "zero_shot_text_classification", + ]: self._config.switch_ir_optim(False) if self.model == "uie-data-distill-gp": self._config.enable_memory_optim(False) - self.predictor = paddle.inference.create_predictor(self._config) self.input_names = [name for name in self.predictor.get_input_names()] self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] @@ -281,12 +301,14 @@ def _get_inference_model(self): """ if self._custom_model: param_path = os.path.join(self._task_path, "model_state.pdparams") - if os.path.exists(param_path): cache_info_path = os.path.join(self._task_path, ".cache_info") md5 = md5file(param_path) self._param_updated = True - if os.path.exists(cache_info_path) and open(cache_info_path).read()[:-8] == md5: + if ( + os.path.exists(cache_info_path) + and open(cache_info_path).read()[: -len(PADDLE_INFERENCE_MODEL_SUFFIX)] == md5 + ): self._param_updated = False elif self.task == "information_extraction" and self.model != "uie-data-distill-gp": # UIE related models are moved to paddlenlp.transformers after v2.4.5 @@ -296,13 +318,20 @@ def _get_inference_model(self): fp.write(md5 + "taskflow") fp.close() model_state = paddle.load(param_path) - prefix_map = {"UIE": "ernie", "UIEM": "ernie_m", "UIEX": "ernie_layout"} + prefix_map = { + "UIE": "ernie", + "UIEM": "ernie_m", + "UIEX": "ernie_layout", + } new_state_dict = {} for name, param in model_state.items(): if "ernie" in name: new_state_dict[name] = param elif "encoder.encoder" in name: - trans_name = name.replace("encoder.encoder", prefix_map[self._init_class] + ".encoder") + trans_name = name.replace( + "encoder.encoder", + prefix_map[self._init_class] + ".encoder", + ) new_state_dict[trans_name] = param elif "encoder" in name: trans_name = name.replace("encoder", prefix_map[self._init_class]) @@ -318,11 +347,11 @@ def _get_inference_model(self): # When the user-provided model path is already a static model, skip to_static conversion if self.is_static_model: 
self.inference_model_path = os.path.join(self._task_path, self._static_model_name) - if not os.path.exists(self.inference_model_path + ".pdmodel") or not os.path.exists( - self.inference_model_path + ".pdiparams" + if not os.path.exists(self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX) or not os.path.exists( + self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX ): raise IOError( - f"{self._task_path} should include {self._static_model_name + '.pdmodel'} and {self._static_model_name + '.pdiparams'} while is_static_model is True" + f"{self._task_path} should include {self._static_model_name + PADDLE_INFERENCE_MODEL_SUFFIX} and {self._static_model_name + PADDLE_INFERENCE_WEIGHTS_SUFFIX} while is_static_model is True" ) if self.paddle_quantize_model(self.inference_model_path): self._infer_precision = "int8" @@ -336,19 +365,20 @@ def _get_inference_model(self): else os.path.join(self._home_path, "taskflow", self.task, self._task_path) ) self.inference_model_path = os.path.join(_base_path, "static", "inference") - if not os.path.exists(self.inference_model_path + ".pdiparams") or self._param_updated: + if not os.path.exists(self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX) or self._param_updated: with dygraph_mode_guard(): self._construct_model(self.model) self._construct_input_spec() self._convert_dygraph_to_static() - self._static_model_file = self.inference_model_path + ".pdmodel" - self._static_params_file = self.inference_model_path + ".pdiparams" + + self._static_model_file = self.inference_model_path + PADDLE_INFERENCE_MODEL_SUFFIX + self._static_params_file = self.inference_model_path + PADDLE_INFERENCE_WEIGHTS_SUFFIX if paddle.get_device().split(":", 1)[0] == "npu" and self._infer_precision == "fp16": # transform fp32 model tp fp16 model - self._static_fp16_model_file = self.inference_model_path + "-fp16.pdmodel" - self._static_fp16_params_file = self.inference_model_path + "-fp16.pdiparams" + self._static_fp16_model_file = self.inference_model_path + f"-fp16{PADDLE_INFERENCE_MODEL_SUFFIX}" + self._static_fp16_params_file = self.inference_model_path + f"-fp16{PADDLE_INFERENCE_WEIGHTS_SUFFIX}" if not os.path.exists(self._static_fp16_model_file) and not os.path.exists(self._static_fp16_params_file): logger.info("Converting to the inference model from fp32 to fp16.") paddle.inference.convert_to_mixed_precision( @@ -368,7 +398,10 @@ def _get_inference_model(self): self._static_model_file = self._static_fp16_model_file self._static_params_file = self._static_fp16_params_file if self._predictor_type == "paddle-inference": - self._config = paddle.inference.Config(self._static_model_file, self._static_params_file) + if use_pir_api(): + self._config = paddle.inference.Config(self._static_json_file, self._static_params_file) + else: + self._config = paddle.inference.Config(self._static_model_file, self._static_params_file) self._prepare_static_mode() else: self._prepare_onnx_mode() @@ -384,7 +417,8 @@ def _convert_dygraph_to_static(self): self._input_spec is not None ), "The input spec must be created before converting the dygraph model to static model." 
logger.info("Converting to the inference model cost a little time.") - static_model = paddle.jit.to_static(self._model, input_spec=self._input_spec) + + static_model = paddle.jit.to_static(self._model, input_spec=self._input_spec, full_graph=True) paddle.jit.save(static_model, self.inference_model_path) logger.info("The inference model save in the path:{}".format(self.inference_model_path)) @@ -512,7 +546,7 @@ def paddle_quantize_model(self, model_path): program = model.program() for block in program.blocks: for op in block.ops: - if op.type.count("quantize"): + if "quantize" in op.name(): return True return False diff --git a/paddlenlp/taskflow/taskflow.py b/paddlenlp/taskflow/taskflow.py index 520ad4cf5886..fdba3ee9ad3d 100644 --- a/paddlenlp/taskflow/taskflow.py +++ b/paddlenlp/taskflow/taskflow.py @@ -23,7 +23,7 @@ from .dialogue import DialogueTask from .document_intelligence import DocPromptTask from .fill_mask import FillMaskTask -from .information_extraction import GPTask, UIETask +from .information_extraction import GPTask, UIELLMTask, UIETask from .knowledge_mining import NPTagTask, WordTagTask from .lexical_analysis import LacTask from .multimodal_feature_extraction import MultimodalFeatureExtractionTask @@ -67,7 +67,10 @@ }, "dialogue": { "models": { - "plato-mini": {"task_class": DialogueTask, "task_flag": "dialogue-plato-mini"}, + "plato-mini": { + "task_class": DialogueTask, + "task_flag": "dialogue-plato-mini", + }, "__internal_testing__/tiny-random-plato": { "task_class": DialogueTask, "task_flag": "dialogue-tiny-random-plato", @@ -79,7 +82,10 @@ }, "fill_mask": { "models": { - "fill_mask": {"task_class": FillMaskTask, "task_flag": "fill_mask-fill_mask"}, + "fill_mask": { + "task_class": FillMaskTask, + "task_flag": "fill_mask-fill_mask", + }, }, "default": { "model": "fill_mask", @@ -206,7 +212,10 @@ }, "text_correction": { "models": { - "ernie-csc": {"task_class": CSCTask, "task_flag": "text_correction-ernie-csc"}, + "ernie-csc": { + "task_class": CSCTask, + "task_flag": "text_correction-ernie-csc", + }, }, "default": {"model": "ernie-csc"}, }, @@ -314,16 +323,56 @@ }, "information_extraction": { "models": { - "uie-base": {"task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-base"}, + "paddlenlp/PP-UIE-0.5B": { + "task_class": UIELLMTask, + "hidden_size": 896, + "task_flag": "information_extraction-pp-uie-0.5b", + }, + "paddlenlp/PP-UIE-1.5B": { + "task_class": UIELLMTask, + "hidden_size": 1536, + "task_flag": "information_extraction-pp-uie-1.5b", + }, + "paddlenlp/PP-UIE-7B": { + "task_class": UIELLMTask, + "hidden_size": 3584, + "task_flag": "information_extraction-pp-uie-7b", + }, + "paddlenlp/PP-UIE-14B": { + "task_class": UIELLMTask, + "hidden_size": 5120, + "task_flag": "information_extraction-pp-uie-14b", + }, + "uie-base": { + "task_class": UIETask, + "hidden_size": 768, + "task_flag": "information_extraction-uie-base", + }, "uie-medium": { "task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-medium", }, - "uie-mini": {"task_class": UIETask, "hidden_size": 384, "task_flag": "information_extraction-uie-mini"}, - "uie-micro": {"task_class": UIETask, "hidden_size": 384, "task_flag": "information_extraction-uie-micro"}, - "uie-nano": {"task_class": UIETask, "hidden_size": 312, "task_flag": "information_extraction-uie-nano"}, - "uie-tiny": {"task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-tiny"}, + "uie-mini": { + "task_class": UIETask, + "hidden_size": 384, + "task_flag": 
"information_extraction-uie-mini", + }, + "uie-micro": { + "task_class": UIETask, + "hidden_size": 384, + "task_flag": "information_extraction-uie-micro", + }, + "uie-nano": { + "task_class": UIETask, + "hidden_size": 312, + "task_flag": "information_extraction-uie-nano", + }, + "uie-tiny": { + "task_class": UIETask, + "hidden_size": 768, + "task_flag": "information_extraction-uie-tiny", + }, "uie-medical-base": { "task_class": UIETask, "hidden_size": 768, @@ -349,7 +398,10 @@ "hidden_size": 768, "task_flag": "information_extraction-uie-x-base", }, - "uie-data-distill-gp": {"task_class": GPTask, "task_flag": "information_extraction-uie-data-distill-gp"}, + "uie-data-distill-gp": { + "task_class": GPTask, + "task_flag": "information_extraction-uie-data-distill-gp", + }, "__internal_testing__/tiny-random-uie": { "task_class": UIETask, "hidden_size": 8, @@ -693,6 +745,10 @@ } support_schema_list = [ + "paddlenlp/PP-UIE-0.5B", + "paddlenlp/PP-UIE-1.5B", + "paddlenlp/PP-UIE-7B", + "paddlenlp/PP-UIE-14B", "uie-base", "uie-medium", "uie-mini", @@ -736,6 +792,10 @@ "openai/disco-diffusion-clip-rn50", "openai/disco-diffusion-clip-rn101", "PaddlePaddle/disco_diffusion_ernie_vil-2.0-base-zh", + "paddlenlp/PP-UIE-0.5B", + "paddlenlp/PP-UIE-1.5B", + "paddlenlp/PP-UIE-7B", + "paddlenlp/PP-UIE-14B", "uie-base", "uie-medium", "uie-mini", @@ -807,7 +867,11 @@ def __init__(self, task, model=None, mode=None, device_id=0, from_hf_hub=False, self.kwargs = kwargs task_class = TASKS[self.task][tag][self.model]["task_class"] self.task_instance = task_class( - model=self.model, task=self.task, priority_path=self.priority_path, from_hf_hub=from_hf_hub, **self.kwargs + model=self.model, + task=self.task, + priority_path=self.priority_path, + from_hf_hub=from_hf_hub, + **self.kwargs, ) task_list = TASKS.keys() Taskflow.task_list = task_list diff --git a/paddlenlp/taskflow/text_similarity.py b/paddlenlp/taskflow/text_similarity.py index 5792125218f0..cb3a296db343 100644 --- a/paddlenlp/taskflow/text_similarity.py +++ b/paddlenlp/taskflow/text_similarity.py @@ -196,7 +196,7 @@ def _construct_model(self, model): """ if "rocketqav2-en" in model or "ernie-search" in model: - self._model = ErnieCrossEncoder(self._task_path, num_classes=1, reinitialize=True) + self._model = ErnieCrossEncoder(self._task_path, num_classes=2, reinitialize=True) elif "rocketqa" in model: self._model = ErnieCrossEncoder(self._task_path, num_classes=2) else: @@ -274,7 +274,6 @@ def _run_model(self, inputs): if "rocketqa" in self.model_name or "ernie-search" in self.model_name: with static_mode_guard(): for batch in inputs["data_loader"]: - if self._predictor_type == "paddle-inference": input_ids, segment_ids = self._batchify_fn(batch) self.input_handles[0].copy_from_cpu(input_ids) diff --git a/paddlenlp/taskflow/zero_shot_text_classification.py b/paddlenlp/taskflow/zero_shot_text_classification.py index 43d9f8ff5756..9bd5251d6048 100644 --- a/paddlenlp/taskflow/zero_shot_text_classification.py +++ b/paddlenlp/taskflow/zero_shot_text_classification.py @@ -259,6 +259,7 @@ class ZeroShotTextClassificationTask(Task): def __init__(self, task: str, model: str, schema: list = None, **kwargs): super().__init__(task=task, model=model, **kwargs) + self._static_mode = False self._set_utc_schema(schema) self._max_seq_len = kwargs.get("max_seq_len", 512) self._batch_size = kwargs.get("batch_size", 1) @@ -269,7 +270,10 @@ def __init__(self, task: str, model: str, schema: list = None, **kwargs): self._check_task_files() self._construct_tokenizer() 
self._check_predictor_type() - self._get_inference_model() + if self._static_mode: + self._get_inference_model() + else: + self._construct_model(model) def _set_utc_schema(self, schema): if schema is None: @@ -293,7 +297,7 @@ def _construct_input_spec(self): InputSpec(shape=[None, None], dtype="int64", name="input_ids"), InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), InputSpec(shape=[None, None], dtype="int64", name="position_ids"), - InputSpec(shape=[None, None, None, None], dtype="float32", name="attention_mask"), + InputSpec(shape=[None, None], dtype="float32", name="attention_mask"), InputSpec(shape=[None, None], dtype="int64", name="omask_positions"), InputSpec(shape=[None], dtype="int64", name="cls_positions"), ] @@ -311,7 +315,10 @@ def _construct_tokenizer(self): Construct the tokenizer for the predictor. """ self._tokenizer = AutoTokenizer.from_pretrained(self._task_path, from_hf_hub=self.from_hf_hub) - self._collator = PromptDataCollatorWithPadding(self._tokenizer, return_tensors="np") + if self._static_mode: + self._collator = PromptDataCollatorWithPadding(self._tokenizer, return_tensors="np") + else: + self._collator = PromptDataCollatorWithPadding(self._tokenizer, return_tensors="pd") self._template = UTCTemplate(self._tokenizer, self._max_seq_len) def _check_input_text(self, inputs): @@ -381,19 +388,26 @@ def _run_model(self, inputs: Dict[str, Any]) -> Dict[str, Any]: "omask_positions": "int64", "cls_positions": "int64", } - with static_mode_guard(): + if self._static_mode: + with static_mode_guard(): + for batch in inputs["batches"]: + if self._predictor_type == "paddle-inference": + for i, input_name in enumerate(self.input_names): + self.input_handles[i].copy_from_cpu(batch[input_name].astype(dtype_dict[input_name])) + self.predictor.run() + logits = self.output_handle[0].copy_to_cpu().tolist() + else: + input_dict = {} + for input_name in dtype_dict: + input_dict[input_name] = batch[input_name].astype(dtype_dict[input_name]) + logits = self.predictor.run(None, input_dict)[0].tolist() + outputs["batch_logits"].append(logits) + else: for batch in inputs["batches"]: - if self._predictor_type == "paddle-inference": - for i, input_name in enumerate(self.input_names): - self.input_handles[i].copy_from_cpu(batch[input_name].astype(dtype_dict[input_name])) - self.predictor.run() - logits = self.output_handle[0].copy_to_cpu().tolist() - else: - input_dict = {} - for input_name in dtype_dict: - input_dict[input_name] = batch[input_name].astype(dtype_dict[input_name]) - logits = self.predictor.run(None, input_dict)[0].tolist() - outputs["batch_logits"].append(logits) + if batch["soft_token_ids"] is not None: + del batch["soft_token_ids"] + logits = self._model(**batch) + outputs["batch_logits"].append(np.array(logits)) return outputs diff --git a/paddlenlp/trainer/auto_trainer.py b/paddlenlp/trainer/auto_trainer.py index ea01b7104e81..66c4a0aef957 100644 --- a/paddlenlp/trainer/auto_trainer.py +++ b/paddlenlp/trainer/auto_trainer.py @@ -20,12 +20,9 @@ import numpy as np import paddle import paddle.distributed as dist +import paddle.distributed.auto_parallel.intermediate.parallelize as parallelize import paddle.nn as nn from paddle.distributed import fleet -from paddle.distributed.auto_parallel.intermediate.parallelize import ( - parallelize_model, - parallelize_optimizer, -) from tqdm.auto import tqdm from paddlenlp.trainer import Trainer @@ -34,6 +31,7 @@ from ..utils.batch_sampler import DistributedBatchSampler as NlpDistributedBatchSampler from ..utils.log 
import logger from .argparser import strtobool +from .auto_training_args import AutoTrainingArguments from .trainer import SCALER_NAME, SCHEDULER_NAME, TRAINER_STATE_NAME, TRAINING_ARGS_NAME from .trainer_callback import TrainerState from .trainer_utils import ( # set_hyrbid_parallel_seed, @@ -70,44 +68,24 @@ def loss_func(loss, outputs): return loss kwargs.update({"criterion": loss_func}) - - sequence_parallel = False - if kwargs.get("model_args", None) is not None: - model_args = kwargs.pop("model_args") - if hasattr(model_args, "sequence_parallel"): - sequence_parallel = model_args.sequence_parallel - + self.auto_dist_config = kwargs.pop("auto_dist_config", None) + model = kwargs.get("model", None) + assert model is not None if kwargs.get("args", None) is not None and kwargs["args"].use_intermediate_api: - model = kwargs.get("model", None) - assert model is not None - assert isinstance(model, PretrainedModel), f" AutoTrainer only support pretrained models,but got {model}" - for param in model.parameters(): - assert not param._is_initialized(), "intermediate_api needs lazy init" - - auto_dist_degree = { - "tensor_parallel": kwargs["args"].tensor_parallel_degree > 1, - "sequence_parallel": sequence_parallel, - "pipeline_parallel": kwargs["args"].pipeline_parallel_degree > 1, - "data_sharding_parallel": kwargs["args"].dataset_world_size > 1, - "sharding": kwargs["args"].sharding, - "sharding_mesh_dim": kwargs["args"].sharding_parallel_mesh_dimension, - } - auto_dist_config = model._generate_auto_dist_config(auto_dist_degree) - self.auto_dist_config = auto_dist_config - - model = parallelize_model( - model, - config=self.auto_dist_config, - ) - - kwargs["model"] = model - + if not parallelize.has_parallelized_model: + model, self.auto_dist_config = self.parallel_model(model, kwargs["args"]) + kwargs["model"] = model + else: + assert kwargs.get( + "auto_dist_config", None + ), "if use AutoTrainer.parallel_model , auto_dist_config obtained from parallel_model should be passed to AutoTrainer " + self.auto_dist_config = kwargs.pop("auto_dist_config") model = kwargs["model"] for param in model.parameters(): - if not param._is_initialized(): + # NOTE(zhangwl):in pipeline mode , param my be initialized before while delte init_func ,but param is still not is_initialized + if not param._is_initialized() and param._init_func is not None: param.initialize() kwargs["model"] = model - super().__init__(*args, **kwargs) assert self.args.enable_auto_parallel @@ -115,6 +93,42 @@ def loss_func(loss, outputs): self.comm_group_in_pp = fleet.get_hybrid_communicate_group().get_pipe_parallel_group() self._in_pir_mode = paddle.base.framework.get_flags("FLAGS_enable_pir_api")["FLAGS_enable_pir_api"] + @classmethod + def parallel_model(cls, model, training_args: AutoTrainingArguments): + """ + Parallelize the model from a single card version to a distributed version. + Args: + model (paddle.nn.Layer): the model to be parallelized. 
+ training_args (AutoTrainingArguments) : Training arguments which contain distributed information + Returns: + the model after parallelize and config conatins distributed strategy + """ + if not training_args.use_intermediate_api: + return model, None + assert model is not None + for param in model.parameters(): + if param._is_initialized(): + logger.warning( + "intermediate_api needs lazy init because if param init before parallelize_model ," + + " param will be allocated the full amount of memory" + + " We recommend reallocating memory after paralleliz-model to reduce the peak of memory allocation" + ) + + auto_dist_degree = { + "tensor_parallel": training_args.tensor_parallel_degree > 1, + "sequence_parallel": training_args.sequence_parallel, + "pipeline_parallel": training_args.pipeline_parallel_degree > 1, + "data_sharding_parallel": training_args.dataset_world_size > 1, + "sharding": training_args.sharding, + "sharding_mesh_dim": training_args.sharding_parallel_mesh_dimension, + } + auto_dist_config = model._generate_auto_dist_config(auto_dist_degree) + model = parallelize.parallelize_model( + model, + config=auto_dist_config, + ) + return model, auto_dist_config + def _nested_gather(self, tensors): """ Gather value of `tensors` (tensor or list/tuple of nested tensors) and convert them to numpy before @@ -162,7 +176,7 @@ def _wrap_for_auto(self, model, train_dataloader): if self.args.use_intermediate_api: assert self.auto_dist_config is not None - self.optimizer = parallelize_optimizer( + self.optimizer = parallelize.parallelize_optimizer( self.optimizer, config=self.auto_dist_config, ) @@ -387,7 +401,6 @@ def _inner_training_loop( model, dist_loader = self._wrap_for_auto(model, train_dataloader) train_dataloader = dist_loader() - if resume_from_checkpoint is not None: self._load_from_checkpoint(resume_from_checkpoint) @@ -700,8 +713,13 @@ def _save_checkpoint(self, model, metrics=None): for key, value in model.state_dict("opt").items() if not any(keyword in key for keyword in FREE_SVAE_LOAD_KEY_PATTERNS) } + model_state_dict = model.state_dict("param") + if self.args.should_save_model_with_tensor_fusion: + model_state_dict = self._convert_state_dict_for_saving_tensor_fusion_ckpt(model_state_dict) + opt_state_dict = self._convert_state_dict_for_saving_tensor_fusion_ckpt(opt_state_dict) + state_dict = { - MODEL_NAME: model.state_dict("param"), + MODEL_NAME: model_state_dict, OPTIMIZER_NAME: opt_state_dict, } else: @@ -841,6 +859,9 @@ def _load_from_checkpoint(self, resume_from_checkpoint=None): for key, value in self.model_wrapped.state_dict("opt").items() if not any(keyword in key for keyword in FREE_SVAE_LOAD_KEY_PATTERNS) } + if self.args.should_load_model_with_tensor_fusion: + model_state_dict = self._convert_state_dict_for_loading_tensor_fusion_ckpt(model_state_dict) + optim_state_dict = self._convert_state_dict_for_loading_tensor_fusion_ckpt(optim_state_dict) else: model_state_dict = self.model_wrapped.state_dict() optim_state_dict = self.optimizer.state_dict() @@ -875,7 +896,36 @@ def _load_from_checkpoint(self, resume_from_checkpoint=None): self._load_ckpt_func(state_dict, ckpt_path) if self.args.to_static: + if self.args.should_load_model_with_tensor_fusion: + model_state_dict = self._convert_state_dict_for_loading_model_with_tensor_fusion(model_state_dict) + optim_state_dict = self._convert_state_dict_for_loading_model_with_tensor_fusion(optim_state_dict) + self.model_wrapped.set_state_dict(model_state_dict) self.model_wrapped.set_state_dict(optim_state_dict) # release 
memory del state_dict + + def _convert_state_dict_for_loading_tensor_fusion_ckpt(self, state_dict): + if self.args.load_model_with_sharding_tensor_fusion: + logger.info("load sharding tensor fusion unbalanced model") + state_dict = self.model_wrapped._convert_state_dict_with_rank_unique_name(state_dict) + else: + logger.info("load sharding tensor fusion balanced model") + state_dict = self.model_wrapped._convert_state_dict_without_tensor_fusion_param(state_dict) + return state_dict + + def _convert_state_dict_for_loading_model_with_tensor_fusion(self, state_dict): + if self.args.load_model_with_sharding_tensor_fusion: + state_dict = self.model_wrapped._convert_state_dict_with_origin_name(state_dict) + else: + state_dict = self.model_wrapped._convert_state_dict_with_tensor_fusion_param(state_dict) + return state_dict + + def _convert_state_dict_for_saving_tensor_fusion_ckpt(self, state_dict): + if self.args.save_model_with_sharding_tensor_fusion: + logger.info("save sharding tensor fusion unbalanced model") + state_dict = self.model_wrapped._convert_state_dict_with_rank_unique_name(state_dict) + else: + logger.info("save sharding tensor fusion balanced model") + state_dict = self.model_wrapped._convert_state_dict_without_tensor_fusion_param(state_dict) + return state_dict diff --git a/paddlenlp/trainer/auto_training_args.py b/paddlenlp/trainer/auto_training_args.py index e9a184f725b2..21bf7633bad2 100644 --- a/paddlenlp/trainer/auto_training_args.py +++ b/paddlenlp/trainer/auto_training_args.py @@ -14,7 +14,7 @@ import json from dataclasses import dataclass, field -from .trainer_utils import split_parallel_config +from .trainer_utils import ShardingOption, split_parallel_config from .training_args import TrainingArguments from .utils import add_start_docstrings @@ -52,6 +52,29 @@ class AutoTrainingArguments(TrainingArguments): metadata={"help": "Weather to use auto_parallel intermediate api"}, ) refined_ops_patterns: str = field(default=None, metadata={"help": "The pattern of refined recompute."}) + load_model_with_sharding_tensor_fusion: bool = field( + default=False, + metadata={ + "help": ( + "When using sharding stage1, enabling tensor fusion, and setting `load_model_with_sharding_tensor_fusion` to `True`, " + "the model is loaded with unbalanced weights, meaning that the model weights are stored in an unbalanced format to avoid " + "additional memory overhead. If set to `False`, the model will be loaded with balanced weights, which may increase memory " + "consumption. This setting is only available in auto parallel to_static mode." + ) + }, + ) + save_model_with_sharding_tensor_fusion: bool = field( + default=False, + metadata={ + "help": ( + "When using sharding stage1 and enabling tensor fusion, setting `save_model_with_sharding_tensor_fusion` to `True` " + "saves the model with unbalanced weights, which helps avoid additional memory consumption. Setting it to `False` " + "saves the model with balanced weights, which may increase memory usage but ensures uniform parameter distribution. " + "This option allows flexibility in choosing the save format based on memory requirements. " + "This setting is only available in auto parallel to_static mode." 
+ ) + }, + ) def __post_init__(self): super().__post_init__() @@ -89,3 +112,13 @@ def __post_init__(self): recompute.refined_ops_patterns = ( self.refined_ops_patterns if self.refined_ops_patterns is not None else [] ) + + @property + def should_load_model_with_tensor_fusion(self): + return ( + self.enable_auto_parallel + and self.to_static + and ShardingOption.SHARD_OP in self.sharding + and self.sharding_parallel_degree > 1 + and "enable_tensor_fusion" in self.sharding_parallel_config + ) diff --git a/paddlenlp/trainer/trainer.py b/paddlenlp/trainer/trainer.py index eee06f66fb28..347b89a36752 100644 --- a/paddlenlp/trainer/trainer.py +++ b/paddlenlp/trainer/trainer.py @@ -87,6 +87,12 @@ from ..quantization.quantization_linear import QuantizationLinear except: QuantizationLinear = None +try: + from paddle.distributed.fleet.utils.sequence_parallel_utils import ( + register_sequence_parallel_allreduce_hooks, + ) +except: + pass from ..transformers.context_parallel_utils import split_inputs_sequence_dim_load_balance from ..transformers.model_utils import ( PretrainedModel, @@ -460,6 +466,9 @@ def fn(layer): # very last self._memory_tracker.stop_and_update_metrics() + if self.args.count_trained_tokens: + self.trained_effective_tokens = 0 + self.trained_tokens = 0 def _wrap_amp_model(self, args, model): logger.info("Using half precision") @@ -1116,6 +1125,9 @@ def _inner_training_loop( is_no_sync = True sync_context = model.no_sync() if is_no_sync else contextlib.nullcontext() + if self.args.count_trained_tokens: + self.trained_effective_tokens += (inputs["input_ids"] != self.args.pad_token_id).sum() + self.trained_tokens += inputs["input_ids"].numel() with sync_context: if "step_control" in inspect.signature(self.training_step).parameters: tr_loss_step = self.training_step(model, inputs, step_control=step_control) @@ -1499,13 +1511,13 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, ) seq_length = None - model_flops = None + model_flops_per_token = None if getattr(self, "is_pretraining", False) and hasattr(self.model, "config"): seq_length = getattr(self.model.config, "seq_length", None) try: - model_flops = self.model.get_hardware_flops(seq_length=seq_length, recompute=self.args.recompute) + model_flops_per_token = self.model.get_hardware_flops() except NotImplementedError: - model_flops = None + model_flops_per_token = None # Do not log speed metrics if all steps are skipped since last log. 
if num_steps > 0: @@ -1516,7 +1528,7 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, num_samples=total_train_batch_size * num_steps, num_steps=num_steps, seq_length=seq_length, - model_flops=model_flops, + model_flops_per_token=model_flops_per_token, ) ) @@ -1564,6 +1576,27 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, self._save_checkpoint(model, metrics=metrics) logger.info(f"{self.runtime_timer.log()}") self.control = self.callback_handler.on_save(self.args, self.state, self.control) + self.log_trained_tokens() + + def log_trained_tokens(self): + if self.args.count_trained_tokens: + token_list = [] + for token_num in [self.trained_effective_tokens, self.trained_tokens]: + tensors = token_num.reshape([1]) + if self.hcg._sharding_degree > 1: + output_tensors = [] + paddle.distributed.all_gather(output_tensors, tensors, group=self.hcg._sharding_comm_group) + tensors = paddle.concat(output_tensors).sum().reshape([1]) + if self.hcg._dp_degree > 1: + output_tensors = [] + paddle.distributed.all_gather(output_tensors, tensors, group=self.hcg._dp_comm_group) + tensors = paddle.concat(output_tensors).sum().reshape([1]) + token_list.append(tensors.item()) + if self.is_local_process_zero(): + + logger.info( + f"Update to now, trained_effective_tokens: {token_list[0]}, trained_tokens: {token_list[1]}." + ) def _get_learning_rate(self): return self.optimizer.get_lr() @@ -2043,6 +2076,11 @@ def _wrap_model(self, model, training=True): else: model, self.optimizer = decorated + if self.args.tensor_parallel_degree > 1 and self.args.sequence_parallel: + register_sequence_parallel_allreduce_hooks( + model, self.args.gradient_accumulation_steps, self.args.fuse_sequence_parallel_allreduce + ) + if self.args.world_size == 1: if self.args.amp_master_grad: mix_precision_utils.MixPrecisionLayer(model, dtype=self.amp_dtype) @@ -2535,7 +2573,49 @@ def _save_checkpoint(self, model, metrics=None): else: self.save_model(output_dir) - # only save model state dict, ignore optimizer and scheduler + # Determine the new best metric / best model checkpoint + if metrics is not None and self.args.metric_for_best_model is not None: + metric_to_check = self.args.metric_for_best_model + if not metric_to_check.startswith("eval_"): + metric_to_check = f"eval_{metric_to_check}" + metric_value = metrics[metric_to_check] + + operator = np.greater if self.args.greater_is_better else np.less + if ( + self.state.best_metric is None + or self.state.best_model_checkpoint is None + or operator(metric_value, self.state.best_metric) + ): + self.state.best_metric = metric_value + self.state.best_model_checkpoint = output_dir + + # Save the Trainer state + if self.args.should_save: + self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME)) + + # Save RNG state in non-distributed training + rng_states = { + "python": random.getstate(), + "numpy": np.random.get_state(), + "cuda": paddle.get_rng_state(), + "cpu": paddle.framework.core.default_cpu_generator().get_state(), + } + if self.args.use_hybrid_parallel: + rng_states[ + "hybrid_parallel_rng_state_tracker" + ] = fleet.meta_parallel.get_rng_state_tracker().get_states_tracker() + + if self.args.world_size > 1: + rng_states_list = [] + paddle.distributed.all_gather_object(rng_states_list, rng_states) + if self.args.should_save: + os.makedirs(output_dir, exist_ok=True) + paddle.save(rng_states_list, os.path.join(output_dir, f"rng_state_{self.args.world_size}.pth")) + else: + os.makedirs(output_dir, 
exist_ok=True) + paddle.save(rng_states, os.path.join(output_dir, "rng_state.pth")) + + # only save model state dict, ignore optimizer and scheduler if not self.args.ignore_save_lr_and_optim: optimizer_name = _add_variant(OPTIMIZER_NAME, self.args.optimizer_name_suffix) saved_signal_path = os.path.join(output_dir, f"saved_signal_{dist.get_rank()}") @@ -2621,47 +2701,6 @@ def _save_checkpoint(self, model, metrics=None): paddle.save(global_rank, os.path.join(signal_dir, f".master_weight.done.{global_rank}")) self.runtime_timer.stop() - # Determine the new best metric / best model checkpoint - if metrics is not None and self.args.metric_for_best_model is not None: - metric_to_check = self.args.metric_for_best_model - if not metric_to_check.startswith("eval_"): - metric_to_check = f"eval_{metric_to_check}" - metric_value = metrics[metric_to_check] - - operator = np.greater if self.args.greater_is_better else np.less - if ( - self.state.best_metric is None - or self.state.best_model_checkpoint is None - or operator(metric_value, self.state.best_metric) - ): - self.state.best_metric = metric_value - self.state.best_model_checkpoint = output_dir - - # Save the Trainer state - if self.args.should_save: - self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME)) - - # Save RNG state in non-distributed training - rng_states = { - "python": random.getstate(), - "numpy": np.random.get_state(), - "cuda": paddle.get_rng_state(), - "cpu": paddle.framework.core.default_cpu_generator().get_state(), - } - if self.args.use_hybrid_parallel: - rng_states[ - "hybrid_parallel_rng_state_tracker" - ] = fleet.meta_parallel.get_rng_state_tracker().get_states_tracker() - - if self.args.world_size > 1: - rng_states_list = [] - paddle.distributed.all_gather_object(rng_states_list, rng_states) - if self.args.should_save: - os.makedirs(output_dir, exist_ok=True) - paddle.save(rng_states_list, os.path.join(output_dir, f"rng_state_{self.args.world_size}.pth")) - else: - os.makedirs(output_dir, exist_ok=True) - paddle.save(rng_states, os.path.join(output_dir, "rng_state.pth")) # Maybe delete some older checkpoints. # For hybrid parallel training, the checkpoint files maybe on different node. 
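
As a side note on the `count_trained_tokens` bookkeeping introduced above: a small standalone sketch of the per-batch counting rule (effective tokens exclude `pad_token_id`, total tokens are the full `input_ids` size). This re-implements the idea with NumPy for clarity; it is not the Trainer's code path, which additionally all-gathers the counters across the sharding and data-parallel groups before logging.

```python
# Sketch of the token-counting rule behind count_trained_tokens (NumPy stand-in).
import numpy as np

pad_token_id = 0  # the new pad_token_id training argument defaults to 0 later in this patch
input_ids = np.array([[5, 8, 9, 0, 0],
                      [7, 3, 0, 0, 0]])  # hypothetical padded batch

trained_effective_tokens = int((input_ids != pad_token_id).sum())  # 5 non-pad tokens
trained_tokens = int(input_ids.size)                               # 10 tokens including padding
print(trained_effective_tokens, trained_tokens)
```
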
@@ -3112,6 +3151,9 @@ def evaluation_loop( if self.model is self.model_wrapped and isinstance(self.model_wrapped, PipelineLayer): # NOTE(gongenlei): when do_train=False, do_eval=True, we need to wrap model for pipeline self.model_wrapped = fleet.distributed_model(self.model_wrapped) + if isinstance(self.model_wrapped, LoRAModel) and isinstance(self.model_wrapped.model, PipelineLayer): + # NOTE(liuting): when do_train=False, do_eval=True, lora=True, we need to wrap model for pipeline + self.model_wrapped = fleet.distributed_model(self.model_wrapped.model) model = self.model_wrapped else: model = self.model diff --git a/paddlenlp/trainer/trainer_compress.py b/paddlenlp/trainer/trainer_compress.py index f2f945cd128f..44420ca9de02 100644 --- a/paddlenlp/trainer/trainer_compress.py +++ b/paddlenlp/trainer/trainer_compress.py @@ -38,6 +38,7 @@ prepare_qkv_ofa, reorder_neuron_head, ) +from ..utils.env import PADDLE_INFERENCE_MODEL_SUFFIX, PADDLE_INFERENCE_WEIGHTS_SUFFIX from ..utils.log import logger from .trainer import Trainer @@ -651,8 +652,8 @@ def _batch_generator_func(): executor=exe, batch_generator=_batch_generator_func, model_dir=model_dir, - model_filename=args.input_filename_prefix + ".pdmodel", - params_filename=args.input_filename_prefix + ".pdiparams", + model_filename=args.input_filename_prefix + PADDLE_INFERENCE_MODEL_SUFFIX, + params_filename=args.input_filename_prefix + PADDLE_INFERENCE_WEIGHTS_SUFFIX, batch_size=batch_size, batch_nums=batch_nums, scope=None, @@ -675,8 +676,8 @@ def _batch_generator_func(): save_model_path = os.path.join(model_dir, algo + "_".join([str(batch_size), str(batch_nums)])) post_training_quantization.save_quantized_model( save_model_path=save_model_path, - model_filename=args.output_filename_prefix + ".pdmodel", - params_filename=args.output_filename_prefix + ".pdiparams", + model_filename=args.output_filename_prefix + PADDLE_INFERENCE_MODEL_SUFFIX, + params_filename=args.output_filename_prefix + PADDLE_INFERENCE_WEIGHTS_SUFFIX, ) output_dir_list.append(save_model_path) diff --git a/paddlenlp/trainer/trainer_utils.py b/paddlenlp/trainer/trainer_utils.py index 538f4c8ec32d..6a3f9754712e 100644 --- a/paddlenlp/trainer/trainer_utils.py +++ b/paddlenlp/trainer/trainer_utils.py @@ -359,7 +359,7 @@ def total_processes_number(local_rank): return 1 -def speed_metrics(split, start_time, num_samples=None, num_steps=None, seq_length=None, model_flops=None): +def speed_metrics(split, start_time, num_samples=None, num_steps=None, seq_length=None, model_flops_per_token=None): """ Measure and return speed performance metrics. 
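
The hunk that follows changes the hardware-TFLOPS estimate to consume per-token FLOPs directly, so the division by `seq_length` drops out. A worked example with made-up numbers; the 6N FLOPs-per-token rule of thumb and the throughput figure are assumptions, not values from the patch.

```python
# Worked example of the revised metric:
#   TFLOPS per device = tokens/s/device * FLOPs-per-token / 2**40
tokens_per_second_per_device = 8000   # hypothetical measured throughput
model_flops_per_token = 6 * 7e9       # rough 6*N rule for a 7B-parameter model (assumption)

hardware_tflops_per_device = round(
    tokens_per_second_per_device * model_flops_per_token / 2**40, 2
)
print(hardware_tflops_per_device)  # 305.59 with these made-up inputs
```
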
@@ -380,9 +380,9 @@ def speed_metrics(split, start_time, num_samples=None, num_steps=None, seq_lengt if seq_length is not None: tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size() result[f"{split}_tokens_per_second_per_device"] = round(tokens_per_second_per_device, 4) - if model_flops is not None: + if model_flops_per_token is not None: result[f"{split}_hardware_tflops_per_device"] = round( - tokens_per_second_per_device * model_flops / seq_length / 2**40, 2 + tokens_per_second_per_device * model_flops_per_token / 2**40, 2 ) if num_steps is not None: diff --git a/paddlenlp/trainer/training_args.py b/paddlenlp/trainer/training_args.py index 6f80032cb347..b52e5137dd5e 100644 --- a/paddlenlp/trainer/training_args.py +++ b/paddlenlp/trainer/training_args.py @@ -285,12 +285,15 @@ class TrainingArguments: Some additional config it highly affect the useage of sharding parallel, we provide some option to config it. following config is support: enable_stage1_tensor_fusion, fuse small tensors into big tensor chunks to accelerate communications, may increase memory occupation + enable_tensor_fusion, fuse small tensors into big tensor chunks to accelerate communications, may increase memory occupation only used for semi auto mode. enable_stage1_overlap, fuse small tensors into big tensor chunks to accelerate communications and do communication overlap with backward computation, may harm the backward speed + enable_overlap, fuse small tensors into big tensor chunks to accelerate communications and do communication overlap with backward computation, may harm the backward speed only used for semi auto mode. enable_stage2_overlap, overlap stage2 NCCL communication with computation. There are some constraints for the overlap, such as the logging_step should be bigger than 1 for broadcast overlap and no other sync could be called during the training for broadcast overlap. enable_stage1_broadcast_overlap, overlap stage1 V1 broadcast with next step forward computation. There are some constraints for the overlap, such as the logging_step should be bigger than 1 for broadcast overlap forward compute and no other sync could be called during the training for broadcast overlap. enable_stage1_allgather_overlap, overlap stage1 V2 allgather with next step forward computation. There are some constraints for the overlap, such as the logging_step should be bigger than 1 for allgather overlap forward compute and no other sync could be called during the training for allgather overlap. disable_stage1_reduce_avg, replace reduce_avg with original reduce_sum+scale in stage1, which can be used for accuracy verification. enable_release_grads, reduce peak memory usage by releasing gradients after each iteration. The creation of gradients will be postponed until backward propagation of the next iteration. + enable_fuse_optimizer_states, fuse optimizer states to a single storage. recompute (`bool`, *optional*, defaults to `False`): Recompute the forward pass to calculate gradients. Used for saving memory. Only support for networks with transformer blocks. 
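
To make the gating of the new tensor-fusion save behaviour explicit, here is a standalone restatement (a sketch, not the actual property added to `TrainingArguments` further down in this patch) of the `should_save_model_with_tensor_fusion` condition; the string `"stage1"` stands in for `ShardingOption.SHARD_OP`.

```python
# Sketch: the condition behind should_save_model_with_tensor_fusion, written as
# a free function for readability. The real property lives on TrainingArguments.
def should_save_model_with_tensor_fusion(
    enable_auto_parallel: bool,
    to_static: bool,
    sharding: list,
    sharding_parallel_degree: int,
    sharding_parallel_config: str,
) -> bool:
    return (
        enable_auto_parallel
        and to_static
        and "stage1" in sharding                     # ShardingOption.SHARD_OP
        and sharding_parallel_degree > 1
        and "enable_tensor_fusion" in sharding_parallel_config
    )


print(should_save_model_with_tensor_fusion(
    True, True, ["stage1"], 8, "enable_tensor_fusion enable_overlap"
))  # True
```
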
@@ -616,6 +619,7 @@ class TrainingArguments: ) }, ) + tensor_parallel_degree: int = field( default=-1, metadata={ @@ -669,6 +673,13 @@ class TrainingArguments: ) }, ) + sequence_parallel: bool = field( + default=False, + metadata={"help": "Whether to enable sequence parallel."}, + ) + fuse_sequence_parallel_allreduce: bool = field( + default=False, metadata={"help": "Whether to use fuse sequence parallel allreduce."} + ) sequence_parallel_config: str = field( default="", metadata={ @@ -723,12 +734,13 @@ class TrainingArguments: "Some additional config it highly affect the useage of sharding parallel, we provide some option to config it." "following config is support: \n" "enable_stage1_tensor_fusion, fuse small tensors into big tensor chunks to accelerate communications, may increase memory occupation\n" + "enable_tensor_fusion, fuse small tensors into big tensor chunks to accelerate communications, may increase memory occupation only used for semi auto mode.\n" "enable_stage1_overlap, fuse small tensors into big tensor chunks to accelerate communications and do communication overlap with backward computation, may harm the backward speed\n" + "enable_overlap, fuse small tensors into big tensor chunks to accelerate communications and do communication overlap with backward computation, may harm the backward speed only used for semi auto mode.\n" "disable_stage1_reduce_avg, replace reduce_avg with original reduce_sum+scale in stage1, which can be used for accuracy verification.\n" "enable_stage2_overlap, overlap stage2 NCCL communication with computation. There are some constraints for the overlap, such as the logging_step should be bigger than 1 for broadcast overlap and no other sync could be called during the training for broadcast overlap\n" "enable_stage1_broadcast_overlap, overlap stage1 V1 broadcast with next step forward computation. There are some constraints for the overlap, such as the logging_step should be bigger than 1 for broadcast overlap forward compute and no other sync could be called during the training for broadcast overlap.\n" "enable_stage1_allgather_overlap, overlap stage1 V2 allgather with next step forward computation. There are some constraints for the overlap, such as the logging_step should be bigger than 1 for allgather overlap forward compute and no other sync could be called during the training for allgather overlap.\n" - "enable_stage1_tensor_fusion_blanced_save_load, convert unbalanced optimizer state to balanced state when using tensor fusion strategy, which may increase the memory occupation." ) }, ) @@ -966,6 +978,14 @@ class TrainingArguments: default=300, metadata={"help": "Timeout seconds for downloading checkpoint from remote cluster."}, ) + count_trained_tokens: bool = field( + default=False, + metadata={"help": "Whether to count trained tokens."}, + ) + pad_token_id: int = field( + default=0, + metadata={"help": "The id of the padding token."}, + ) def __post_init__(self): if in_auto_parallel_align_mode(): @@ -1209,10 +1229,17 @@ def __post_init__(self): f"Found unknown pipeline mode config {x}, accpet config is disable_p2p_cache_shape, disable_partial_send_recv." ) + enable_partial_send_recv = "disable_partial_send_recv" not in pipeline_parallel_config + if self.sequence_parallel and enable_partial_send_recv: + logger.warning( + "When use pipeline parallel and sequence parallel simultaneously, we should turn off partial send recv." 
+ ) + enable_partial_send_recv = False + strategy.pipeline_configs = { "accumulate_steps": self.gradient_accumulation_steps, "micro_batch_size": self.per_device_train_batch_size, - "enable_partial_send_recv": "disable_partial_send_recv" not in pipeline_parallel_config, + "enable_partial_send_recv": enable_partial_send_recv, "p2p_cache_shape": False if "disable_p2p_cache_shape" in pipeline_parallel_config else True, # "delay_scale_loss": True, Fix ME } @@ -1394,10 +1421,11 @@ def is_segment_parallel_supported(): "enable_stage1_broadcast_overlap", "enable_stage1_allgather_overlap", "enable_release_grads", + "enable_fuse_optimizer_states", ]: raise ValueError( - f"Found unknown pipeline mode config {x}, " - f"accpet config is enable_stage1_tensor_fusion, enable_stage1_overlap, enable_stage2_overlap, split_param, disable_stage1_reduce_avg, enable_stage1_broadcast_overlap, enable_stage1_allgather_overlap." + f"Found unknown sharding mode config {x}, " + f"accpet config is enable_stage1_tensor_fusion, enable_stage1_overlap, enable_stage2_overlap, split_param, disable_stage1_reduce_avg, enable_stage1_broadcast_overlap, enable_stage1_allgather_overlap, enable_release_grads, enable_fuse_optimizer_states." ) if "disable_stage1_reduce_avg" in sharding_parallel_config: assert self.sharding == [ @@ -1423,6 +1451,9 @@ def is_segment_parallel_supported(): if "enable_release_grads" in sharding_parallel_config: strategy.hybrid_configs["sharding_configs"].release_gradients = True + if "enable_fuse_optimizer_states" in sharding_parallel_config: + strategy.hybrid_configs["sharding_configs"].enable_fuse_optimizer_states = True + if self.pipeline_parallel_degree == 1: strategy.hybrid_configs["sharding_configs"].tensor_fusion = ( True if "enable_stage1_tensor_fusion" in sharding_parallel_config else False @@ -1609,13 +1640,12 @@ def is_segment_parallel_supported(): "enable_mp_async_allreduce", # allreduce_matmul_grad_overlapping in auto_parallel "enable_delay_scale_loss", "replace_with_c_embedding", - # "enable_mp_skip_c_identity", # "enable_mp_fused_linear_param_grad_add", "replace_with_parallel_cross_entropy", ]: raise ValueError( f"Found unknown tensor parallell config {x}, " - f"accept config is enable_mp_async_allreduce, replace_with_c_embedding, enable_mp_skip_c_identity and enable_mp_fused_linear_param_grad_add" + f"accept config is enable_mp_async_allreduce, replace_with_c_embedding, and enable_mp_fused_linear_param_grad_add" ) try: if "enable_mp_async_allreduce" in mp_config: @@ -1645,28 +1675,32 @@ def is_segment_parallel_supported(): for x in sharding_parallel_config: if len(x) > 0: if x not in [ - "enable_stage1_tensor_fusion", - "enable_stage1_overlap", - "enable_stage2_overlap", + "enable_tensor_fusion", + "enable_overlap", "enable_release_grads", - "enable_stage1_tensor_fusion_blanced_save_load", ]: + if x in ["enable_stage1_overlap", "enable_stage2_overlap"]: + raise ValueError( + "enable_stage1_overlap and enable_stage2_overlap are not supported in " + "auto_parallel mode. Please use enable_overlap instead." + ) + elif x == "enable_stage1_tensor_fusion": + raise ValueError( + "enable_stage1_tensor_fusion is not supported in auto_parallel mode. " + "Please use enable_tensor_fusion instead." + ) raise ValueError( - f"Found unknown pipeline mode config {x}, " f"accpet config is reduce_overlap." + f"Found unknown sharding mode config {x}, " + f"accpet config is enable_tensor_fusion, " + "enable_overlap, enable_release_grads." 
) - if ( - "enable_stage1_overlap" in sharding_parallel_config - or "enable_stage2_overlap" in sharding_parallel_config - ): + if "enable_overlap" in sharding_parallel_config: sharding.enable_overlap = True - if "enable_stage1_tensor_fusion" in sharding_parallel_config: + if "enable_tensor_fusion" in sharding_parallel_config: sharding.grad_bucket_size_numel = 210355872 - sharding.enable_stage1_tensor_fusion = True - - if "enable_stage1_tensor_fusion_blanced_save_load" in sharding_parallel_config: - sharding.save_unbalanced_param = False + sharding.enable_tensor_fusion = True if "enable_release_grads" in sharding_parallel_config: sharding.release_gradients = True @@ -2242,3 +2276,13 @@ def print_config(self, args=None, key=""): logger.debug("{:30}: {}".format(a, v)) logger.debug("") + + @property + def should_save_model_with_tensor_fusion(self): + return ( + self.enable_auto_parallel + and self.to_static + and ShardingOption.SHARD_OP in self.sharding + and self.sharding_parallel_degree > 1 + and "enable_tensor_fusion" in self.sharding_parallel_config + ) diff --git a/paddlenlp/trainer/unified_checkpoint/load_local.py b/paddlenlp/trainer/unified_checkpoint/load_local.py index c2ccbe4d240b..4e37f44b4d19 100644 --- a/paddlenlp/trainer/unified_checkpoint/load_local.py +++ b/paddlenlp/trainer/unified_checkpoint/load_local.py @@ -149,14 +149,6 @@ def _remove_unused_keys( def load_unified_optimizer_locally(args, model, optimizer, resume_from_checkpoint, safe_serialization=False): - # Special process with split param. - if is_sharding_split_param_mode(args): - returned_optim_state_dict = load_unified_optimizer_split_param(args, model, optimizer, resume_from_checkpoint) - return returned_optim_state_dict - - # init and get optimizer LR_Scheduler - returned_optim_state_dict = nested_copy(optimizer.state_dict()) - if not safe_serialization: index_filename, index_filename_master_weights = ( PADDLE_OPTIMIZER_INDEX_NAME, @@ -165,6 +157,23 @@ def load_unified_optimizer_locally(args, model, optimizer, resume_from_checkpoin else: index_filename, index_filename_master_weights = SAFE_OPTIMIZER_INDEX_NAME, SAFE_MASTER_WEIGHTS_INDEX_NAME + with open(os.path.join(resume_from_checkpoint, index_filename), "r") as f: + index = json.loads(f.read()) + + ckpt_quant_stage = "O0" + if "ckpt_quant_stage" in index: + ckpt_quant_stage = index["ckpt_quant_stage"] + + # Special process with split param. + if is_sharding_split_param_mode(args): + returned_optim_state_dict = load_unified_optimizer_split_param( + args, model, optimizer, resume_from_checkpoint, ckpt_quant_stage + ) + return returned_optim_state_dict + + # init and get optimizer LR_Scheduler + returned_optim_state_dict = nested_copy(optimizer.state_dict()) + resolved_archive_file, sharded_metadata = get_optimizer_shard_files( optimizer_path=resume_from_checkpoint, index_filename=os.path.join(resume_from_checkpoint, index_filename), @@ -184,13 +193,6 @@ def load_unified_optimizer_locally(args, model, optimizer, resume_from_checkpoin if len(resolved_archive_file) > 1: resolved_archive_file = tqdm(resolved_archive_file, desc="Loading optimizer shards") - with open(os.path.join(resume_from_checkpoint, index_filename), "r") as f: - index = json.loads(f.read()) - - ckpt_quant_stage = "O0" - if "ckpt_quant_stage" in index: - ckpt_quant_stage = index["ckpt_quant_stage"] - # update has_master_weights and index_filename_master_weights # 1. if the master weight exists, only has_master_weights is set True and loaded when needed # 2. 
if master weight does not exist, convert model weight to master weight when needed diff --git a/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py b/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py index fda80fca0a61..9b162d4a88c1 100644 --- a/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py +++ b/paddlenlp/trainer/unified_checkpoint/sharding_split_param_utils.py @@ -36,18 +36,25 @@ get_expected_state_dict, get_optimizer_shard_files, mapping_optimizer_tp_actions, + update_master_weight_status, ) __all__ = ["gather_splited_param_for_optimizer", "load_unified_optimizer_split_param"] def merge_splited_param( - state_dict, partial_tensor_list, param_shape_info, send_table, recv_table, is_master_weights=False + state_dict, + partial_tensor_list, + param_shape_info, + send_table, + recv_table, + is_master_weights=False, + ckpt_quant_stage="O0", ): """Merge the splited param in sharding group.""" global_rank = dist.get_rank() for key in list(state_dict.keys()): - if state_dict[key].numel().item() == 1: # for example: beta1, beta2 + if int(state_dict[key].numel()) == 1: # for example: beta1, beta2 continue static_name = key if is_master_weights else generate_base_static_name(key)[0] @@ -89,10 +96,21 @@ def merge_splited_param( ) dist.stream.send(tensor, dst=recv_rank) state_dict.pop(key) + + if ckpt_quant_stage != "O0": + for key in list(state_dict.keys()): + if int(state_dict[key].numel()) == 1: # for example: beta1, beta2 + static_name = key if is_master_weights else generate_base_static_name(key)[0] + if static_name in partial_tensor_list: + recv_rank = recv_table[static_name] + send_info = send_table[static_name] + if global_rank != recv_rank: + state_dict.pop(key) + return state_dict -def gather_splited_param_for_optimizer(optimizer): +def gather_splited_param_for_optimizer(optimizer, ckpt_quant_stage="O0"): hcg = fleet.get_hybrid_communicate_group() sharding_group = hcg.get_sharding_parallel_group() global_rank = dist.get_rank() @@ -127,7 +145,7 @@ def gather_splited_param_for_optimizer(optimizer): for key in list(optim_state_dict.keys()): static_name, _ = generate_base_static_name(key) if static_name in param_slice_info.keys(): - if optim_state_dict[key].numel().item() == 1: # for example: beta1, beta2 + if int(optim_state_dict[key].numel()) == 1: # for example: beta1, beta2 continue begin, end = param_slice_info[static_name] shape, numel, _, _ = param_shape_info[static_name] @@ -149,13 +167,15 @@ def gather_splited_param_for_optimizer(optimizer): recv_table[key] = sharding_ranklist[0][0] # which sharding_rank to recv the splited tensor send_table[key] = [(rank, begin, end) for rank, begin, end in sharding_ranklist] - merge_splited_param(optim_state_dict, partial_tensor_list, param_shape_info, send_table, recv_table, False) + merge_splited_param( + optim_state_dict, partial_tensor_list, param_shape_info, send_table, recv_table, False, ckpt_quant_stage + ) if master_weights is not None: merge_splited_param(master_weights, partial_tensor_list, param_shape_info, send_table, recv_table, True) return optim_state_dict, master_weights -def load_unified_optimizer_split_param(args, model, optimizer, resume_from_checkpoint): +def load_unified_optimizer_split_param(args, model, optimizer, resume_from_checkpoint, ckpt_quant_stage="O0"): returned_optim_state_dict = nested_copy(optimizer.state_dict()) index_filename, index_filename_master_weights = SAFE_OPTIMIZER_INDEX_NAME, SAFE_MASTER_WEIGHTS_INDEX_NAME @@ -208,6 +228,10 @@ def 
load_unified_optimizer_split_param(args, model, optimizer, resume_from_check if len(resolved_archive_file) > 1: resolved_archive_file = tqdm(resolved_archive_file, desc="Loading optimizer shards") + has_master_weights, index_filename_master_weights = update_master_weight_status( + args, optimizer, has_master_weights, safe_serialization=True + ) + if has_master_weights: returned_optim_state_dict["master_weights"] = {} resolved_archive_file_mw, sharded_metadata_mw = get_optimizer_shard_files( @@ -217,7 +241,9 @@ def load_unified_optimizer_split_param(args, model, optimizer, resume_from_check if len(resolved_archive_file_mw) > 1: resolved_archive_file_mw = tqdm(resolved_archive_file_mw, desc="Loading master weights shards") - def load_resolved_archive_file(resolved_archive_file, sharded_metadata, expected_keys, is_master_weights=False): + def load_resolved_archive_file( + resolved_archive_file, sharded_metadata, expected_keys, is_master_weights=False, ckpt_quant_stage="O0" + ): returned_state_dict = {} if model.config.tensor_parallel_degree > 1: @@ -232,9 +258,21 @@ def load_resolved_archive_file(resolved_archive_file, sharded_metadata, expected if expected_keys.isdisjoint(sharded_metadata["file_map"][os.path.split(shard_file)[-1]]): continue if model.config.tensor_parallel_degree > 1: - state_dict = load_state_dict(shard_file, tp_actions, expected_keys, device="cpu") + state_dict = load_state_dict( + shard_file, + tp_actions, + expected_keys, + device="cpu", + ckpt_quant_stage=ckpt_quant_stage, + ) else: - state_dict = load_state_dict(shard_file, None, expected_keys, device="cpu") + state_dict = load_state_dict( + shard_file, + None, + expected_keys, + device="cpu", + ckpt_quant_stage=ckpt_quant_stage, + ) returned_state_dict.update(state_dict) del state_dict gc.collect() @@ -242,14 +280,16 @@ def load_resolved_archive_file(resolved_archive_file, sharded_metadata, expected return returned_state_dict # get tp params - state_dict_optim = load_resolved_archive_file(resolved_archive_file, sharded_metadata, expected_keys_optim) + state_dict_optim = load_resolved_archive_file( + resolved_archive_file, sharded_metadata, expected_keys_optim, ckpt_quant_stage=ckpt_quant_stage + ) # need to split param for different sharding rank, maybe need to deal with oom issue. for key in list(state_dict_optim.keys()): key_name = key.split("/") static_name = struct2static_name_mappings.get(key_name[0], None) - if state_dict_optim[key].numel().item() > 1: + if int(state_dict_optim[key].numel()) > 1: begin, end = param_slice_info[static_name] shape, numel, index, padded_size = param_shape_info[static_name] state_dict_optim[key] = state_dict_optim[key].reshape([-1]) @@ -265,7 +305,11 @@ def load_resolved_archive_file(resolved_archive_file, sharded_metadata, expected ) ) if has_master_weights: - key_name = "_".join([static_name, FP32_MASTER, key_name[1]]) + if model_state_dict[key_name[0]].dtype != paddle.float32: + key_name = "_".join([static_name, FP32_MASTER, key_name[1]]) + else: + # for parameters with float32 dtype, no need to have fp32 master weights. 
+ key_name = "_".join([static_name, key_name[1]]) else: key_name = "_".join([static_name, key_name[1]]) @@ -284,7 +328,7 @@ def load_resolved_archive_file(resolved_archive_file, sharded_metadata, expected for key in list(state_dict_master_weight.keys()): static_name = struct2static_name_mappings.get(key, None) - if state_dict_master_weight[key].numel().item() > 1: + if int(state_dict_master_weight[key].numel()) > 1: begin, end = param_slice_info[static_name] shape, numel, index, padded_size = param_shape_info[static_name] state_dict_master_weight[key] = state_dict_master_weight[key].reshape([-1]) @@ -303,6 +347,13 @@ def load_resolved_archive_file(resolved_archive_file, sharded_metadata, expected paddle.framework._current_expected_place(), False ) returned_optim_state_dict["master_weights"][static_name] = state_dict_master_weight.pop(key) + + # master weight cast (only in remove_master_weight) + if returned_optim_state_dict["master_weights"][static_name].dtype != paddle.float32: + returned_optim_state_dict["master_weights"][static_name] = paddle.cast( + returned_optim_state_dict["master_weights"][static_name], dtype=paddle.float32 + ) + returned_optim_state_dict["master_weights"][static_name].name = "_".join([static_name, FP32_MASTER]) return returned_optim_state_dict diff --git a/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py b/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py index 0cb38bec94bb..41ba54972efb 100644 --- a/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py +++ b/paddlenlp/trainer/unified_checkpoint/unified_checkpoint.py @@ -29,7 +29,7 @@ unwrap_model, ) from paddlenlp.transformers.utils import dtype_byte_size -from paddlenlp.utils import infohub +from paddlenlp.utils import empty_device_cache, infohub from paddlenlp.utils.env import ( LORA_WEIGHTS_NAME, MAX_QUANTIZATION_TIMES, @@ -67,6 +67,7 @@ FP32_MASTER, UnifiedCheckpointOption, filter_params, + filter_sync_parameters, gather_sharded_object, generate_base_static_name, get_expected_state_dict, @@ -158,7 +159,7 @@ def save_unified_checkpoint(self, model, optimizer, output_dir, signal_dir=None) if self.args.should_save: save_model_config(model_to_save, save_directory) - paddle.device.cuda.empty_cache() + empty_device_cache() if strtobool(os.getenv("FLAG_LLM_PDC", "False")) and self.args.should_save: world_size = paddle.distributed.get_world_size() @@ -195,7 +196,7 @@ def load_unified_checkpoint(self, model, resume_from_checkpoint: str): load_unified_checkpoint_locally(self.args, model, resume_from_checkpoint, safe_serialization=True) def save_non_merge_optimizer(self, model, optim_state_dict, master_weights, output_dir, signal_dir): - paddle.device.cuda.empty_cache() + empty_device_cache() # gather global master_weights status. 
global_master_weights = reduce_master_weights_status(master_weights is not None) @@ -218,25 +219,9 @@ def save_non_merge_optimizer(self, model, optim_state_dict, master_weights, outp for key in list(master_weights.keys()): master_weights[static2struct_name_mappings[key]] = master_weights.pop(key) - no_sync_kname = [] - model_state_dict = get_expected_state_dict(model) - for k, v in model_state_dict.items(): - if getattr(v, "no_sync", False): - no_sync_kname.append(k) - - hcg = fleet.get_hybrid_communicate_group() - dp_group = hcg.get_data_parallel_group() - dp_rank = dp_group.rank if dp_group.nranks > 1 else 0 if self.args.use_expert_parallel: - for k in list(optim_state_dict.keys()): - model_k = k.split("/")[0] - if dp_rank > 0 and model_k not in no_sync_kname: - optim_state_dict.pop(k) - if master_weights is not None: - for k in list(master_weights.keys()): - model_k = k.split("/")[0] - if dp_rank > 0 and model_k not in no_sync_kname: - master_weights.pop(k) + model_state_dict = get_expected_state_dict(model) + filter_sync_parameters(model_state_dict, optim_state_dict, master_weights, is_model_weight=False) optimizer_name = _add_variant(SAFE_OPTIMIZER_NAME, self.args.optimizer_name_suffix) master_weights_name = _add_variant(SAFE_MASTER_WEIGHTS_NAME, self.args.optimizer_name_suffix) @@ -344,7 +329,9 @@ def save_unified_optimizer(self, model, optimizer, output_dir, signal_dir): return if is_sharding_split_param_mode(self.args): - optim_state_dict, master_weights = gather_splited_param_for_optimizer(optimizer) + optim_state_dict, master_weights = gather_splited_param_for_optimizer( + optimizer, self.args.ckpt_quant_stage if "quant_reach_limit" not in infohub else "O0" + ) else: optim_state_dict = nested_copy(optimizer.state_dict()) master_weights = None @@ -373,7 +360,7 @@ def save_unified_optimizer(self, model, optimizer, output_dir, signal_dir): optim_state_dict, shard_optim_file, sharded_optim_index = results[0] master_weight_state_dict, shard_master_weight_file, sharded_master_weight_index = results[1] - paddle.device.cuda.empty_cache() + empty_device_cache() save_directory = output_dir os.makedirs(save_directory, exist_ok=True) if signal_dir is not None: @@ -506,7 +493,7 @@ def unified_checkpoint_into_shards( Returns: tuple: state_dict, config, shard_file: file name, sharded_index: map for weight to file name. """ - paddle.device.cuda.empty_cache() + empty_device_cache() assert hasattr(model_to_save, "config") state_dict = get_expected_state_dict(model_to_save, concat_additional_adapter=True) @@ -514,6 +501,10 @@ def unified_checkpoint_into_shards( config_to_save = copy.deepcopy(model_to_save.config) + if args.use_expert_parallel: + # ignore saving `no_sync=False` tensors when using expert_parallel under dp_rank > 0. + filter_sync_parameters(state_dict, is_model_weight=True) + if config_to_save.tensor_parallel_degree > 1: if isinstance(model_to_save, LoRAModel) or isinstance(model_to_save, PrefixModelForCausalLM): tp_actions = model_to_save._get_tensor_parallel_convert_actions( @@ -558,7 +549,7 @@ def unified_checkpoint_into_shards( elif isinstance(model_to_save, PrefixModelForCausalLM): sharded_index["type"] = "ptuning" - paddle.device.cuda.empty_cache() + empty_device_cache() return state_dict, shard_file, sharded_index @@ -576,7 +567,7 @@ def unified_optimizer_into_shards( optimizer (Optimizer): optimizer to save. safe_serialization (bool, optional): safe serialization using safetensors. Defaults to False. 
""" - paddle.device.cuda.empty_cache() + empty_device_cache() # gather global master_weights status. global_master_weights = reduce_master_weights_status(master_weights is not None) @@ -623,6 +614,9 @@ def unified_optimizer_into_shards( tp_group = fleet.get_hybrid_communicate_group().get_model_parallel_group() tp_size = tp_group.nranks + if args.use_expert_parallel: + filter_sync_parameters(state_dict, optim_state_dict, master_weights, is_model_weight=False) + if tp_size > 1: # get tp_actions model_keys = [] @@ -641,9 +635,8 @@ def unified_optimizer_into_shards( optim_state_dict, tp_actions, filter_optim_keys, - state_dict if args.use_expert_parallel else None, ) - paddle.device.cuda.empty_cache() + empty_device_cache() if master_weights is not None: logger.info("Unified master weight tensor parallel in shards") @@ -651,9 +644,8 @@ def unified_optimizer_into_shards( master_weights, tp_actions, filter_master_keys, - state_dict if args.use_expert_parallel else None, ) - paddle.device.cuda.empty_cache() + empty_device_cache() # build index json file index_optimizer_file, index_master_weight_file = {}, {} @@ -704,7 +696,7 @@ def unified_optimizer_into_shards( else: sharded_optim_index["master_weights"] = False - paddle.device.cuda.empty_cache() + empty_device_cache() if master_weights is None: return [(optim_state_dict, shard_optimizer_file, sharded_optim_index)] else: diff --git a/paddlenlp/trainer/unified_checkpoint/utils.py b/paddlenlp/trainer/unified_checkpoint/utils.py index bbb49ae14820..413ca7c47210 100644 --- a/paddlenlp/trainer/unified_checkpoint/utils.py +++ b/paddlenlp/trainer/unified_checkpoint/utils.py @@ -354,9 +354,7 @@ def merge_tensor_parallel_with_shard(state_dict, tp_actions, all_filter_keys): """ hcg = fleet.get_hybrid_communicate_group() tp_group = hcg.get_model_parallel_group() - dp_group = hcg.get_data_parallel_group() tp_rank = tp_group.rank - dp_rank = dp_group.rank if dp_group.nranks > 1 else 0 # filter actions for pipeline mode if hcg.get_pipe_parallel_group().nranks > 1: @@ -373,10 +371,9 @@ def merge_tensor_parallel_with_shard(state_dict, tp_actions, all_filter_keys): if i > len(filter_keys) - 1: continue key = filter_keys[i] - tensor = state_dict[key] - # When using expert parallel, there's no need to save tensors with `no_sync=False` when dp_rank > 0. - if dp_rank > 0 and not getattr(tensor, "no_sync", False): + if key not in state_dict: continue + tensor = state_dict[key] if key in tp_actions: # Get tensor size tensor_bytes = tensor.numel().item() * dtype_byte_size(tensor.dtype) * tp_group.nranks @@ -405,21 +402,13 @@ def merge_tensor_parallel_with_shard(state_dict, tp_actions, all_filter_keys): return state_dict_to_save -def merge_tensor_parallel_for_optimizer(state_dict, tp_actions, all_filter_keys, model_state_dict=None): +def merge_tensor_parallel_for_optimizer(state_dict, tp_actions, all_filter_keys): """ Merge tensor parallel according to tp_actions, used for master_weight and optimizer weight. 
""" hcg = fleet.get_hybrid_communicate_group() tp_group = hcg.get_model_parallel_group() - dp_group = hcg.get_data_parallel_group() tp_rank = tp_group.rank - dp_rank = dp_group.rank if dp_group.nranks > 1 else 0 - - no_sync_kname = [] - if model_state_dict is not None: - for k, v in model_state_dict.items(): - if getattr(v, "no_sync", False): - no_sync_kname.append(k) state_dict_to_save = {} max_key_len = max([len(_) for _ in all_filter_keys]) @@ -430,10 +419,9 @@ def merge_tensor_parallel_for_optimizer(state_dict, tp_actions, all_filter_keys, continue # get base model key model_key = filter_keys[i].split("/")[0] - tensor = state_dict[filter_keys[i]] - # When using expert parallel, there's no need to save tensors with `no_sync=False` when dp_rank > 0. - if dp_rank > 0 and model_key not in no_sync_kname: + if filter_keys[i] not in state_dict: continue + tensor = state_dict[filter_keys[i]] if model_key in tp_actions: # for example: beta1, beta2 if tensor.numel().item() == 1: @@ -770,3 +758,31 @@ def save_config(model_to_save): # save generation config if model_to_save.can_generate(): model_to_save.generation_config.save_pretrained(save_directory) + + +def filter_sync_parameters(model_state_dict, optim_state_dict=None, master_weights=None, is_model_weight=True): + """Filter sync parameters under expert parallel mode.""" + + hcg = fleet.get_hybrid_communicate_group() + dp_group = hcg.get_data_parallel_group() + dp_rank = dp_group.rank if dp_group.nranks > 1 else 0 + + if is_model_weight: + for key in list(model_state_dict.keys()): + if dp_rank > 0 and not getattr(model_state_dict[key], "no_sync", False): + model_state_dict.pop(key) + else: + no_sync_kname = [] + for k, v in model_state_dict.items(): + if getattr(v, "no_sync", False): + no_sync_kname.append(k) + + for key in list(optim_state_dict.keys()): + model_key = key.split("/")[0] + if dp_rank > 0 and model_key not in no_sync_kname: + optim_state_dict.pop(key) + + if master_weights is not None: + for key in list(master_weights.keys()): + if dp_rank > 0 and key not in no_sync_kname: + master_weights.pop(key) diff --git a/paddlenlp/trainer/utils/ckpt_converter.py b/paddlenlp/trainer/utils/ckpt_converter.py index c397eb885b52..23f085e18f44 100644 --- a/paddlenlp/trainer/utils/ckpt_converter.py +++ b/paddlenlp/trainer/utils/ckpt_converter.py @@ -16,6 +16,7 @@ import os import re from functools import reduce +from typing import List, Union import paddle from paddle.distributed.checkpoint.load_state_dict import ( @@ -41,7 +42,13 @@ class CheckpointConverter: def __init__( - self, hybrid_parallel_ckpt_path, state_dict, parameter_to_structured_name, trainging_args=None, patch_dict=None + self, + hybrid_parallel_ckpt_path, + state_dict, + parameter_to_structured_name, + trainging_args=None, + patch_dict=None, + local_view_pattern: Union[List, bool] = None, ): self.use_dist = True if paddle.distributed.get_world_size() > 1 else False self.path = hybrid_parallel_ckpt_path @@ -85,6 +92,17 @@ def __init__( self.auto_parallel_state_dict[self.patch_dict[k]] = self.auto_parallel_state_dict[k] for k in del_keys: self.auto_parallel_state_dict.pop(k) + # solve the problem of inconsistent parameter names in moe automatic parallel mode. 
+ if hasattr(trainging_args, "moe_group") and trainging_args.moe_group: + if local_view_pattern is False: + self.local_view_pattern_list = None + else: + if isinstance(local_view_pattern, list): + self.local_view_pattern_list = local_view_pattern + else: + self.local_view_pattern_list = ["experts"] + else: + self.local_view_pattern_list = None flags = [ ["tp degree", self.tp_degree], @@ -497,6 +515,46 @@ def gen_metadata_and_prepare_source_state_dict(self): else: return self.gen_metadata_for_tp_sharded_tensor() + def rename_local_view_state_dict(self, state_dict, file_name): + """ + Rename the key for local views to the key for global views, and return the renamed `state_dict`. + """ + if self.local_view_pattern_list is None: + return state_dict + # case 1: moe_group is mp_group + if self.tp_degree > 1 and self.sharding_degree <= 1: + (tp_rank, pp_rank, sharding_rank) = self.get_distribution_rank_from_file_name(file_name) + expert_name_old2new = {} + for pattern in self.local_view_pattern_list: + expert_pattern = rf"({pattern}\.)(\d+)" + # extract all experts IDs + expert_ids = set() + for state_name in state_dict.keys(): + res = re.search(expert_pattern, state_name) + if res: + expert_ids.add(int(res.group(2))) + expert_num = len(expert_ids) + # construct old name to new name mapping + for state_name in state_dict.keys(): + res = re.search(expert_pattern, state_name) + if res: + new_expert_id = int(res.group(2)) % expert_num + tp_rank * expert_num + expert_name_old2new[state_name] = re.sub( + expert_pattern, f"{res.group(1)}{new_expert_id}", state_name + ) + # rename state_dict + renamed_state_dict = { + expert_name_old2new[state_name] + if state_name in expert_name_old2new + else state_name: state_dict[state_name] + for state_name in state_dict.keys() + } + + return renamed_state_dict + # TODO: add support for sharding + else: + return state_dict + def load_state_dict_and_rename(self): """ Parse the distributed information from the names of the checkpoint files and evenly parse out the distributed information for each weight/optimizer state @@ -741,11 +799,10 @@ def load_state_dict_and_rename(self): model_state_file_name = self.get_model_state_file_from(file_name) assert model_state_file_name is not None model_state_keys = global_file_to_state_dict_keys_mapping[model_state_file_name] - renamed_state_dict = self.rename_using_optimizer_state_order(model_state_keys, state_dict) - self.get_sharded_tensor_infos(file, renamed_state_dict, cur_rank_sharded_tensor_infos) - self.cur_rank_loaded_state_dict[file_name] = renamed_state_dict - else: - self.get_sharded_tensor_infos(file_name, state_dict, cur_rank_sharded_tensor_infos) + state_dict = self.rename_using_optimizer_state_order(model_state_keys, state_dict) + renamed_state_dict = self.rename_local_view_state_dict(state_dict, file_name) + self.get_sharded_tensor_infos(file_name, renamed_state_dict, cur_rank_sharded_tensor_infos) + self.cur_rank_loaded_state_dict[file_name] = renamed_state_dict else: for file, state_dict in self.cur_rank_loaded_state_dict.items(): # The rule for renaming is to change the master_weights name in the optimizer state to the model weight name, @@ -897,6 +954,9 @@ def rename(old_name, parameter_to_structured_name): return None for key, value in state_dict.items(): + # NOTE: Skip the parameters that are not initialized,which are not in the current rank. 
+ if value is None or (isinstance(value, paddle.Tensor) and not value._is_initialized()): + continue if key in parameter_to_structured_name.values(): new_name = key else: @@ -909,7 +969,9 @@ def rename(old_name, parameter_to_structured_name): def rename_using_optimizer_state_order(self, model_state_keys, optimizer_state_dict): name_mapping = {} suffix_bucket = {} - assert len(optimizer_state_dict) % len(model_state_keys) == 0 + # TODO: After adapting to sharding, remove the code below. + if self.is_sharding_stage3 or (self.sharding_degree > 1 and self.sharding_stage1_v == 2): + assert len(optimizer_state_dict) % len(model_state_keys) == 0 for suffix in OPTIMIZER_STATE_NAME_SUFFIX: suffix_bucket[suffix] = [] for opt_name, opt_value in optimizer_state_dict.items(): @@ -927,10 +989,27 @@ def rename_using_optimizer_state_order(self, model_state_keys, optimizer_state_d for suffix, old_names in suffix_bucket.items(): if len(old_names) == 0: continue - assert len(old_names) == len(model_state_keys) - for i in range(len(old_names)): - name_mapping[old_names[i]] = model_state_keys[i] + suffix - + # TODO: After adapting to sharding, remove the code below. + if self.is_sharding_stage3 or (self.sharding_degree > 1 and self.sharding_stage1_v == 2): + assert len(old_names) == len(model_state_keys) + + # NOTE: Handle the case where the number of master_weight elements is not equal to the number of model_state_keys. + if suffix != ".master_weight": + for i in range(len(old_names)): + name_mapping[old_names[i]] = model_state_keys[i] + suffix + else: + for i in range(len(old_names)): + param = old_names[i][:-14] + index = -1 + for idx, opt_name in enumerate(suffix_bucket[".moment1"]): + if param == opt_name[:-24]: + index = idx + break + if index >= 0: + name_mapping[old_names[i]] = model_state_keys[index] + suffix + else: + raise RuntimeError(f"Can't find {param} in optimizer state dict.") + # rename state dict renamed_state_dict = {} for k, v in optimizer_state_dict.items(): renamed_state_dict[name_mapping[k]] = v diff --git a/paddlenlp/transformers/__init__.py b/paddlenlp/transformers/__init__.py index 285db61fd572..0e99466f8e4b 100644 --- a/paddlenlp/transformers/__init__.py +++ b/paddlenlp/transformers/__init__.py @@ -56,102 +56,183 @@ from .bert.configuration import * # isort: split -from .gpt import * -from .roberta.modeling import * -from .roberta.tokenizer import * -from .roberta.configuration import * -from .electra.modeling import * -from .electra.tokenizer import * -from .electra.configuration import * from .albert.configuration import * from .albert.modeling import * from .albert.tokenizer import * -from .bit.modeling import * -from .bit.configuration import * -from .bit.image_processing import * +from .artist.configuration import * +from .artist.modeling import * +from .artist.tokenizer import * +from .auto.configuration import * +from .auto.image_processing import * +from .auto.modeling import * +from .auto.processing import * +from .auto.tokenizer import * +from .bart.configuration import * from .bart.modeling import * from .bart.tokenizer import * -from .bart.configuration import * from .bert_japanese.tokenizer import * -from .bigbird.modeling import * from .bigbird.configuration import * +from .bigbird.modeling import * from .bigbird.tokenizer import * +from .bit.configuration import * +from .bit.image_processing import * +from .bit.modeling import * +from .blenderbot.configuration import * from .blenderbot.modeling import * from .blenderbot.tokenizer import * -from 
.blenderbot.configuration import * +from .blenderbot_small.configuration import * from .blenderbot_small.modeling import * from .blenderbot_small.tokenizer import * -from .blenderbot_small.configuration import * +from .blip.configuration import * +from .blip.image_processing import * from .blip.modeling import * from .blip.modeling_text import * -from .blip.configuration import * from .blip.processing import * -from .blip.image_processing import * +from .blip_2.configuration import * +from .blip_2.modeling import * +from .blip_2.processing import * +from .bloom.configuration import * +from .bloom.modeling import * +from .bloom.tokenizer import * +from .bloom.tokenizer_fast import * +from .chatglm.configuration import * +from .chatglm.modeling import * +from .chatglm.tokenizer import * +from .chatglm_v2.configuration import * +from .chatglm_v2.modeling import * +from .chatglm_v2.modeling_pp import * +from .chatglm_v2.tokenizer import * from .chinesebert.configuration import * from .chinesebert.modeling import * from .chinesebert.tokenizer import * +from .chineseclip.configuration import * +from .chineseclip.feature_extraction import * +from .chineseclip.image_processing import * +from .chineseclip.modeling import * +from .chineseclip.processing import * +from .chineseclip.tokenizer import * +from .clap.configuration import * +from .clap.feature_extraction import * +from .clap.modeling import * +from .clap.processing import * +from .clip.configuration import * +from .clip.feature_extraction import * +from .clip.image_processing import * +from .clip.modeling import * +from .clip.processing import * +from .clip.tokenizer import * +from .clipseg.configuration import * +from .clipseg.image_processing import * +from .clipseg.modeling import * +from .clipseg.processing import * +from .codegen.configuration import * +from .codegen.modeling import * +from .codegen.tokenizer import * from .convbert.configuration import * from .convbert.modeling import * from .convbert.tokenizer import * +from .ctrl.configuration import * from .ctrl.modeling import * from .ctrl.tokenizer import * -from .ctrl.configuration import * -from .deepseek_v2.modeling import * -from .deepseek_v2.tokenizer_fast import * -from .deepseek_v2.configuration import * -from .dpt.modeling import * -from .dpt.configuration import * -from .dpt.image_processing import * +from .dallebart.configuration import * +from .dallebart.modeling import * +from .dallebart.tokenizer import * +from .deberta.configuration import * +from .deberta.modeling import * +from .deberta.tokenizer import * +from .deberta_v2.configuration import * +from .deberta_v2.modeling import * +from .deberta_v2.tokenizer import * +from .deepseek_v2 import * +from .deepseek_v3 import * from .distilbert.configuration import * from .distilbert.modeling import * from .distilbert.tokenizer import * +from .dpt.configuration import * +from .dpt.image_processing import * +from .dpt.modeling import * +from .electra.configuration import * +from .electra.modeling import * +from .electra.tokenizer import * from .ernie.configuration import * from .ernie.modeling import * from .ernie.tokenizer import * +from .ernie_code.configuration import * +from .ernie_code.modeling import * +from .ernie_code.tokenizer import * +from .ernie_ctm.configuration import * from .ernie_ctm.modeling import * from .ernie_ctm.tokenizer import * -from .ernie_ctm.configuration import * +from .ernie_doc.configuration import * from .ernie_doc.modeling import * from .ernie_doc.tokenizer import * -from 
.ernie_doc.configuration import * from .ernie_gen.modeling import ErnieForGeneration +from .ernie_gram.configuration import * from .ernie_gram.modeling import * from .ernie_gram.tokenizer import * -from .ernie_gram.configuration import * +from .ernie_layout.configuration import * from .ernie_layout.modeling import * from .ernie_layout.tokenizer import * -from .ernie_layout.configuration import * from .ernie_m.configuration import * from .ernie_m.modeling import * from .ernie_m.tokenizer import * +from .ernie_vil.configuration import * +from .ernie_vil.feature_extraction import * +from .ernie_vil.image_processing import * +from .ernie_vil.modeling import * +from .ernie_vil.processing import * +from .ernie_vil.tokenizer import * +from .fnet.configuration import * from .fnet.modeling import * from .fnet.tokenizer import * -from .fnet.configuration import * +from .funnel.configuration import * from .funnel.modeling import * from .funnel.tokenizer import * -from .funnel.configuration import * -from .llama import * +from .gau_alpha.configuration import * +from .gau_alpha.modeling import * +from .gau_alpha.tokenizer import * +from .gemma import * +from .glm.configuration import * +from .glm.modeling import * +from .glm.tokenizer import * +from .gpt import * +from .gptj.configuration import * +from .gptj.modeling import * +from .gptj.tokenizer import * +from .jamba.configuration import * +from .jamba.modeling import * +from .jamba.tokenizer import * from .layoutlm.configuration import * from .layoutlm.modeling import * from .layoutlm.tokenizer import * +from .layoutlmv2.configuration import * from .layoutlmv2.modeling import * from .layoutlmv2.tokenizer import * -from .layoutlmv2.configuration import * +from .layoutxlm.configuration import * from .layoutxlm.modeling import * from .layoutxlm.tokenizer import * -from .layoutxlm.configuration import * +from .llama import * +from .llm_embed.modeling import * +from .luke.configuration import * from .luke.modeling import * from .luke.tokenizer import * -from .luke.configuration import * +from .mamba.configuration import * +from .mamba.modeling import * +from .mamba.tokenizer import * +from .mbart.configuration import * from .mbart.modeling import * from .mbart.tokenizer import * -from .mbart.configuration import * +from .megatronbert.configuration import * from .megatronbert.modeling import * from .megatronbert.tokenizer import * -from .megatronbert.configuration import * -from .prophetnet.modeling import * -from .prophetnet.tokenizer import * -from .prophetnet.configuration import * +from .minigpt4.configuration import * +from .minigpt4.image_processing import * +from .minigpt4.modeling import * +from .minigpt4.processing import * +from .mistral.configuration import * +from .mistral.modeling import * +from .mixtral.configuration import * +from .mixtral.modeling import * from .mobilebert.configuration import * from .mobilebert.modeling import * from .mobilebert.tokenizer import * @@ -163,155 +244,78 @@ from .nezha.configuration import * from .nezha.modeling import * from .nezha.tokenizer import * +from .nv_embed.modeling import * +from .nystromformer.configuration import * +from .nystromformer.modeling import * +from .nystromformer.tokenizer import * +from .opt.configuration import * +from .opt.modeling import * +from .optimization import * +from .pegasus.configuration import * +from .pegasus.modeling import * +from .pegasus.tokenizer import * from .ppminilm.modeling import * from .ppminilm.tokenizer import * +from .prophetnet.configuration import * 
+from .prophetnet.modeling import * +from .prophetnet.tokenizer import * +from .qwen import * +from .qwen2 import * +from .qwen2_moe import * +from .reformer.configuration import * from .reformer.modeling import * from .reformer.tokenizer import * -from .reformer.configuration import * +from .rembert.configuration import * from .rembert.modeling import * from .rembert.tokenizer import * -from .rembert.configuration import * -from .roformer.modeling import * +from .roberta.configuration import * +from .roberta.modeling import * +from .roberta.tokenizer import * from .roformer.configuration import * +from .roformer.modeling import * from .roformer.tokenizer import * +from .roformerv2.configuration import * +from .roformerv2.modeling import * +from .roformerv2.tokenizer import * +from .rw.configuration import * +from .rw.modeling import * +from .rw.tokenizer import * from .semantic_search.modeling import * from .skep.configuration import * from .skep.modeling import * from .skep.tokenizer import * +from .speecht5.configuration import * +from .speecht5.feature_extraction import * +from .speecht5.modeling import * +from .speecht5.processing import * +from .speecht5.tokenizer import * +from .squeezebert.configuration import * from .squeezebert.modeling import * from .squeezebert.tokenizer import * -from .squeezebert.configuration import * +from .t5.configuration import * from .t5.modeling import * from .t5.tokenizer import * -from .t5.configuration import * from .tinybert.configuration import * from .tinybert.modeling import * from .tinybert.tokenizer import * from .transformer.modeling import * +from .unified_transformer.configuration import * from .unified_transformer.modeling import * from .unified_transformer.tokenizer import * -from .unified_transformer.configuration import * -from .ernie_code.tokenizer import * -from .ernie_code.modeling import * -from .ernie_code.configuration import * -from .ernie_vil.configuration import * -from .ernie_vil.modeling import * -from .ernie_vil.feature_extraction import * -from .ernie_vil.tokenizer import * -from .ernie_vil.processing import * -from .ernie_vil.image_processing import * +from .unimo.configuration import * from .unimo.modeling import * from .unimo.tokenizer import * -from .unimo.configuration import * -from .xlnet.modeling import * -from .xlnet.tokenizer import * -from .xlnet.configuration import * -from .xlm.modeling import * -from .xlm.tokenizer import * -from .xlm.configuration import * -from .gau_alpha.modeling import * -from .gau_alpha.tokenizer import * -from .gau_alpha.configuration import * -from .gemma import * -from .roformerv2.modeling import * -from .roformerv2.tokenizer import * -from .roformerv2.configuration import * -from .optimization import * -from .opt.configuration import * -from .opt.modeling import * -from .auto.modeling import * -from .auto.tokenizer import * -from .auto.processing import * -from .auto.image_processing import * -from .auto.configuration import * -from .codegen.modeling import * -from .codegen.tokenizer import * -from .codegen.configuration import * -from .artist.modeling import * -from .artist.tokenizer import * -from .artist.configuration import * -from .dallebart.modeling import * -from .dallebart.tokenizer import * -from .dallebart.configuration import * -from .clip.modeling import * -from .clip.configuration import * -from .clip.feature_extraction import * -from .clip.tokenizer import * -from .clip.processing import * -from .clip.image_processing import * -from .chineseclip.modeling import * -from 
.chineseclip.configuration import * -from .chineseclip.feature_extraction import * -from .chineseclip.processing import * -from .chineseclip.image_processing import * -from .chineseclip.tokenizer import * -from .gptj.modeling import * -from .gptj.tokenizer import * -from .gptj.configuration import * -from .pegasus.modeling import * -from .pegasus.tokenizer import * -from .pegasus.configuration import * -from .glm.configuration import * -from .glm.modeling import * -from .glm.tokenizer import * -from .nystromformer.configuration import * -from .nystromformer.modeling import * -from .nystromformer.tokenizer import * -from .bloom.configuration import * -from .bloom.modeling import * -from .bloom.tokenizer import * -from .bloom.tokenizer_fast import * -from .clipseg.configuration import * -from .clipseg.modeling import * -from .clipseg.processing import * -from .clipseg.image_processing import * -from .blip_2.modeling import * -from .blip_2.configuration import * -from .blip_2.processing import * -from .chatglm.configuration import * -from .chatglm.modeling import * -from .chatglm.tokenizer import * -from .chatglm_v2.configuration import * -from .chatglm_v2.modeling import * -from .chatglm_v2.modeling_pp import * -from .chatglm_v2.tokenizer import * -from .speecht5.configuration import * -from .speecht5.modeling import * -from .speecht5.tokenizer import * -from .speecht5.processing import * -from .speecht5.feature_extraction import * -from .minigpt4.modeling import * -from .minigpt4.configuration import * -from .minigpt4.processing import * -from .minigpt4.image_processing import * -from .clap.configuration import * -from .clap.feature_extraction import * -from .clap.modeling import * -from .clap.processing import * -from .visualglm.modeling import * from .visualglm.configuration import * -from .visualglm.processing import * from .visualglm.image_processing import * -from .rw.modeling import * -from .rw.configuration import * -from .rw.tokenizer import * -from .mistral.modeling import * -from .mistral.configuration import * -from .qwen import * -from .mixtral.modeling import * -from .mixtral.configuration import * -from .deberta.modeling import * -from .deberta.tokenizer import * -from .deberta.configuration import * -from .deberta_v2.modeling import * -from .deberta_v2.tokenizer import * -from .deberta_v2.configuration import * -from .qwen2 import * -from .qwen2_moe import * +from .visualglm.modeling import * +from .visualglm.processing import * +from .xlm.configuration import * +from .xlm.modeling import * +from .xlm.tokenizer import * +from .xlnet.configuration import * +from .xlnet.modeling import * +from .xlnet.tokenizer import * +from .xlm_roberta.modeling import * +from .xlm_roberta.tokenizer import * +from .xlm_roberta.configuration import * from .yuan import * -from .mamba.configuration import * -from .mamba.modeling import * -from .mamba.tokenizer import * -from .jamba.modeling import * -from .jamba.configuration import * -from .jamba.tokenizer import * diff --git a/paddlenlp/transformers/artist/configuration.py b/paddlenlp/transformers/artist/configuration.py index 8a0fd4c0e6bd..b12f99573cc5 100644 --- a/paddlenlp/transformers/artist/configuration.py +++ b/paddlenlp/transformers/artist/configuration.py @@ -14,7 +14,7 @@ """ MBart model configuration""" from __future__ import annotations -from paddlenlp.transformers import GPTConfig +from ..gpt.configuration import GPTConfig __all__ = ["ARTIST_PRETRAINED_INIT_CONFIGURATION", "ARTIST_PRETRAINED_RESOURCE_FILES_MAP", "ArtistConfig"] 
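With the alphabetical reordering above, `artist` is now imported before `gpt` in `paddlenlp/transformers/__init__.py`, so modules such as `artist/configuration.py` switch from the package-root import to the sibling submodule (`from ..gpt.configuration import GPTConfig`). A self-contained toy sketch of why the relative form is order-independent; the `toypkg` layout below is invented for illustration and is not PaddleNLP's real structure:

```python
# Toy package created on the fly; names are illustrative only.
import sys
import tempfile
import textwrap
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "toypkg" / "gpt").mkdir(parents=True)
(root / "toypkg" / "artist").mkdir()

(root / "toypkg" / "__init__.py").write_text(
    textwrap.dedent(
        """\
        # Alphabetical re-exports: `artist` now loads before `gpt`.
        from .artist.configuration import ArtistConfig
        from .gpt.configuration import GPTConfig
        """
    )
)
(root / "toypkg" / "gpt" / "__init__.py").write_text("from .configuration import GPTConfig\n")
(root / "toypkg" / "gpt" / "configuration.py").write_text("class GPTConfig:\n    model_type = 'gpt'\n")
(root / "toypkg" / "artist" / "__init__.py").write_text("")
(root / "toypkg" / "artist" / "configuration.py").write_text(
    textwrap.dedent(
        """\
        # `from toypkg import GPTConfig` would fail here: the package __init__ is still
        # executing its first line and has not bound GPTConfig yet. Importing the
        # sibling submodule directly does not depend on the re-export order.
        from ..gpt.configuration import GPTConfig


        class ArtistConfig(GPTConfig):
            model_type = "artist"
        """
    )
)

sys.path.insert(0, str(root))
import toypkg  # noqa: E402

print(toypkg.ArtistConfig.model_type)  # artist
```

The same reasoning presumably applies to the blenderbot, codegen, and dallebart tokenizers further below, which now import `GPTTokenizer` from `..gpt.tokenizer` instead of the package root.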
diff --git a/paddlenlp/transformers/auto/configuration.py b/paddlenlp/transformers/auto/configuration.py index f2058a5ec389..d800252a7a5e 100644 --- a/paddlenlp/transformers/auto/configuration.py +++ b/paddlenlp/transformers/auto/configuration.py @@ -56,6 +56,8 @@ ("dallebart", "DalleBartConfig"), ("deberta", "DebertaConfig"), ("debertav2", "DebertaV2Config"), + ("deepseek_v2", "DeepseekV2Config"), + ("deepseek_v3", "DeepseekV3Config"), ("distilbert", "DistilBertConfig"), ("dpt", "DPTConfig"), ("electra", "ElectraConfig"), @@ -113,6 +115,7 @@ ("unimo", "UNIMOConfig"), ("visualglm", "VisualGLMConfig"), ("xlm", "XLMConfig"), + ("xlm-roberta", "XLMRobertaConfig"), ("xlnet", "XLNetConfig"), ("yuan", "YuanConfig"), ] @@ -145,6 +148,8 @@ ("dallebart", "DalleBart"), ("deberta", "Deberta"), ("debertav2", "DebertaV2"), + ("deepseek_v2", "DeepseekV2"), + ("deepseek_v3", "DeepseekV3"), ("distilbert", "DistilBert"), ("dpt", "DPT"), ("electra", "Electra"), @@ -202,6 +207,7 @@ ("unimo", "UNIMO"), ("visualglm", "VisualGLM"), ("xlm", "XLM"), + ("xlm-roberta", "XLMRoberta"), ("xlnet", "XLNet"), ("yuan", "Yuan"), ] diff --git a/paddlenlp/transformers/auto/factory.py b/paddlenlp/transformers/auto/factory.py index 960ed741c655..e888ad08c0a9 100644 --- a/paddlenlp/transformers/auto/factory.py +++ b/paddlenlp/transformers/auto/factory.py @@ -79,7 +79,7 @@ def __getitem__(self, key): def _load_attr_from_module(self, model_type, attr): module_name = model_type_to_module_name(model_type) if module_name not in self._modules: - if "Tokenizer" in model_type: + if any(["Tokenizer" in name for name in [model_type, attr]]): try: self._modules[module_name] = importlib.import_module( f".{module_name}.tokenizer", "paddlenlp.transformers" @@ -87,7 +87,7 @@ def _load_attr_from_module(self, model_type, attr): except ImportError: pass if module_name not in self._modules: - if "Config" in model_type: + if any(["Config" in name for name in [model_type, attr]]): try: self._modules[module_name] = importlib.import_module( f".{module_name}.configuration", "paddlenlp.transformers" diff --git a/paddlenlp/transformers/auto/modeling.py b/paddlenlp/transformers/auto/modeling.py index 8b94d9f4b53d..38e773f56bb4 100644 --- a/paddlenlp/transformers/auto/modeling.py +++ b/paddlenlp/transformers/auto/modeling.py @@ -57,6 +57,8 @@ ("CTRL", "ctrl"), ("DistilBert", "distilbert"), ("DalleBart", "dallebart"), + ("DeepseekV2", "deepseek_v2"), + ("DeepseekV3", "deepseek_v3"), ("Electra", "electra"), ("ErnieViL", "ernie_vil"), ("ErnieCtm", "ernie_ctm"), @@ -94,6 +96,7 @@ ("UNIMO", "unimo"), ("XLNet", "xlnet"), ("XLM", "xlm"), + ("XLMRoberta", "xlm_roberta"), ("GPT", "gpt"), ("GLM", "glm"), ("MT5", "mt5"), @@ -819,27 +822,38 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): tensor_parallel_degree = kwargs.pop("tensor_parallel_degree", 1) tensor_parallel_rank = kwargs.pop("tensor_parallel_rank", 0) model_arg = kwargs.pop("model_args", None) + spec_model_type = kwargs.pop("spec_model_type", "None") + spec_flag = "" # Check whether the model_type is img2txt in inference mode - if model_arg.model_type is not None and predictor_args.mode == "dynamic": - model_name = MODEL_FOR_CAUSAL_LM_INFERENCE_MAPPING_NAMES[model_arg.model_type] - predictor_args.block_attn = 0 - if model_name is None: - raise ValueError( - f"Model type {model_arg.model_type} is not supported for {config.architectures[0]} inference." 
- ) + if spec_model_type == "eagle": + spec_flag = "Eagle" + attn_type = "Block" + model_name = f"{config.architectures[0]}{attn_type}" + elif spec_model_type == "mtp": + spec_flag = "MTP" + attn_type = "Block" + model_name = f"{config.architectures[0]}{attn_type}" else: - # Check whether the model use block attention - if predictor_args.block_attn or predictor_args.speculate_method is not None: - attn_type = "Block" + if model_arg.model_type is not None and predictor_args.mode == "dynamic": + model_name = MODEL_FOR_CAUSAL_LM_INFERENCE_MAPPING_NAMES[model_arg.model_type] + predictor_args.block_attn = 0 + if model_name is None: + raise ValueError( + f"Model type {model_arg.model_type} is not supported for {config.architectures[0]} inference." + ) else: - attn_type = "" - model_name = f"{config.architectures[0]}{attn_type}" + # Check whether the model use block attention + if predictor_args.block_attn or predictor_args.speculate_method is not None: + attn_type = "Block" + else: + attn_type = "" + model_name = f"{config.architectures[0]}{attn_type}" # Import the InferenceModel import_class = importlib.import_module(f"paddlenlp.experimental.transformers.{config.model_type}.modeling") - model_class_name = f"{model_name}InferenceModel" + model_class_name = f"{spec_flag}{model_name}InferenceModel" model_class = getattr(import_class, model_class_name) # It may return a new model class, like LlamaForCausalLMAvxInferenceModel diff --git a/paddlenlp/transformers/auto/tokenizer.py b/paddlenlp/transformers/auto/tokenizer.py index a53e36c4935a..93d2ea633e06 100644 --- a/paddlenlp/transformers/auto/tokenizer.py +++ b/paddlenlp/transformers/auto/tokenizer.py @@ -13,6 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. 
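The branching added above picks the experimental inference class by name: a speculative-decoding prefix (`Eagle`/`MTP`), the architecture from `config.architectures[0]`, an optional `Block` attention suffix, and the fixed `InferenceModel` suffix. A standalone sketch of that naming scheme; the helper below is for illustration only and is not a PaddleNLP API:

```python
# Mirrors the string assembly in AutoInferenceModel.from_pretrained shown above.
def build_inference_class_name(architecture: str, spec_model_type: str = "None",
                               block_attn: bool = False, speculate: bool = False) -> str:
    if spec_model_type in ("eagle", "mtp"):
        # Speculative variants always use block attention and gain a prefix.
        spec_flag = "Eagle" if spec_model_type == "eagle" else "MTP"
        attn_type = "Block"
    else:
        spec_flag = ""
        attn_type = "Block" if (block_attn or speculate) else ""
    return f"{spec_flag}{architecture}{attn_type}InferenceModel"


print(build_inference_class_name("LlamaForCausalLM"))
# LlamaForCausalLMInferenceModel
print(build_inference_class_name("LlamaForCausalLM", block_attn=True))
# LlamaForCausalLMBlockInferenceModel
print(build_inference_class_name("DeepseekV3ForCausalLM", spec_model_type="mtp"))
# MTPDeepseekV3ForCausalLMBlockInferenceModel
```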
import importlib +import inspect import io import json import os @@ -56,22 +57,38 @@ ("blenderbot", "BlenderbotTokenizer"), ( "bloom", - ("BloomTokenizer", "BloomTokenizerFast" if is_tokenizers_available() else None), + ( + "BloomTokenizer", + "BloomTokenizerFast" if is_tokenizers_available() else None, + ), ), ("clip", "CLIPTokenizer"), ("codegen", "CodeGenTokenizer"), ("convbert", "ConvBertTokenizer"), ("ctrl", "CTRLTokenizer"), ("distilbert", "DistilBertTokenizer"), + ( + "deepseek_v2", + "DeepseekTokenizerFast" if is_tokenizers_available() else None, + ), ("electra", "ElectraTokenizer"), ( "ernie", - ("ErnieTokenizer", "ErnieTokenizerFast" if is_tokenizers_available() else None), + ( + "ErnieTokenizer", + "ErnieTokenizerFast" if is_tokenizers_available() else None, + ), ), ("ernie_m", "ErnieMTokenizer"), ("fnet", "FNetTokenizer"), ("funnel", "FunnelTokenizer"), - ("gemma", ("GemmaTokenizer", "GemmaTokenizerFast" if is_tokenizers_available() else None)), + ( + "gemma", + ( + "GemmaTokenizer", + "GemmaTokenizerFast" if is_tokenizers_available() else None, + ), + ), ("jamba", "JambaTokenizer"), ("layoutlm", "LayoutLMTokenizer"), ("layoutlmv2", "LayoutLMv2Tokenizer"), @@ -99,6 +116,7 @@ ("squeezebert", "SqueezeBertTokenizer"), ("t5", "T5Tokenizer"), ("xlm", "XLMTokenizer"), + ("xlm_roberta", "XLMRobertaTokenizer"), ("xlnet", "XLNetTokenizer"), ("bert_japanese", "BertJapaneseTokenizer"), ("bigbird", "BigBirdTokenizer"), @@ -122,7 +140,10 @@ ("unimo", "UNIMOTokenizer"), ( "gpt", - (("GPTTokenizer", "GPTChineseTokenizer"), "GPTTokenizerFast" if is_tokenizers_available() else None), + ( + ("GPTTokenizer", "GPTChineseTokenizer"), + "GPTTokenizerFast" if is_tokenizers_available() else None, + ), ), ("gau_alpha", "GAUAlphaTokenizer"), ("artist", "ArtistTokenizer"), @@ -130,7 +151,13 @@ ("ernie_vil", "ErnieViLTokenizer"), ("glm", "GLMGPT2Tokenizer"), ("qwen", "QWenTokenizer"), - ("qwen2", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)), + ( + "qwen2", + ( + "Qwen2Tokenizer", + "Qwen2TokenizerFast" if is_tokenizers_available() else None, + ), + ), ("yuan", "YuanTokenizer"), ] ) @@ -156,7 +183,10 @@ def get_configurations(): for class_name, values in TOKENIZER_MAPPING_NAMES.items(): all_tokenizers = get_mapping_tokenizers(values, with_fast=False) for key in all_tokenizers: - import_class = importlib.import_module(f"paddlenlp.transformers.{class_name}.tokenizer") + try: + import_class = importlib.import_module(f"paddlenlp.transformers.{class_name}.tokenizer") + except ImportError: + import_class = importlib.import_module(f"paddlenlp.transformers.{class_name}.tokenizer_fast") tokenizer_name = getattr(import_class, key) name = tuple(tokenizer_name.pretrained_init_configuration.keys()) MAPPING_NAMES[name] = tokenizer_name @@ -460,7 +490,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) else: if tokenizer_class_py is not None: - if isinstance(tokenizer_class_py, str): + if inspect.isclass(tokenizer_class_py): return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) else: # Use the first tokenizer class in the list @@ -481,7 +511,12 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): "- or the correct path to a directory containing relevant tokenizer files.\n" ) - def register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None, exist_ok=False): + def 
register( + config_class, + slow_tokenizer_class=None, + fast_tokenizer_class=None, + exist_ok=False, + ): """ Register a new tokenizer in this mapping. @@ -522,4 +557,8 @@ def register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None, if fast_tokenizer_class is None: fast_tokenizer_class = existing_fast - TOKENIZER_MAPPING.register(config_class, (slow_tokenizer_class, fast_tokenizer_class), exist_ok=exist_ok) + TOKENIZER_MAPPING.register( + config_class, + (slow_tokenizer_class, fast_tokenizer_class), + exist_ok=exist_ok, + ) diff --git a/paddlenlp/transformers/blenderbot/tokenizer.py b/paddlenlp/transformers/blenderbot/tokenizer.py index 20748ad4307c..935cda42ab9c 100644 --- a/paddlenlp/transformers/blenderbot/tokenizer.py +++ b/paddlenlp/transformers/blenderbot/tokenizer.py @@ -16,7 +16,8 @@ from paddle.utils import try_import -from .. import AddedToken, GPTTokenizer +from ..gpt.tokenizer import GPTTokenizer +from ..tokenizer_utils import AddedToken __all__ = ["BlenderbotTokenizer"] diff --git a/paddlenlp/transformers/chatglm/modeling.py b/paddlenlp/transformers/chatglm/modeling.py index 5e3d8e493896..708b068f5fd8 100755 --- a/paddlenlp/transformers/chatglm/modeling.py +++ b/paddlenlp/transformers/chatglm/modeling.py @@ -132,14 +132,7 @@ def forward(self, position_ids): cos_cached = emb.cos().unsqueeze(1).cast(self.default_dtype) sin_cached = emb.sin().unsqueeze(1).cast(self.default_dtype) - if hasattr(paddle.framework, "_no_check_dy2st_diff"): - # TODO(daisiming): _no_check_dy2st_diff is used to turn off the checking of behavior - # inconsistency between dynamic graph and static graph. _no_check_dy2st_diff should be - # removed after static graphs support inplace and stride. - with paddle.framework._no_check_dy2st_diff(): - self.cos_cached, self.sin_cached = cos_cached, sin_cached - else: - self.cos_cached, self.sin_cached = cos_cached, sin_cached + self.cos_cached, self.sin_cached = cos_cached, sin_cached cos, sin = self.cos_cached[:seq_len, ...], self.sin_cached[:seq_len, ...] if self.position_encoding_2d: diff --git a/paddlenlp/transformers/codegen/tokenizer.py b/paddlenlp/transformers/codegen/tokenizer.py index 2bc72cfcc282..14769b274048 100644 --- a/paddlenlp/transformers/codegen/tokenizer.py +++ b/paddlenlp/transformers/codegen/tokenizer.py @@ -14,7 +14,8 @@ # limitations under the License. from paddle.utils import try_import -from .. 
import GPTTokenizer + +from ..gpt.tokenizer import GPTTokenizer __all__ = ["CodeGenTokenizer"] diff --git a/paddlenlp/transformers/configuration_utils.py b/paddlenlp/transformers/configuration_utils.py index 5ecd7f907db6..40db64a003bf 100644 --- a/paddlenlp/transformers/configuration_utils.py +++ b/paddlenlp/transformers/configuration_utils.py @@ -235,6 +235,7 @@ class LlmMetaConfig: ("use_fused_rope", bool, False, "Enable rope fusion or not."), ("use_fused_linear", bool, False, "GPT3 model, use fused linear layer"), ("use_fused_dropout_add", bool, False, "GPT3 model, use fused `dropout + residual add` op."), + ("use_fused_linear_cross_entropy", bool, False, "use fused `linear + cross_entropy` fuse op."), ] hybrid_parallel_attributes = [ diff --git a/paddlenlp/transformers/contrastive_loss.py b/paddlenlp/transformers/contrastive_loss.py index 3e132b6f454a..632198f74ef6 100644 --- a/paddlenlp/transformers/contrastive_loss.py +++ b/paddlenlp/transformers/contrastive_loss.py @@ -52,10 +52,10 @@ def forward(self, q_reps, p_reps): if len(self.embedding_matryoshka_dims) > 0: loss = 0.0 for dim in self.embedding_matryoshka_dims: - reduced_q_reps = q_reps[:, :dim] + reduced_q_reps = q_reps[:, :dim].astype("float32") reduced_q_reps = nn.functional.normalize(reduced_q_reps, axis=-1) - reduced_p_reps = p_reps[:, :dim] + reduced_p_reps = p_reps[:, :dim].astype("float32") reduced_p_reps = nn.functional.normalize(reduced_p_reps, axis=-1) dim_loss = self.loss_fn(reduced_q_reps, reduced_p_reps) diff --git a/paddlenlp/transformers/conversion_utils.py b/paddlenlp/transformers/conversion_utils.py index f457bf28e856..e95d94f8a3ed 100644 --- a/paddlenlp/transformers/conversion_utils.py +++ b/paddlenlp/transformers/conversion_utils.py @@ -1312,7 +1312,8 @@ def _resolve_prefix_keys(state_keys_base, state_keys_real, ignore_error=False): # state_keys_map base to real state_keys_map = {} - state_keys_base = set(state_keys_base) + # sorted by length,match from long to short for A.key B.key ... + state_keys_base = sorted(state_keys_base, key=lambda x: len(x), reverse=True) state_keys_real = set(state_keys_real) for key in state_keys_base: diff --git a/paddlenlp/transformers/dallebart/tokenizer.py b/paddlenlp/transformers/dallebart/tokenizer.py index 13335b6bc646..fe83feb78a00 100644 --- a/paddlenlp/transformers/dallebart/tokenizer.py +++ b/paddlenlp/transformers/dallebart/tokenizer.py @@ -23,7 +23,8 @@ from paddle.utils import try_import -from ...transformers import AddedToken, GPTTokenizer +from ..gpt.tokenizer import GPTTokenizer +from ..tokenizer_utils import AddedToken __all__ = ["DalleBartTokenizer"] diff --git a/paddlenlp/transformers/deepseek_v2/__init__.py b/paddlenlp/transformers/deepseek_v2/__init__.py index 5144d20699db..f68a341b4fbc 100644 --- a/paddlenlp/transformers/deepseek_v2/__init__.py +++ b/paddlenlp/transformers/deepseek_v2/__init__.py @@ -14,4 +14,6 @@ from .configuration import * from .modeling import * +from .modeling_auto import * +from .modeling_pp import * from .tokenizer_fast import * diff --git a/paddlenlp/transformers/deepseek_v2/configuration.py b/paddlenlp/transformers/deepseek_v2/configuration.py index 90aa9481c704..221e732b3f47 100644 --- a/paddlenlp/transformers/deepseek_v2/configuration.py +++ b/paddlenlp/transformers/deepseek_v2/configuration.py @@ -42,6 +42,8 @@ class DeepseekV2Config(PretrainedConfig): Dimension of the MoE representations. num_hidden_layers (`int`, *optional*, defaults to 32): Number of hidden layers in the Transformer decoder. 
+ num_nextn_predict_layers (`int`, *optional*, defaults to 1): + Number of nextn predict layers in the DeepSeekV3 Model. num_attention_heads (`int`, *optional*, defaults to 32): Number of attention heads for each attention layer in the Transformer decoder. n_shared_experts (`int`, *optional*, defaults to None): @@ -114,6 +116,8 @@ class DeepseekV2Config(PretrainedConfig): Whether to use a bias in the query, key, value and output projection layers during self-attention. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. + speculate_model_type (`str`, defaults to `None`, *optional*, defaults to `False`): + The model type for speculate. Support ['eagle', 'mtp'] Now. ```python >>> from paddlenlp.transformers import DeepseekV2Model, DeepseekV2Config @@ -135,6 +139,7 @@ def __init__( intermediate_size=11008, moe_intermediate_size=1407, num_hidden_layers=30, + num_nextn_predict_layers=1, num_attention_heads=32, num_key_value_heads=32, n_shared_experts=None, @@ -171,6 +176,7 @@ def __init__( rope_scaling=None, attention_bias=False, attention_dropout=0.0, + speculate_model_type=False, **kwargs, ): self.vocab_size = vocab_size @@ -180,6 +186,7 @@ def __init__( self.intermediate_size = intermediate_size self.moe_intermediate_size = moe_intermediate_size self.num_hidden_layers = num_hidden_layers + self.num_nextn_predict_layers = num_nextn_predict_layers self.num_attention_heads = num_attention_heads self.n_shared_experts = n_shared_experts self.n_routed_experts = n_routed_experts @@ -214,6 +221,7 @@ def __init__( self.rope_scaling = rope_scaling self.attention_bias = attention_bias self.attention_dropout = attention_dropout + self.speculate_model_type = speculate_model_type super().__init__( pad_token_id=pad_token_id, diff --git a/paddlenlp/transformers/deepseek_v2/mfu_utils.py b/paddlenlp/transformers/deepseek_v2/mfu_utils.py new file mode 100644 index 000000000000..3574ec16e5f8 --- /dev/null +++ b/paddlenlp/transformers/deepseek_v2/mfu_utils.py @@ -0,0 +1,206 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
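The two `DeepseekV2Config` fields added above are what the rest of this patch keys off for multi-token prediction and speculative decoding. A usage sketch before the new `mfu_utils.py` helper continues below; the field names come from the diff, while the concrete values are illustrative:

```python
# Illustrative values only; `DeepseekV2Config` is importable per the docstring example above.
from paddlenlp.transformers import DeepseekV2Config

config = DeepseekV2Config(
    num_hidden_layers=30,
    num_nextn_predict_layers=1,   # next-n (MTP) prediction layers, used by DeepSeek-V3
    speculate_model_type="mtp",   # "eagle" or "mtp"; defaults to False (disabled)
)
print(config.num_nextn_predict_layers, config.speculate_model_type)
```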
+ +# https://github.com/GHGmc2/deepseek-projection/blob/af62687fba22e3362469a343d048a1235047388c/projection/deepseek_proj.py#L1 + + +class DeepSeekProjection: + def __init__(self, model_config, train_options=None): + self._model_config = model_config + self._train_options = train_options + + # for internal usage + ( + self._vocab_size, + self._max_seq_len, + self._dim, + self._intermediate_size, + self._moe_intermediate_size, + self._n_layers, + self._n_dense_layers, + self._n_heads, + self._qk_nope_head_dim, + self._q_lora_rank, + self._kv_lora_rank, + self._qk_rope_head_dim, + self._n_experts_shared, + self._n_experts_routed, + self._router_top_k, + ) = ( + model_config.vocab_size, + model_config.seq_length, + model_config.hidden_size, + model_config.intermediate_size, + model_config.moe_intermediate_size, + model_config.num_hidden_layers, # + model_config.first_k_dense_replace, # + model_config.num_attention_heads, # + model_config.qk_nope_head_dim, # + model_config.q_lora_rank, # + model_config.kv_lora_rank, # + model_config.qk_rope_head_dim, # + model_config.n_shared_experts, # + model_config.n_routed_experts, # + model_config.num_experts_per_tok, + ) + + if train_options is not None: + self._causal_mask = train_options.causal_mask + self._fused_atten = train_options.fused_atten + # self._bytes_of_dtype = train_options.use_dtype.bytes_of_dtype() + else: + self._causal_mask = True + self._fused_atten = True + + def get_num_params(self, include_embedding: bool = True) -> tuple[int, int]: + num_params_embedding = 0 + if include_embedding: + num_params_embedding = ( + self._vocab_size + * self._dim # Word Token Embedding(WTE) + # + self._max_seq_len * self._dim # Word Position Embedding (WPE) + ) + + # MLA projection for Q, K and V + if self._q_lora_rank is None: + num_params_proj_q = self._dim * self._n_heads * (self._qk_nope_head_dim + self._qk_rope_head_dim) + else: + num_params_down_q = self._dim * self._q_lora_rank + num_params_up_q = self._q_lora_rank * self._n_heads * self._qk_nope_head_dim + num_params_rope_q = self._q_lora_rank * self._n_heads * self._qk_rope_head_dim + num_params_proj_q = num_params_down_q + num_params_up_q + num_params_rope_q + num_params_down_kv = self._dim * self._kv_lora_rank + num_params_up_k = self._kv_lora_rank * self._n_heads * self._qk_nope_head_dim + num_params_rope_k = self._dim * self._qk_rope_head_dim + num_params_up_v = self._kv_lora_rank * self._n_heads * self._qk_nope_head_dim + # out proj + num_params_o = self._n_heads * self._qk_nope_head_dim * self._dim # v_head_dim = qk_nope_head_dim + num_params_atten = ( + num_params_proj_q + + num_params_down_kv + + num_params_up_k + + num_params_rope_k + + num_params_up_v + + num_params_o + ) + + num_params_ffn = self._dim * self._moe_intermediate_size * 3 + num_params_ffn_dense = self._dim * self._intermediate_size * 3 + # MoE, the sparse param count + num_params_gate = 0 + n_experts = self._n_experts_routed + self._n_experts_shared + num_params_ffn_activated = num_params_ffn + if n_experts > 1: + num_params_gate = self._dim * self._n_experts_routed + num_params_ffn *= n_experts + num_params_ffn_activated *= self._n_experts_shared + self._router_top_k + + num_params_norm = 2 * self._dim + # additional RMSNorm after the compressed latent vectors + num_params_norm += self._kv_lora_rank + 0 if self._q_lora_rank is None else self._q_lora_rank + + num_params_final_norm = self._dim + + num_params = ( + num_params_embedding + + self._n_dense_layers * (num_params_atten + num_params_norm + num_params_ffn_dense) + 
+ (self._n_layers - self._n_dense_layers) + * (num_params_atten + num_params_norm + num_params_ffn + num_params_gate) + + num_params_final_norm + ) + + num_params_activated = ( + num_params_embedding + + self._n_dense_layers * (num_params_atten + num_params_norm + num_params_ffn_dense) + + (self._n_layers - self._n_dense_layers) + * (num_params_atten + num_params_norm + num_params_ffn_activated + num_params_gate) + + num_params_final_norm + ) + return num_params, num_params_activated + + def get_num_flop_fwd(self, batch_size: int) -> int: + # MLA projection of Q, K and V + if self._q_lora_rank is None: + num_flop_proj_q = ( + 2 + * batch_size + * self._max_seq_len + * self._dim + * self._n_heads + * (self._qk_nope_head_dim + self._qk_rope_head_dim) + ) + else: + num_flop_down_q = 2 * batch_size * self._max_seq_len * self._dim * self._q_lora_rank + num_flop_up_q = ( + 2 * batch_size * self._max_seq_len * self._q_lora_rank * self._qk_nope_head_dim * self._n_heads + ) + num_flop_rope_q = ( + 2 * batch_size * self._max_seq_len * self._q_lora_rank * self._qk_rope_head_dim * self._n_heads + ) + num_flop_proj_q = num_flop_down_q + num_flop_up_q + num_flop_rope_q + num_flop_down_k = 2 * batch_size * self._max_seq_len * self._dim * self._kv_lora_rank + num_flop_up_k = ( + 2 * batch_size * self._max_seq_len * self._kv_lora_rank * self._qk_nope_head_dim * self._n_heads + ) + num_flop_rope_k = 2 * batch_size * self._max_seq_len * self._dim * self._qk_rope_head_dim + num_flop_proj_k = num_flop_down_k + num_flop_up_k + num_flop_rope_k + num_flop_proj_v = 2 * batch_size * self._max_seq_len * self._qk_nope_head_dim * self._n_heads * self._dim + num_flop_qkv_proj = num_flop_proj_q + num_flop_proj_k + num_flop_proj_v + + # see the discussion: https://github.com/pytorch/torchtitan/pull/280 + num_flop_sdpa = 4 * batch_size * self._max_seq_len**2 * self._dim + num_flop_sdpa //= 2 if self._causal_mask else 1 + num_flop_out_proj = 2 * batch_size * self._max_seq_len * self._dim**2 + num_flop_fwd_atten = num_flop_qkv_proj + num_flop_sdpa + num_flop_out_proj + + num_flop_fwd_ffn = (2 * batch_size * self._max_seq_len * self._dim * self._moe_intermediate_size) * 3 + num_flop_fwd_ffn_dense = (2 * batch_size * self._max_seq_len * self._dim * self._intermediate_size) * 3 + # MoE, the active param + n_experts = self._n_experts_shared + self._n_experts_routed + if n_experts > 1: + num_flop_fwd_ffn *= self._n_experts_shared + self._router_top_k # num of activated experts + num_flop_gate = 2 * batch_size * self._max_seq_len * self._dim * self._n_experts_routed + num_flop_fwd_ffn += num_flop_gate + + num_flop_fwd_logits = 2 * batch_size * self._max_seq_len * self._dim * self._vocab_size + + return ( + self._n_dense_layers * (num_flop_fwd_atten + num_flop_fwd_ffn_dense) + + (self._n_layers - self._n_dense_layers) * (num_flop_fwd_atten + num_flop_fwd_ffn) + + num_flop_fwd_logits + ) + + def get_num_flop_per_token(self): + batch_size = 1 # dummy + num_flop_per_token = self.get_num_flop_fwd(batch_size) / batch_size / self._max_seq_len * 3 # bwd = 2 * fwd + print("num_flop_per_token:\t", num_flop_per_token) + return num_flop_per_token + + def _get_num_flop_QK_fwd(self, batch_size: int) -> int: + """ + Forward FLOPs for QK^T of all chunked transformer blocks, which is re-computed on backward by Flash attention + """ + num_flop_qk = self._n_layers * (2 * batch_size * self._max_seq_len**2 * self._dim) + num_flop_qk //= 2 if self._causal_mask else 1 + return num_flop_qk + + def get_num_flop_bwd(self, batch_size: int) -> int: + 
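+        # Rough accounting, mirroring the forward projection above: the backward pass
+        # costs about 2x the forward FLOPs (gradients w.r.t. activations and weights),
+        # and with fused/flash attention the QK^T product is recomputed during the
+        # backward pass, so its forward FLOPs are counted once more.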
num_flop_fwd = self.get_num_flop_fwd(batch_size) + num_flop_bwd = num_flop_fwd * 2 + # Flash-attention uses re-computation for QK^T + if self._fused_atten: + qk_fwd_flop = self._get_num_flop_QK_fwd(batch_size) + num_flop_bwd += qk_fwd_flop + + return num_flop_bwd diff --git a/paddlenlp/transformers/deepseek_v2/modeling.py b/paddlenlp/transformers/deepseek_v2/modeling.py index 5c5c0b43b95a..ee58e1b638d9 100644 --- a/paddlenlp/transformers/deepseek_v2/modeling.py +++ b/paddlenlp/transformers/deepseek_v2/modeling.py @@ -17,7 +17,8 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" Paddle DeepSeek model.""" +"""Paddle DeepSeek model.""" + from __future__ import annotations import math @@ -60,14 +61,28 @@ from ..activations import ACT2FN from ..conversion_utils import StateDictNameMapping, init_name_mappings from ..linear_utils import Linear +from ..llama import fusion_ops +from ..llama.modeling import get_use_casual_mask from ..model_outputs import ( BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast, ) from ..model_utils import PretrainedModel, register_base_model +from ..moe_gate import PretrainedMoEGate +from ..moe_layer import MoELayer +from ..utils import device_guard from .configuration import DeepseekV2Config +__all__ = [ + "DeepseekV2LMHead", + "DeepseekV2PretrainingCriterion", + "DeepseekV2ForCausalLM", + "DeepseekV2ForSequenceClassification", + "DeepseekV2Model", + "DeepseekV2PretrainedModel", +] + def get_triangle_upper_mask(x, mask=None): if mask is not None: @@ -148,34 +163,49 @@ def scaled_dot_product_attention( value_states, attention_mask, output_attentions, + attn_mask_startend_row_indices=None, softmax_scale=1.0, training=True, sequence_parallel=False, ): bsz, q_len, num_heads, head_dim = query_states.shape - _, kv_seq_len, _, v_head_dim = value_states.shape + _, kv_seq_len, v_num_heads, v_head_dim = value_states.shape if config.use_flash_attention and flash_attention: # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim] # Torch Flash Attention input [ bz, nhead, seqlen, head_dim] - attn_output = F.scaled_dot_product_attention( + # Note: Flash Attention does not support softmax_scale, so we need to scale the query_states + q_head_dim = query_states.shape[-1] + softmax_scale = softmax_scale * (q_head_dim**0.5) + query_states = query_states * softmax_scale + value_padding = paddle.zeros( + [bsz, kv_seq_len, v_num_heads, head_dim - v_head_dim], + dtype=value_states.dtype, + ) + value_states = paddle.concat([value_states, value_padding], axis=-1) + + outputs = fusion_ops.fusion_flash_attention( query_states, + config, key_states, value_states, - attn_mask=attention_mask, - is_causal=attention_mask is None, - dropout_p=config.attention_dropout if training else 0.0, - training=training, + attention_mask, + output_attentions, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + sequence_parallel=sequence_parallel, ) - attn_output *= (head_dim ** (0.5)) * softmax_scale - attn_weights = None - if sequence_parallel: - attn_output = attn_output.reshape([bsz * q_len, v_head_dim * num_heads]) + if isinstance(outputs, tuple): + outputs[0] = outputs[0].reshape([bsz, q_len, v_num_heads, head_dim]) + outputs[0] = outputs[0][..., :v_head_dim] + outputs[0] = outputs[0].reshape([bsz, q_len, -1]) else: - attn_output = attn_output.reshape([bsz, q_len, v_head_dim * num_heads]) - return (attn_output, attn_weights) 
if output_attentions else attn_output + outputs = outputs.reshape([bsz, q_len, v_num_heads, head_dim]) + outputs = outputs[..., :v_head_dim] + outputs = outputs.reshape([bsz, q_len, -1]) + return outputs + else: # [ bz, seqlen, nhead, head_dim] -> [bs, nhead, seq_len, head_dim] query_states = paddle.transpose(query_states, [0, 2, 1, 3]) @@ -221,7 +251,7 @@ def scaled_dot_product_attention( def masked_fill(x, mask, value): y = paddle.full(x.shape, value, x.dtype) - return paddle.where(mask, y, x) + return paddle.where(mask.to("bool"), y, x) def is_casual_mask(attention_mask): @@ -293,6 +323,18 @@ def __init__(self, config: DeepseekV2Config, hidden_size=None, eps=1e-6, use_seq mark_as_sequence_parallel_parameter(self.weight) def forward(self, hidden_states): + if self.config.use_fused_rms_norm and get_env_device() == "xpu": + if self.weight.dtype != hidden_states.dtype: + hidden_states = paddle.cast(hidden_states, self.weight.dtype) + try: + import paddle_xpu_nn # noqa: F821 + + return paddle_xpu_nn.xpu_rms_norm(hidden_states, self.weight, self.variance_epsilon)[0] + except ImportError: + raise NotImplementedError( + f"Implementation of fused_rms_norm is not available on {get_env_device()}. Please install paddle_xpu to use this feature" + ) + if paddle.in_dynamic_mode(): with paddle.amp.auto_cast(False): hidden_states = hidden_states.astype("float32") @@ -316,8 +358,11 @@ def __init__(self, dim, max_position_embeddings=2048, base=10000): self.max_position_embeddings = max_position_embeddings self.base = base # [dim / 2] - self.inv_freq = 1.0 / (self.base ** (paddle.cast(paddle.arange(0, self.dim, 2), dtype="float32") / self.dim)) - self._set_cos_sin_cache(seq_len=max_position_embeddings) + with device_guard("cpu"): + self.inv_freq = 1.0 / ( + self.base ** (paddle.cast(paddle.arange(0, self.dim, 2), dtype="float32") / self.dim) + ) + self._set_cos_sin_cache(seq_len=max_position_embeddings) self.max_seq_len_cached = None @@ -517,7 +562,7 @@ def rotate_half(x): return paddle.concat([-x2, x1], axis=-1) # shape is the same as x -def apply_rotary_pos_emb(q, k, cos, sin, position_ids): +def apply_rotary_pos_emb(q, k, cos, sin, position_ids, fuse_rope=False): """Applies Rotary Position Embedding to the query and key tensors. Args: @@ -538,6 +583,24 @@ def apply_rotary_pos_emb(q, k, cos, sin, position_ids): Returns: `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding. 
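    Note:
        DeepSeek-V2 stores the rotary dimensions of `q` and `k` in pair-interleaved order,
        so both tensors are first regrouped into the half-split layout expected by
        `rotate_half` before the usual `x * cos + rotate_half(x) * sin` formula (or the
        fused XPU kernel, when `fuse_rope` is set) is applied.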
""" + b, s, h, d = q.shape + q = q.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d]) + + b, s, h, d = k.shape + k = k.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d]) + + if get_env_device() == "xpu" and fuse_rope: + q_embed, k_embed, _ = fused_rotary_position_embedding( + q, + k, + None, + sin=sin, + cos=cos, + position_ids=position_ids, + use_neox_rotary_style=False, + ) + return q_embed, k_embed + if position_ids is None: # Note: Only for MixtralForCausalLMPipe model pretraining cos = cos[:, : q.shape[1], :, :] # [bs, seq_len, 1, axis] @@ -548,19 +611,13 @@ def apply_rotary_pos_emb(q, k, cos, sin, position_ids): cos = cos[position_ids].unsqueeze(2) # [bs, seq_len, 1, axis] sin = sin[position_ids].unsqueeze(2) # [bs, seq_len, 1, axis] - b, s, h, d = q.shape - q = q.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d]) - - b, s, h, d = k.shape - k = k.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d]) - q_embed = (q * cos) + (rotate_half(q) * sin) k_embed = (k * cos) + (rotate_half(k) * sin) return q_embed, k_embed class DeepseekV2MLP(nn.Layer): - def __init__(self, config: DeepseekV2Config, hidden_size=None, intermediate_size=None): + def __init__(self, config: DeepseekV2Config, hidden_size=None, intermediate_size=None, is_moe=False): super().__init__() self.config = config self.hidden_size = config.hidden_size if hidden_size is None else hidden_size @@ -573,7 +630,7 @@ def __init__(self, config: DeepseekV2Config, hidden_size=None, intermediate_size ColumnParallelLinear = linear_utils.ColumnParallelLinear RowParallelLinear = linear_utils.RowParallelLinear - if config.tensor_parallel_degree > 1: + if config.tensor_parallel_degree > 1 and not is_moe: self.gate_proj = ColumnParallelLinear( self.hidden_size, self.intermediate_size, @@ -604,93 +661,44 @@ def forward(self, x): return down_proj -class MoEGate(nn.Layer): - def __init__(self, config: DeepseekV2Config): - super().__init__() - self.config = config - self.top_k = config.num_experts_per_tok - self.n_routed_experts = config.n_routed_experts - self.routed_scaling_factor = config.routed_scaling_factor +class MoEGate(PretrainedMoEGate): + def __init__(self, config, num_experts, expert_hidden_size, **kwargs): + super().__init__(config, num_experts, expert_hidden_size, **kwargs) + # [hidden_size, n_expert] + self.scoring_func = config.scoring_func - self.alpha = config.aux_loss_alpha - self.seq_aux = config.seq_aux self.topk_method = config.topk_method - self.n_group = config.n_group - self.topk_group = config.topk_group - # topk selection algorithm - self.norm_topk_prob = config.norm_topk_prob - self.gating_dim = config.hidden_size self.weight = paddle.create_parameter( - shape=[self.gating_dim, self.n_routed_experts], + shape=[expert_hidden_size, num_experts], dtype=paddle.get_default_dtype(), + is_bias=False, default_initializer=nn.initializer.Constant(1.0), ) + if config.topk_method == "noaux_tc": + self.e_score_correction_bias = paddle.create_parameter( + shape=[num_experts], + dtype=paddle.get_default_dtype(), + default_initializer=nn.initializer.Constant(0.0), + ) + def forward(self, hidden_states): - bsz, seq_len, h = hidden_states.shape + """ + Args: + hidden_states (_type_): [batch_size * seq_len, hidden_size] + """ + _, h_dim = hidden_states.shape + # compute gating score - hidden_states = hidden_states.reshape([-1, h]) - with paddle.amp.auto_cast(False): - logits = F.linear( - paddle.cast(hidden_states, 
paddle.float32), paddle.cast(self.weight, paddle.float32), None - ) + logits = F.linear(hidden_states, self.weight, None) - if self.scoring_func == "softmax": + with paddle.amp.auto_cast(False): + scores = self.gate_score_func(logits=logits) + scores = scores.cast(paddle.get_default_dtype()) - with paddle.amp.auto_cast(False): - scores = F.softmax(logits.astype("float32"), axis=-1) - else: - raise NotImplementedError(f"insupportable scoring function for MoE gating: {self.scoring_func}") - - # select top-k experts - if self.topk_method == "greedy": - topk_weight, topk_idx = paddle.topk(scores, k=self.top_k, axis=-1, sorted=False) - elif self.topk_method == "group_limited_greedy": - group_scores = scores.reshape([bsz * seq_len, self.n_group, -1]).max(axis=-1).values # [n, n_group] - group_idx = paddle.topk(group_scores, k=self.topk_group, axis=-1, sorted=False)[1] # [n, top_k_group] - group_mask = paddle.zeros_like(group_scores) # [n, n_group] - group_mask.scatter_(1, group_idx, 1) # [n, n_group] - score_mask = ( - group_mask.unsqueeze(-1) - .expand(bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group) - .reshape(bsz * seq_len, -1) - ) # [n, e] - tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0) # [n, e] - topk_weight, topk_idx = paddle.topk(tmp_scores, k=self.top_k, axis=-1, sorted=False) - - # norm gate to sum 1 - if self.top_k > 1 and self.norm_topk_prob: - denominator = topk_weight.sum(axis=-1, keepdim=True) + 1e-20 - topk_weight = topk_weight / denominator - else: - topk_weight = topk_weight * self.routed_scaling_factor - # expert-level computation auxiliary loss - if self.training and self.alpha > 0.0: - scores_for_aux = scores - aux_topk = self.top_k - # always compute aux loss based on the naive greedy topk method - topk_idx_for_aux_loss = topk_idx.reshape([bsz, -1]) # [bsz, top_k*seq_len] - if self.seq_aux: - scores_for_seq_aux = scores_for_aux.reshape([bsz, seq_len, -1]) - ce = paddle.zeros([bsz, self.n_routed_experts]) - ce.put_along_axis_( - axis=1, - indices=topk_idx_for_aux_loss, - values=paddle.ones([bsz, seq_len * aux_topk]), - reduce="add", - ) - ce /= seq_len * aux_topk / self.n_routed_experts - aux_loss = (ce * scores_for_seq_aux.mean(axis=1)).sum(axis=1).mean() * self.alpha - else: - mask_ce = F.one_hot(topk_idx_for_aux_loss.reshape([-1]), num_classes=self.n_routed_experts) - ce = mask_ce.float().mean(0) - Pi = scores_for_aux.mean(0) - fi = ce * self.n_routed_experts - aux_loss = (Pi * fi).sum() * self.alpha - else: - aux_loss = None - return topk_idx, topk_weight, aux_loss + capacity, combine_weights, dispatch_mask, exp_counts, l_aux, l_zloss = self.topkgating(scores) + return capacity, combine_weights, dispatch_mask, exp_counts, l_aux, l_zloss class AddAuxiliaryLoss(paddle.autograd.PyLayer): @@ -714,49 +722,47 @@ def backward(ctx, grad_output): return grad_output, grad_loss -class DeepseekV2MoE(nn.Layer): +class DeepseekV2MoE(MoELayer): """ A mixed expert module containing shared experts. 
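    Routing is delegated to ``MoEGate`` (top-k selection over the routed experts via
    ``topkgating``); the shared experts, when configured, are always applied and their
    output is added to the combined routed-expert output.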
""" - def __init__(self, config): - super().__init__() - self.config = config - self.num_experts_per_tok = config.num_experts_per_tok + def __init__(self, config: DeepseekV2Config): + gate = MoEGate( + config=config, + num_experts=config.n_routed_experts, + expert_hidden_size=config.hidden_size, + top_k=config.num_experts_per_tok, + topk_method=config.topk_method, + n_group=config.n_group, + topk_group=config.topk_group, + norm_topk_prob=config.norm_topk_prob, + routed_scaling_factor=config.routed_scaling_factor, + drop_tokens=False, + ) - self.ep_size = 1 - self.experts_per_rank = config.n_routed_experts - self.ep_rank = 0 - self.experts = nn.LayerList( - [ - DeepseekV2MLP(config, intermediate_size=config.moe_intermediate_size) - for i in range(config.n_routed_experts) - ] + super().__init__( + config=config, + moe_num_experts=config.n_routed_experts, + expert_class=DeepseekV2MLP, + expert_kwargs={"config": config, "intermediate_size": config.moe_intermediate_size, "is_moe": True}, + gate=gate, + capacity=2.0, ) - self.gate = MoEGate(config) + self.alpha = config.aux_loss_alpha if config.n_shared_experts is not None: intermediate_size = config.moe_intermediate_size * config.n_shared_experts - self.shared_experts = DeepseekV2MLP(config=config, intermediate_size=intermediate_size) + self.shared_experts = DeepseekV2MLP(config=config, intermediate_size=intermediate_size, is_moe=False) def forward(self, hidden_states): - identity = hidden_states - orig_shape = hidden_states.shape - topk_idx, topk_weight, aux_loss = self.gate(hidden_states) - hidden_states = hidden_states.reshape([-1, hidden_states.shape[-1]]) - flat_topk_idx = topk_idx.reshape([-1]) - # remove the infer method - hidden_states = hidden_states.repeat_interleave(self.num_experts_per_tok, axis=0) - y = paddle.empty_like(hidden_states) - for i, expert in enumerate(self.experts): - if paddle.any(flat_topk_idx == i): - y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i]) - y = (y.reshape([*topk_weight.shape, -1]) * topk_weight.unsqueeze(-1)).sum(axis=1) - y = paddle.cast(y, hidden_states.dtype).reshape([*orig_shape]) - if self.training and self.gate.alpha > 0.0: - y = AddAuxiliaryLoss.apply(y, aux_loss) + final_hidden_states, l_aux, l_zloss = super().forward(hidden_states) + if self.training and self.alpha > 0.0: + final_hidden_states = AddAuxiliaryLoss.apply(final_hidden_states, l_aux) + if self.config.n_shared_experts is not None: - y = y + self.shared_experts(identity) - return y + shared_expert_output = self.shared_experts(hidden_states) + final_hidden_states = final_hidden_states + shared_expert_output + return final_hidden_states def repeat_kv(hidden_states: paddle.Tensor, n_rep: int) -> paddle.Tensor: @@ -794,6 +800,7 @@ def __init__(self, config: DeepseekV2Config, layerwise_recompute: bool = False): self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim self.is_causal = True + self.fuse_rope = config.use_fused_rope self.seq_length = config.seq_length self.sequence_parallel = config.sequence_parallel @@ -832,6 +839,9 @@ def __init__(self, config: DeepseekV2Config, layerwise_recompute: bool = False): self.o_proj = RowParallelLinear(self.num_heads * self.v_head_dim, self.hidden_size, has_bias=config.attention_bias, input_is_parallel=True) + assert self.num_heads % config.tensor_parallel_degree == 0, f"num_heads: {self.num_heads}, tensor_parallel_degree: {config.tensor_parallel_degree}" + self.num_heads = self.num_heads // config.tensor_parallel_degree + else: # for without tensor parallel if 
self.q_lora_rank is None: @@ -912,11 +922,12 @@ def _shape(self, tensor: paddle.Tensor, seq_len: int, bsz: int): def forward( self, hidden_states: paddle.Tensor, - attention_mask: Optional[paddle.Tensor] = None, - position_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[Tuple[paddle.Tensor]] = None, past_key_value: Optional[Tuple[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, output_attentions: bool = False, use_cache: bool = False, + attn_mask_startend_row_indices: Optional[paddle.Tensor] = None, **kwargs, ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]: if "padding_mask" in kwargs: @@ -948,17 +959,11 @@ def forward( k_nope, value_states = paddle.split(kv, [self.qk_nope_head_dim, self.v_head_dim], axis=-1) kv_seq_len = value_states.shape[1] if past_key_value is not None: - if self.layer_idx is None: - raise ValueError( - f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " - "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " - "with a layer index." - ) kv_seq_len += past_key_value[0].shape[-3] cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) cos = cos[None, :, None, :] sin = sin[None, :, None, :] - q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids) + q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids, self.fuse_rope) query_states = paddle.empty([bsz, q_len, self.num_heads, self.q_head_dim], dtype=self.config.dtype) query_states[:, :, :, : self.qk_nope_head_dim] = q_nope @@ -990,6 +995,7 @@ def forward( value_states, attention_mask, output_attentions, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, softmax_scale=self.softmax_scale, training=self.training, sequence_parallel=self.sequence_parallel, @@ -1003,6 +1009,7 @@ def forward( value_states, attention_mask, output_attentions, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, softmax_scale=self.softmax_scale, training=self.training, sequence_parallel=self.sequence_parallel, @@ -1025,6 +1032,7 @@ def forward( class DeepseekV2DecoderLayer(nn.Layer): def __init__(self, config: DeepseekV2Config, layer_idx: int, layerwise_recompute: bool = False): super().__init__() + self.config = config self.enable_recompute = False self.layerwise_recompute = layerwise_recompute @@ -1049,11 +1057,12 @@ def __init__(self, config: DeepseekV2Config, layer_idx: int, layerwise_recompute def forward( self, hidden_states: paddle.Tensor, - attention_mask: Optional[paddle.Tensor] = None, position_ids: Optional[paddle.Tensor] = None, - past_key_value: Optional[Tuple[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, output_attentions: Optional[bool] = False, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, use_cache: Optional[bool] = False, + attn_mask_startend_row_indices: Optional[paddle.Tensor] = None, **kwargs, ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]: """ @@ -1086,24 +1095,26 @@ def forward( and has_gradient and self.recompute_granularity == "full_attn" ): - recompute() - hidden_states, self_attn_weights, present_key_value = self.self_attn( + hidden_states, self_attn_weights, present_key_value = recompute( + self.self_attn, hidden_states=hidden_states, - attention_mask=attention_mask, position_ids=position_ids, - past_key_value=past_key_value, + attention_mask=attention_mask, output_attentions=output_attentions, + past_key_value=past_key_value, 
use_cache=use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, **kwargs, ) else: hidden_states, self_attn_weights, present_key_value = self.self_attn( hidden_states=hidden_states, - attention_mask=attention_mask, position_ids=position_ids, - past_key_value=past_key_value, + attention_mask=attention_mask, output_attentions=output_attentions, + past_key_value=past_key_value, use_cache=use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, **kwargs, ) hidden_states = residual + hidden_states @@ -1122,6 +1133,9 @@ def forward( if use_cache: outputs += (present_key_value,) + if type(outputs) is tuple and len(outputs) == 1: + outputs = outputs[0] + return outputs @@ -1130,6 +1144,22 @@ class DeepseekV2PretrainedModel(PretrainedModel): base_model_prefix = "deepseek_v2" _no_split_modules = ["DeepseekV2DecoderLayer"] + def _get_model_flops(self, batch_size=1, seq_length=None, **kwargs): + from .mfu_utils import DeepSeekProjection + + # self._ + mfu_cal_proj = DeepSeekProjection(self.config) + if seq_length is None: + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return mfu_cal_proj.get_num_flop_per_token() + + def _get_hardware_flops(self, *args, **kwargs): + return self._get_model_flops(*args, **kwargs) + @classmethod def _get_name_mappings(cls, config: DeepseekV2Config) -> list[StateDictNameMapping]: mappings: list[StateDictNameMapping] = [] @@ -1137,7 +1167,8 @@ def _get_name_mappings(cls, config: DeepseekV2Config) -> list[StateDictNameMappi ["embed_tokens.weight"], ["norm.weight"], ] - for layer_index in range(config.num_hidden_layers): + # last one layer contains MTP (eagle) parameters for inference + for layer_index in range(config.num_hidden_layers + config.num_nextn_predict_layers): layer_mappings = [ [f"layers.{layer_index}.self_attn.q_proj.weight", None, "transpose"], [f"layers.{layer_index}.self_attn.q_a_proj.weight", None, "transpose"], @@ -1155,8 +1186,9 @@ def _get_name_mappings(cls, config: DeepseekV2Config) -> list[StateDictNameMappi ] model_mappings.extend(layer_mappings) - # MoE paramerters + # MoE parameters model_mappings.append([f"layers.{layer_index}.mlp.gate.weight", None, "transpose"]) + model_mappings.append([f"layers.{layer_index}.mlp.gate.e_score_correction_bias"]) for expert_idx in range(config.n_routed_experts): expert_mappings = [ [f"layers.{layer_index}.mlp.experts.{expert_idx}.gate_proj.weight", None, "transpose"], @@ -1168,12 +1200,20 @@ def _get_name_mappings(cls, config: DeepseekV2Config) -> list[StateDictNameMappi model_mappings.append([f"layers.{layer_index}.mlp.shared_experts.up_proj.weight", None, "transpose"]) model_mappings.append([f"layers.{layer_index}.mlp.shared_experts.down_proj.weight", None, "transpose"]) + # MTP (eagle) parameters for inference + if layer_index >= config.num_hidden_layers: + model_mappings.append([f"layers.{layer_index}.embed_tokens.weight"]) + model_mappings.append([f"layers.{layer_index}.enorm.weight"]) + model_mappings.append([f"layers.{layer_index}.hnorm.weight"]) + model_mappings.append([f"layers.{layer_index}.eh_proj.weight", None, "transpose"]) + model_mappings.append([f"layers.{layer_index}.shared_head.norm.weight"]) + model_mappings.append([f"layers.{layer_index}.shared_head.head.weight", None, "transpose"]) + init_name_mappings(mappings=model_mappings) - # base-model prefix "Qwen2MoEModel" - if "Qwen2Model" not in config.architectures: + if cls.base_model_class.__name__ not in config.architectures: for mapping in 
model_mappings: mapping[0] = "model." + mapping[0] - mapping[1] = "deepseek_v2." + mapping[1] + mapping[1] = f"{cls.base_model_prefix}." + mapping[1] if not config.tie_word_embeddings: model_mappings.append(["lm_head.weight", "lm_head.weight", "transpose"]) @@ -1211,23 +1251,45 @@ def get_tensor_parallel_split_mappings(num_layers): # Column Linear base_actions["layers.0.self_attn.q_proj.weight"] = partial(fn, is_column=True) base_actions["layers.0.self_attn.q_proj.bias"] = partial(fn, is_column=True) + base_actions["layers.0.self_attn.q_b_proj.weight"] = partial(fn, is_column=True) + # if we have enough num_key_value_heads to split, then split it. if config.num_key_value_heads % config.tensor_parallel_degree == 0: base_actions["layers.0.self_attn.k_proj.weight"] = partial(fn, is_column=True) base_actions["layers.0.self_attn.v_proj.weight"] = partial(fn, is_column=True) base_actions["layers.0.self_attn.k_proj.bias"] = partial(fn, is_column=True) base_actions["layers.0.self_attn.v_proj.bias"] = partial(fn, is_column=True) + base_actions["layers.0.self_attn.kv_b_proj.weight"] = partial(fn, is_column=True) base_actions["layers.0.mlp.up_proj.weight"] = partial(fn, is_column=True) base_actions["layers.0.mlp.gate_proj.weight"] = partial(fn, is_column=True) base_actions["layers.0.mlp.down_proj.weight"] = partial(fn, is_column=False) + base_actions["layers.0.mlp.shared_experts.gate_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.shared_experts.up_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.mlp.shared_experts.down_proj.weight"] = partial(fn, is_column=False) + for key, action in base_actions.items(): if "layers.0." in key: for i in range(num_layers): final_actions[key.replace("layers.0.", f"layers.{i}.")] = action final_actions[key] = action + # for MTP (eagle) parameters for inference + base_actions.pop("embed_tokens.weight") + base_actions.pop("lm_head.weight") + base_actions["layers.0.embed_tokens.weight"] = partial(fn, is_column=False) + base_actions["layers.0.eh_proj.weight"] = partial(fn, is_column=True) + base_actions["layers.0.shared_head.head.weight"] = partial(fn, is_column=True) + for key, action in base_actions.items(): + if "layers.0." in key: + for i in range( + config.num_hidden_layers, config.num_hidden_layers + config.num_nextn_predict_layers + ): + final_actions[key.replace("layers.0.", f"layers.{i}.")] = action + else: + final_actions[key] = action + return final_actions mappings = get_tensor_parallel_split_mappings(config.num_hidden_layers) @@ -1250,7 +1312,6 @@ def _init_weights(self, layer): linear_utils.ColumnSequenceParallelLinear, ), ): - # In the dygraph mode, use the `set_value` to reset the parameter directly, # and reset the `state_dict` to update parameter in static mode. 
if isinstance(layer.weight, paddle.Tensor): @@ -1261,7 +1322,7 @@ def _init_weights(self, layer): mean=0.0, std=self.config.initializer_range if hasattr(self.config, "initializer_range") - else self.deepseek_v2.config.initializer_range, + else self.config.initializer_range, shape=layer.weight.shape, ) ) @@ -1271,7 +1332,7 @@ def _init_weights(self, layer): mean=0.0, std=self.config.initializer_range if hasattr(self.config, "initializer_range") - else self.deepseek_v2.config.initializer_range, + else self.config.initializer_range, shape=layer.weight.shape, ) ) @@ -1341,12 +1402,10 @@ def _prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values # For decoding phase in generation, seq_length = 1, we don't need to add causal mask if input_shape[-1] > 1: combined_attention_mask = _make_causal_mask( - input_shape, past_key_values_length=past_key_values_length + input_shape, + past_key_values_length=past_key_values_length, ) - if get_env_device() == "npu": - expanded_attn_mask = expanded_attn_mask.astype("bool") & combined_attention_mask.astype("bool") - else: - expanded_attn_mask = expanded_attn_mask & combined_attention_mask + expanded_attn_mask = expanded_attn_mask & combined_attention_mask # [bsz, seq_len, seq_len] -> [bsz, 1, seq_len, seq_len] elif len(attention_mask.shape) == 3: expanded_attn_mask = attention_mask.unsqueeze(1).astype("bool") @@ -1354,20 +1413,19 @@ def _prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values else: expanded_attn_mask = attention_mask else: - expanded_attn_mask = _make_causal_mask(input_shape, past_key_values_length=past_key_values_length) + expanded_attn_mask = _make_causal_mask( + input_shape, + past_key_values_length=past_key_values_length, + ) # Convert bool attention_mask to float attention mask, which will be added to attention_scores later - if get_env_device() == "npu": + if get_env_device() == "xpu": x = paddle.to_tensor(0.0, dtype="float32") - y = paddle.to_tensor(paddle.finfo(dtype).min, dtype="float32") - expanded_attn_mask = expanded_attn_mask.astype("float32") - expanded_attn_mask = paddle.where(expanded_attn_mask, x, y).astype(dtype) - elif get_env_device() in ["xpu", "gcu"]: - x = paddle.to_tensor(0.0, dtype=dtype) - y = paddle.to_tensor(paddle.finfo(dtype).min, dtype=dtype) - expanded_attn_mask = expanded_attn_mask.astype(dtype) - expanded_attn_mask = paddle.where(expanded_attn_mask, x, y).astype(dtype) + y = paddle.to_tensor(-1.7005809656952787e38, dtype="float32") + expanded_attn_mask = paddle.where(expanded_attn_mask, x, y) else: - expanded_attn_mask = paddle.where(expanded_attn_mask, 0.0, paddle.finfo(dtype).min).astype(dtype) + expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), 0.0, paddle.finfo(dtype).min).astype( + dtype + ) return expanded_attn_mask @paddle.jit.not_to_static @@ -1375,11 +1433,12 @@ def recompute_training_full( self, layer_module: nn.Layer, hidden_states: Tensor, - attention_mask: Tensor, position_ids: Optional[Tensor], - past_key_value: Tensor, + attention_mask: Tensor, output_attentions: bool, + past_key_value: Tensor, use_cache: bool, + attn_mask_startend_row_indices: Optional[Tensor] = None, ): def create_custom_forward(module): def custom_forward(*inputs): @@ -1390,11 +1449,12 @@ def custom_forward(*inputs): hidden_states = recompute( create_custom_forward(layer_module), hidden_states, - attention_mask, position_ids, - past_key_value, + attention_mask, output_attentions, + past_key_value, use_cache, + attn_mask_startend_row_indices, 
use_reentrant=self.config.recompute_use_reentrant, ) @@ -1403,14 +1463,16 @@ def custom_forward(*inputs): def forward( self, input_ids: paddle.Tensor = None, - attention_mask: Optional[paddle.Tensor] = None, position_ids: Optional[paddle.Tensor] = None, - past_key_values: Optional[List[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, inputs_embeds: Optional[paddle.Tensor] = None, use_cache: Optional[bool] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, + attn_mask_startend_row_indices: Optional[Tensor] = None, + **kwargs, ) -> Union[Tuple, BaseModelOutputWithPast]: output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( @@ -1459,17 +1521,20 @@ def forward( inputs_embeds = self.embed_tokens(input_ids) # embed positions - if attention_mask is None: + if attn_mask_startend_row_indices is not None or get_use_casual_mask(): + attention_mask = None + else: # [bs, seq_len] - attention_mask = paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool) - - # 4d mask is passed through the layers - attention_mask = self._prepare_decoder_attention_mask( - attention_mask, - (batch_size, seq_length), - past_key_values_length, - inputs_embeds.dtype, - ) + attention_mask = ( + paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool) + if attention_mask is None + else attention_mask + ) + attention_mask = self._prepare_decoder_attention_mask( + attention_mask, (batch_size, seq_length), past_key_values_length, inputs_embeds.dtype + ) # [bs, 1, seq_len, seq_len] + if self.config.use_flash_attention: + attention_mask = None if is_casual_mask(attention_mask) else attention_mask if self.config.sequence_parallel: # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim] @@ -1501,21 +1566,23 @@ def forward( ): layer_outputs = self.recompute_training_full( decoder_layer, - hidden_states, - attention_mask, - position_ids, - past_key_value, - output_attentions, - use_cache, + hidden_states=hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + output_attentions=output_attentions, + past_key_value=past_key_value, + use_cache=use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, ) else: layer_outputs = decoder_layer( - hidden_states, - attention_mask=attention_mask, + hidden_states=hidden_states, position_ids=position_ids, - past_key_value=past_key_value, + attention_mask=attention_mask, output_attentions=output_attentions, + past_key_value=past_key_value, use_cache=use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, ) # NOTE: clear outdate cache after it has been used for memory saving @@ -1549,14 +1616,14 @@ def forward( ) -class DeepSeekV2PretrainingCriterion(nn.Layer): +class DeepseekV2PretrainingCriterion(nn.Layer): """ Criterion for Mixtral. It calculates the final loss. 
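    The loss is a masked mean over the non-ignored tokens, roughly
    ``sum(loss_i * [loss_i > 0]) / count``, falling back to zero when every token
    is ignored.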
""" def __init__(self, config: DeepseekV2Config): - super(DeepSeekV2PretrainingCriterion, self).__init__() + super(DeepseekV2PretrainingCriterion, self).__init__() self.ignore_index = getattr(config, "ignore_index", -100) self.config = config self.enable_parallel_cross_entropy = config.tensor_parallel_degree > 1 and config.tensor_parallel_output @@ -1578,15 +1645,23 @@ def forward(self, prediction_scores, masked_lm_labels): masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2)) # skip ignore_index which loss == 0 - masked_lm_loss = masked_lm_loss[masked_lm_loss > 0] - loss = paddle.mean(masked_lm_loss) + # masked_lm_loss = masked_lm_loss[masked_lm_loss > 0] + # loss = paddle.mean(masked_lm_loss) + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) + else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count return loss -class DeepSeekV2LMHead(nn.Layer): +class DeepseekV2LMHead(nn.Layer): def __init__(self, config: DeepseekV2Config): - super().__init__() + super(DeepseekV2LMHead, self).__init__() self.config = config if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0: @@ -1611,9 +1686,7 @@ def forward(self, hidden_states, tensor_parallel_output=None): if tensor_parallel_output is None: tensor_parallel_output = self.config.tensor_parallel_output - logits = parallel_matmul( - hidden_states, self.weight, transpose_y=False, tensor_parallel_output=tensor_parallel_output - ) + logits = parallel_matmul(hidden_states, self.weight, tensor_parallel_output=tensor_parallel_output) return logits @@ -1622,10 +1695,11 @@ class DeepseekV2ForCausalLM(DeepseekV2PretrainedModel): def __init__(self, config: DeepseekV2Config): super().__init__(config) + self.config = config self.deepseek_v2 = DeepseekV2Model(config) self.vocab_size = config.vocab_size - self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias_attr=False) - self.criterion = DeepSeekV2PretrainingCriterion(config) + self.lm_head = DeepseekV2LMHead(config) + self.criterion = DeepseekV2PretrainingCriterion(config) def get_input_embeddings(self): return self.deepseek_v2.embed_tokens @@ -1648,15 +1722,16 @@ def get_decoder(self): def forward( self, input_ids: paddle.Tensor = None, - attention_mask: Optional[paddle.Tensor] = None, position_ids: Optional[paddle.Tensor] = None, - past_key_values: Optional[List[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, inputs_embeds: Optional[paddle.Tensor] = None, labels: Optional[paddle.Tensor] = None, use_cache: Optional[bool] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, + attn_mask_startend_row_indices=None, ) -> Union[Tuple, CausalLMOutputWithPast]: r""" Args: @@ -1689,26 +1764,57 @@ def forward( ) return_dict = return_dict if return_dict is not None else self.config.use_return_dict + if attn_mask_startend_row_indices is not None and attention_mask is not None: + logger.warning( + "You have provided both attn_mask_startend_row_indices and attention_mask. " + "The attn_mask_startend_row_indices will be used." 
+ ) + attention_mask = None + # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) outputs = self.deepseek_v2( input_ids=input_ids, - attention_mask=attention_mask, position_ids=position_ids, - past_key_values=past_key_values, + attention_mask=attention_mask, inputs_embeds=inputs_embeds, use_cache=use_cache, + past_key_values=past_key_values, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, ) hidden_states = outputs[0] - logits = self.lm_head(hidden_states) - loss = None - # TODO@DrownFish19: shift labels - if labels is not None: - loss = self.criterion(logits, labels) + if labels is not None and self.config.use_fused_linear_cross_entropy: + from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy + + assert ( + self.config.tensor_parallel_degree <= 1 + ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel " + + masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels) + + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) + else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count + logits = None + else: + # if labels is None,means we need full output, instead of tensor_parallel_output + # tensor_parallel_output is together with ParallelCrossEntropy + tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1 + + logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + + loss = None + if labels is not None: + loss = self.criterion(logits, labels) if not return_dict: output = (logits,) + outputs[1:] @@ -1723,49 +1829,13 @@ def forward( ) def prepare_inputs_for_generation( - self, - input_ids, - past_key_values=None, - attention_mask=None, - inputs_embeds=None, - **kwargs, + self, input_ids, use_cache=False, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs ): - if past_key_values is not None: - if isinstance(past_key_values, Tuple[paddle.Tensor]): - cache_length = past_key_values.get_seq_length() - past_length = past_key_values.seen_tokens - max_cache_length = past_key_values.get_max_length() - else: - cache_length = past_length = past_key_values[0][0].shape[2] - max_cache_length = None - - # Keep only the unprocessed tokens: - # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where - # some of the inputs are exclusivelly passed as part of the cache (e.g. when passing input_embeds as - # input) - if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]: - input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :] - # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard - # input_ids based on the past_length. - elif past_length < input_ids.shape[1]: - input_ids = input_ids[:, past_length:] - # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens. - - # If we are about to go beyond the maximum cache length, we need to crop the input attention mask. 
- if ( - max_cache_length is not None - and attention_mask is not None - and cache_length + input_ids.shape[1] > max_cache_length - ): - attention_mask = attention_mask[:, -max_cache_length:] - - position_ids = kwargs.get("position_ids", None) - if attention_mask is not None and position_ids is None: - # create position_ids on the fly for batch generation - position_ids = attention_mask.long().cumsum(-1) - 1 - position_ids.masked_fill_(attention_mask == 0, 1) - if past_key_values: - position_ids = position_ids[:, -input_ids.shape[1] :] + batch_size, seq_length = input_ids.shape + position_ids = kwargs.get("position_ids", paddle.arange(seq_length).expand((batch_size, seq_length))) + if past_key_values: + input_ids = input_ids[:, -1].unsqueeze(axis=-1) + position_ids = position_ids[:, -1].unsqueeze(-1) # if `inputs_embeds` are passed, we only want to use them in the 1st generation step if inputs_embeds is not None and past_key_values is None: @@ -1777,12 +1847,49 @@ def prepare_inputs_for_generation( { "position_ids": position_ids, "past_key_values": past_key_values, - "use_cache": kwargs.get("use_cache"), + "use_cache": use_cache, "attention_mask": attention_mask, } ) return model_inputs + def _get_model_inputs_spec(self, dtype: str): + return { + "input_ids": paddle.static.InputSpec(shape=[None, None], dtype="int64"), + "attention_mask": paddle.static.InputSpec(shape=[None, None], dtype="int64"), + "position_ids": paddle.static.InputSpec(shape=[None, None], dtype="int64"), + } + + @staticmethod + def update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False): + # update cache + if isinstance(outputs, tuple) and len(outputs) > 1 and not isinstance(outputs[1], paddle.Tensor): + model_kwargs["past_key_values"] = outputs[1] + + if isinstance(outputs, CausalLMOutputWithPast) and "past_key_values" in outputs: + model_kwargs["past_key_values"] = outputs.past_key_values + + # update position_ids + if "position_ids" in model_kwargs and model_kwargs["position_ids"] is not None: + position_ids = model_kwargs["position_ids"] + model_kwargs["position_ids"] = paddle.concat([position_ids, position_ids[..., -1:] + 1], axis=-1) + + if not is_encoder_decoder and "attention_mask" in model_kwargs: + # TODO: support attention mask for other models + attention_mask = model_kwargs["attention_mask"] + if len(attention_mask.shape) == 2: + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype=attention_mask.dtype)], + axis=-1, + ) + elif len(attention_mask.shape) == 4: + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([*attention_mask.shape[:3], 1], dtype=attention_mask.dtype)], + axis=-1, + )[:, :, -1:, :] + + return model_kwargs + @staticmethod def _reorder_cache(past_key_values, beam_idx): reordered_past = () diff --git a/paddlenlp/transformers/deepseek_v2/modeling_auto.py b/paddlenlp/transformers/deepseek_v2/modeling_auto.py new file mode 100644 index 000000000000..284b12a29cb8 --- /dev/null +++ b/paddlenlp/transformers/deepseek_v2/modeling_auto.py @@ -0,0 +1,994 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2023 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved. +# +# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX +# and OPT implementations in this library. 
It has been modified from its +# original forms to accommodate minor architectural differences compared +# to GPT-NeoX and OPT used by the Meta AI team that trained the model. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Paddle DeepSeek_V2 model.""" + +from __future__ import annotations + +import warnings +from typing import List, Optional, Tuple, Union + +import paddle +import paddle.nn.functional as F +from paddle import Tensor, nn +from paddle.distributed.fleet.utils import recompute +from paddle.nn import Linear + +try: + from paddle.incubate.nn.functional import fused_rotary_position_embedding +except ImportError: + fused_rotary_position_embedding = None + +try: + from paddle.nn.functional.flash_attention import flash_attention +except: + flash_attention = None + +import paddle.distributed as dist + +from ...utils.log import logger +from ...utils.tools import get_env_device +from ..activations import ACT2FN +from ..llama import fusion_ops +from ..llama.modeling import get_use_casual_mask +from ..model_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast +from ..model_utils import PretrainedModel, register_base_model +from ..moe_layer import MoELayer +from .configuration import DeepseekV2Config +from .modeling import ( + AddAuxiliaryLoss, + DeepseekV2DynamicNTKScalingRotaryEmbedding, + DeepseekV2LinearScalingRotaryEmbedding, + DeepseekV2PretrainingCriterion, + DeepseekV2RMSNorm, + DeepseekV2RotaryEmbedding, + DeepseekV2YarnRotaryEmbedding, + MoEGate, + _expand_2d_mask, + _make_causal_mask, + apply_rotary_pos_emb, + get_triangle_upper_mask, + is_casual_mask, + yarn_get_mscale, +) + +__all__ = [ + "DeepseekV2LMHeadAuto", + "DeepseekV2ForCausalLMAuto", + "DeepseekV2ModelAuto", + "DeepseekV2PretrainedModelAuto", +] + + +def scaled_dot_product_attention( + query_states, + config, + key_states, + value_states, + attention_mask, + output_attentions, + attn_mask_startend_row_indices=None, + softmax_scale=1.0, + training=True, + sequence_parallel=False, +): + bsz, q_len, num_heads, head_dim = query_states.shape + _, kv_seq_len, v_num_heads, v_head_dim = value_states.shape + + if config.use_flash_attention and flash_attention: + # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim] + # Torch Flash Attention input [ bz, nhead, seqlen, head_dim] + + # Note: Flash Attention does not support softmax_scale, so we need to scale the query_states + q_head_dim = query_states.shape[-1] + softmax_scale = softmax_scale * (q_head_dim**0.5) + query_states = query_states * softmax_scale + value_padding = paddle.zeros( + [bsz, kv_seq_len, v_num_heads, head_dim - v_head_dim], + dtype=value_states.dtype, + ) + value_states = paddle.concat([value_states, value_padding], axis=-1) + + outputs = fusion_ops.fusion_flash_attention( + query_states, + config, + key_states, + value_states, + attention_mask, + output_attentions, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + sequence_parallel=False, + ) + + if isinstance(outputs, tuple): + outputs[0] = outputs[0].reshape([bsz, 
q_len, v_num_heads, head_dim]) + outputs[0] = outputs[0][..., :v_head_dim] + outputs[0] = outputs[0].reshape([bsz, q_len, -1]) + else: + outputs = outputs.reshape([bsz, q_len, v_num_heads, head_dim]) + outputs = outputs[..., :v_head_dim] + outputs = outputs.reshape([bsz, q_len, -1]) + return outputs + + else: + # [ bz, seqlen, nhead, head_dim] -> [bs, nhead, seq_len, head_dim] + query_states = paddle.transpose(query_states, [0, 2, 1, 3]) + # merge with the next transpose + key_states = paddle.transpose(key_states, [0, 2, 1, 3]) + value_states = paddle.transpose(value_states, [0, 2, 1, 3]) + + # matmul and divide by sqrt(head_dim) + attn_weights = paddle.matmul(query_states * softmax_scale, key_states.transpose([0, 1, 3, 2])) + + if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]: + raise ValueError( + f"Attention weights should be of shape {(bsz, num_heads, q_len, kv_seq_len)}, but is" + f" {attn_weights.shape}" + ) + + if attention_mask is None: + attention_mask = get_triangle_upper_mask(attn_weights) + attention_mask = attention_mask.reshape([bsz, 1, q_len, kv_seq_len]) + if attention_mask.shape != [bsz, 1, q_len, kv_seq_len]: + raise ValueError( + f"Attention mask should be of shape {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" + ) + + attn_weights = attn_weights + attention_mask + if not paddle.in_dynamic_mode(): + attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) + else: + with paddle.amp.auto_cast(False): + attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) + + attn_weights = F.dropout(attn_weights, p=config.attention_dropout, training=training) + + attn_output = paddle.matmul(attn_weights.astype("float32"), value_states.astype("float32")) + attn_output = attn_output.transpose([0, 2, 1, 3]) + + if sequence_parallel: + attn_output = attn_output.reshape([bsz * q_len, v_head_dim * num_heads]) + else: + attn_output = attn_output.reshape([bsz, q_len, v_head_dim * num_heads]) + return (attn_output, attn_weights) if output_attentions else attn_output + + +class DeepseekV2MLPAuto(nn.Layer): + def __init__(self, config: DeepseekV2Config, hidden_size=None, intermediate_size=None): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size if hidden_size is None else hidden_size + self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size + + self.gate_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False) + self.up_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False) + self.down_proj = Linear(self.intermediate_size, self.hidden_size, bias_attr=False) + + self.act_fn = ACT2FN[config.hidden_act] + + def forward(self, x): + down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) + return down_proj + + +class DeepseekV2MoEAuto(MoELayer): + """ + A mixed expert module containing shared experts. 
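+    Same routing structure as ``DeepseekV2MoE``, but built from plain
+    ``DeepseekV2MLPAuto`` experts (ordinary ``nn.Linear``) for the auto-parallel path.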
+ """ + + def __init__(self, config: DeepseekV2Config): + gate = MoEGate( + config=config, + num_experts=config.n_routed_experts, + expert_hidden_size=config.hidden_size, + top_k=config.num_experts_per_tok, + topk_method=config.topk_method, + n_group=config.n_group, + topk_group=config.topk_group, + norm_topk_prob=config.norm_topk_prob, + routed_scaling_factor=config.routed_scaling_factor, + drop_tokens=False, + ) + + super().__init__( + config=config, + moe_num_experts=config.n_routed_experts, + expert_class=DeepseekV2MLPAuto, + expert_kwargs={"config": config, "intermediate_size": config.moe_intermediate_size}, + gate=gate, + capacity=2.0, + ) + self.alpha = config.aux_loss_alpha + if config.n_shared_experts is not None: + intermediate_size = config.moe_intermediate_size * config.n_shared_experts + self.shared_experts = DeepseekV2MLPAuto(config=config, intermediate_size=intermediate_size) + + def forward(self, hidden_states): + final_hidden_states, l_aux, l_zloss = super().forward(hidden_states) + if self.training and self.alpha > 0.0: + final_hidden_states = AddAuxiliaryLoss.apply(final_hidden_states, l_aux) + + if self.config.n_shared_experts is not None: + shared_expert_output = self.shared_experts(hidden_states) + final_hidden_states = final_hidden_states + shared_expert_output + return final_hidden_states + + +# Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->DeepseekV2 +class DeepseekV2AttentionAuto(nn.Layer): + """Multi-headed attention from 'Attention Is All You Need' paper""" + + def __init__(self, config: DeepseekV2Config, layerwise_recompute: bool = False): + super().__init__() + self.config = config + self.attention_dropout = config.attention_dropout + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + + self.max_position_embeddings = config.max_position_embeddings + self.rope_theta = config.rope_theta + self.q_lora_rank = config.q_lora_rank + self.qk_rope_head_dim = config.qk_rope_head_dim + self.kv_lora_rank = config.kv_lora_rank + self.v_head_dim = config.v_head_dim + self.qk_nope_head_dim = config.qk_nope_head_dim + self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim + + self.is_causal = True + + self.seq_length = config.seq_length + + # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True + # Enable_recompute defaults to False and is controlled by Trainer + self.enable_recompute = False + self.layerwise_recompute = layerwise_recompute + self.recompute_granularity = config.recompute_granularity + + # Note (@DrownFish19): For tensor parallel we consider that q_a_proj and kv_a_proj_with_mqa + # are the small weight and cannot achieve performance gain. So we use the original + # linear layers. We use the tensor parallel linear layers for q_proj,q_b_proj and kv_b_proj + # for which are the large weight and can achieve performance gain. 
+ + # fmt: off + # for without tensor parallel + if self.q_lora_rank is None: + self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.q_head_dim, bias_attr=False) + else: + self.q_a_proj = nn.Linear(self.hidden_size, config.q_lora_rank, bias_attr=config.attention_bias) + self.q_a_layernorm = DeepseekV2RMSNorm(config=config, hidden_size=config.q_lora_rank) + self.q_b_proj = nn.Linear(config.q_lora_rank, self.num_heads * self.q_head_dim, bias_attr=False) + + self.kv_a_proj_with_mqa = nn.Linear(self.hidden_size, config.kv_lora_rank + config.qk_rope_head_dim, bias_attr=config.attention_bias) + self.kv_a_layernorm = DeepseekV2RMSNorm(config=config, hidden_size=config.kv_lora_rank) + self.kv_b_proj = nn.Linear(config.kv_lora_rank, self.num_heads * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim), bias_attr=False) + + self.o_proj = nn.Linear(self.num_heads * self.v_head_dim, self.hidden_size, bias_attr=config.attention_bias) + # fmt: on + + self._init_rope() + + self.softmax_scale = self.q_head_dim ** (-0.5) + if self.config.rope_scaling is not None: + mscale_all_dim = self.config.rope_scaling.get("mscale_all_dim", 0) + scaling_factor = self.config.rope_scaling["factor"] + if mscale_all_dim: + mscale = yarn_get_mscale(scaling_factor, mscale_all_dim) + self.softmax_scale = self.softmax_scale * mscale * mscale + + self.attn_func = scaled_dot_product_attention + + def _init_rope(self): + if self.config.rope_scaling is None: + self.rotary_emb = DeepseekV2RotaryEmbedding( + self.qk_rope_head_dim, + max_position_embeddings=self.max_position_embeddings, + base=self.rope_theta, + ) + else: + scaling_type = self.config.rope_scaling["type"] + scaling_factor = self.config.rope_scaling["factor"] + if scaling_type == "linear": + self.rotary_emb = DeepseekV2LinearScalingRotaryEmbedding( + self.qk_rope_head_dim, + max_position_embeddings=self.max_position_embeddings, + scaling_factor=scaling_factor, + base=self.rope_theta, + ) + elif scaling_type == "dynamic": + self.rotary_emb = DeepseekV2DynamicNTKScalingRotaryEmbedding( + self.qk_rope_head_dim, + max_position_embeddings=self.max_position_embeddings, + scaling_factor=scaling_factor, + base=self.rope_theta, + ) + elif scaling_type == "yarn": + kwargs = { + key: self.config.rope_scaling[key] + for key in [ + "original_max_position_embeddings", + "beta_fast", + "beta_slow", + "mscale", + "mscale_all_dim", + ] + if key in self.config.rope_scaling + } + self.rotary_emb = DeepseekV2YarnRotaryEmbedding( + self.qk_rope_head_dim, + max_position_embeddings=self.max_position_embeddings, + scaling_factor=scaling_factor, + base=self.rope_theta, + **kwargs, + ) + else: + raise ValueError(f"Unknown RoPE scaling type {scaling_type}") + + def _shape(self, tensor: paddle.Tensor, seq_len: int, bsz: int): + return tensor.reshape([bsz, seq_len, self.num_heads, self.v_head_dim]).transpose([1, 0, 2, 3]) + + def forward( + self, + hidden_states: paddle.Tensor, + position_ids: Optional[Tuple[paddle.Tensor]] = None, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: bool = False, + use_cache: bool = False, + attn_mask_startend_row_indices: Optional[paddle.Tensor] = None, + **kwargs, + ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]: + if "padding_mask" in kwargs: + warnings.warn( + "Passing `padding_mask` is deprecated and will be removed in v4.37. 
Please make sure use `attention_mask` instead.`" + ) + bsz, q_len, _ = hidden_states.shape + + # DeepSeekV2 q_lora_rank=1536 + # DeepSeekV2-lite q_lora_rank=None + if self.q_lora_rank is None: + q = self.q_proj(hidden_states) + else: + q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states))) + q = q.reshape([bsz, q_len, self.num_heads, self.q_head_dim]) + q_nope, q_pe = paddle.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], axis=-1) + + # DeepSeekV2 kv_lora_rank+qk_rope_head_dim=512+64 + compressed_kv = self.kv_a_proj_with_mqa(hidden_states) + compressed_kv, k_pe = paddle.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], axis=-1) + k_pe = k_pe.reshape([bsz, q_len, 1, self.qk_rope_head_dim]) + + # self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim = 128+64 + # self.num_heads * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim) = config.qk_nope_head_dim + self.v_head_dim = 128+128 + kv = self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).reshape( + [bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim] + ) + + k_nope, value_states = paddle.split(kv, [self.qk_nope_head_dim, self.v_head_dim], axis=-1) + kv_seq_len = value_states.shape[1] + if past_key_value is not None: + kv_seq_len += past_key_value[0].shape[-3] + cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) + cos = cos[None, :, None, :] + sin = sin[None, :, None, :] + q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids) + + query_states = paddle.empty([bsz, q_len, self.num_heads, self.q_head_dim], dtype=self.config.dtype) + query_states = paddle.concat([q_nope, q_pe], axis=-1) + # query_states[:, :, :, : self.qk_nope_head_dim] = q_nope + # query_states[:, :, :, self.qk_nope_head_dim :] = q_pe + + key_states = paddle.empty([bsz, q_len, self.num_heads, self.q_head_dim], dtype=self.config.dtype) + # input[0]'s shape = [1, 2048, 16, 128], input[1]'s shape = [1, 2048, 1, 64]. 
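The shape comments above are the crux of the MLA layout: the query/key head dimension is `qk_nope_head_dim + qk_rope_head_dim`, rotary embeddings touch only the rope slice, and the single rope key head is broadcast across all attention heads before concatenation. A standalone shape check, using assumed DeepSeek-V2-style sizes (128/64/128) and a small head count for brevity:

```python
import paddle

# Illustrative sizes only; the real values come from DeepseekV2Config.
bsz, q_len, num_heads = 1, 8, 4
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128
q_head_dim = qk_nope_head_dim + qk_rope_head_dim  # 192, as in self.q_head_dim above

q = paddle.randn([bsz, q_len, num_heads, q_head_dim])
# Only the rope slice would receive rotary embeddings; the nope slice passes through unchanged.
q_nope, q_pe = paddle.split(q, [qk_nope_head_dim, qk_rope_head_dim], axis=-1)

# kv_b_proj produces (qk_nope_head_dim + v_head_dim) features per head.
kv = paddle.randn([bsz, q_len, num_heads, qk_nope_head_dim + v_head_dim])
k_nope, v = paddle.split(kv, [qk_nope_head_dim, v_head_dim], axis=-1)

# A single shared rope key head is broadcast across all heads, mirroring the expand above.
k_pe = paddle.randn([bsz, q_len, 1, qk_rope_head_dim]).expand([bsz, q_len, num_heads, qk_rope_head_dim])
k = paddle.concat([k_nope, k_pe], axis=-1)

assert k.shape[-1] == q.shape[-1] == q_head_dim  # queries and keys line up at 192 dims per head
```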
+ key_states = paddle.concat([k_nope, k_pe.expand([bsz, q_len, self.num_heads, k_pe.shape[-1]])], axis=-1) + + # key_states[:, :, :, : self.qk_nope_head_dim] = k_nope + # key_states[:, :, :, self.qk_nope_head_dim :] = k_pe + + # [bs, seq_len, num_head, head_dim] + if past_key_value is not None: + # reuse k, v, self_attention + key_states = paddle.concat([past_key_value[0], key_states], axis=1) + value_states = paddle.concat([past_key_value[1], value_states], axis=1) + past_key_value = (key_states, value_states) if use_cache else None + + has_gradient = not (query_states.stop_gradient and key_states.stop_gradient and value_states.stop_gradient) + if ( + self.enable_recompute + and self.layerwise_recompute + and has_gradient + and self.recompute_granularity == "core_attn" + ): + outputs = recompute( + self.attn_func, + query_states, + self.config, + key_states, + value_states, + attention_mask, + output_attentions, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + softmax_scale=self.softmax_scale, + training=self.training, + use_reentrant=self.config.recompute_use_reentrant, + ) + else: + outputs = self.attn_func( + query_states, + self.config, + key_states, + value_states, + attention_mask, + output_attentions, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + softmax_scale=self.softmax_scale, + training=self.training, + ) + if output_attentions: + attn_output, attn_weights = outputs + else: + attn_output = outputs + + # if sequence_parallel is true, out shape are [q_len / n, bs, num_head * head_dim] + # else their shape are [bs, q_len, num_head * head_dim], n is mp parallelism. + attn_output = self.o_proj(attn_output) + + if not output_attentions: + attn_weights = None + + return attn_output, attn_weights, past_key_value + + +class DeepseekV2DecoderLayerAuto(nn.Layer): + def __init__(self, config: DeepseekV2Config, layer_idx: int, layerwise_recompute: bool = False): + super().__init__() + self.config = config + + self.enable_recompute = False + self.layerwise_recompute = layerwise_recompute + self.recompute_granularity = config.recompute_granularity + + self.hidden_size = config.hidden_size + + self.self_attn = DeepseekV2AttentionAuto(config=config, layerwise_recompute=layerwise_recompute) + + self.mlp = ( + DeepseekV2MoEAuto(config) + if ( + config.n_routed_experts is not None + and layer_idx >= config.first_k_dense_replace + and layer_idx % config.moe_layer_freq == 0 + ) + else DeepseekV2MLPAuto(config) + ) + self.input_layernorm = DeepseekV2RMSNorm(config) + self.post_attention_layernorm = DeepseekV2RMSNorm(config) + + def forward( + self, + hidden_states: paddle.Tensor, + position_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + use_cache: Optional[bool] = False, + attn_mask_startend_row_indices: Optional[paddle.Tensor] = None, + **kwargs, + ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]: + """ + Args: + hidden_states (`paddle.Tensor`): input to the layer of shape `(batch, seq_len, embed_axis)` + attention_mask (`paddle.Tensor`, *optional*): + attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1, + query_sequence_length, key_sequence_length)` if default attention is used. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. 
See `attentions` under + returned tensors for more detail. + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding + (see `past_key_values`). + past_key_value (`Tuple(paddle.Tensor)`, *optional*): cached past key and value projection states + """ + if "padding_mask" in kwargs: + warnings.warn( + "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" + ) + residual = hidden_states + + hidden_states = self.input_layernorm(hidden_states) + + # Self Attention + has_gradient = not hidden_states.stop_gradient + if ( + self.enable_recompute + and self.layerwise_recompute + and has_gradient + and self.recompute_granularity == "full_attn" + ): + hidden_states, self_attn_weights, present_key_value = recompute( + self.self_attn, + hidden_states=hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + output_attentions=output_attentions, + past_key_value=past_key_value, + use_cache=use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + **kwargs, + ) + else: + hidden_states, self_attn_weights, present_key_value = self.self_attn( + hidden_states=hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + output_attentions=output_attentions, + past_key_value=past_key_value, + use_cache=use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + **kwargs, + ) + hidden_states = residual + hidden_states + + # Fully Connected + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = residual + hidden_states + + outputs = (hidden_states,) + + if output_attentions: + outputs += (self_attn_weights,) + + if use_cache: + outputs += (present_key_value,) + + if type(outputs) is tuple and len(outputs) == 1: + outputs = outputs[0] + + return outputs + + +class DeepseekV2PretrainedModelAuto(PretrainedModel): + config_class = DeepseekV2Config + base_model_prefix = "deepseek_v2" + _no_split_modules = ["DeepseekV2DecoderLayerAuto"] + + +@register_base_model +class DeepseekV2ModelAuto(DeepseekV2PretrainedModelAuto): + """ + Transformer decoder consisting of *config.num_hidden_layers* layers. 
Each layer is a [`DeepseekV2DecoderLayerAuto`] + + Args: + config: DeepseekV2Config + """ + + def __init__(self, config: DeepseekV2Config): + super().__init__(config) + + self.config = config + self.padding_idx = config.pad_token_id + self.vocab_size = config.vocab_size + + # Recompute defaults to False and is controlled by Trainer + self.enable_recompute = False + self.recompute_granularity = config.recompute_granularity + self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else [] + + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx) + + self.layers = nn.LayerList( + [ + DeepseekV2DecoderLayerAuto(config, layer_idx, layer_idx not in self.no_recompute_layers) + for layer_idx in range(config.num_hidden_layers) + ] + ) + self.norm = DeepseekV2RMSNorm(config) + + self.enable_recompute = False + + def get_input_embeddings(self): + return self.embed_tokens + + def set_input_embeddings(self, value): + self.embed_tokens = value + + @staticmethod + def _prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values_length, dtype): + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + if len(attention_mask.shape) == 2: + expanded_attn_mask = _expand_2d_mask(attention_mask, dtype, tgt_length=input_shape[-1]) + # For decoding phase in generation, seq_length = 1, we don't need to add causal mask + if input_shape[-1] > 1: + combined_attention_mask = _make_causal_mask( + input_shape, + past_key_values_length=past_key_values_length, + ) + expanded_attn_mask = expanded_attn_mask & combined_attention_mask + # [bsz, seq_len, seq_len] -> [bsz, 1, seq_len, seq_len] + elif len(attention_mask.shape) == 3: + expanded_attn_mask = attention_mask.unsqueeze(1).astype("bool") + # if attention_mask is already 4-D, do nothing + else: + expanded_attn_mask = attention_mask + else: + expanded_attn_mask = _make_causal_mask( + input_shape, + past_key_values_length=past_key_values_length, + ) + # Convert bool attention_mask to float attention mask, which will be added to attention_scores later + if get_env_device() == "xpu": + x = paddle.to_tensor(0.0, dtype="float32") + y = paddle.to_tensor(-1.7005809656952787e38, dtype="float32") + expanded_attn_mask = paddle.where(expanded_attn_mask, x, y) + else: + expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), 0.0, paddle.finfo(dtype).min).astype( + dtype + ) + return expanded_attn_mask + + def forward( + self, + input_ids: paddle.Tensor = None, + position_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + use_cache: Optional[bool] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + attn_mask_startend_row_indices: Optional[Tensor] = None, + **kwargs, + ) -> Union[Tuple, BaseModelOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # retrieve input_ids and inputs_embeds + if input_ids is not None and inputs_embeds is not None: + raise 
ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + batch_size, seq_length = input_ids.shape[:2] + elif inputs_embeds is not None: + batch_size, seq_length = inputs_embeds.shape[:2] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + if self.enable_recompute and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`transformers." + ) + use_cache = False + + if past_key_values is None: + past_key_values = tuple([None] * len(self.layers)) + # NOTE: to make cache can be clear in-time + past_key_values = list(past_key_values) + + seq_length_with_past = seq_length + past_key_values_length = 0 + if past_key_values[0] is not None: + past_key_values_length = past_key_values[0][0].shape[1] + seq_length_with_past += past_key_values_length + + if position_ids is None: + position_ids = paddle.arange( + past_key_values_length, seq_length + past_key_values_length, dtype=paddle.int64 + ) + position_ids = position_ids.unsqueeze(0) + + if inputs_embeds is None: + # [bs, seq_len, dim] + inputs_embeds = self.embed_tokens(input_ids) + + # embed positions + if attn_mask_startend_row_indices is not None or get_use_casual_mask(): + attention_mask = None + else: + # [bs, seq_len] + attention_mask = ( + paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool) + if attention_mask is None + else attention_mask + ) + attention_mask = self._prepare_decoder_attention_mask( + attention_mask, (batch_size, seq_length), past_key_values_length, inputs_embeds.dtype + ) # [bs, 1, seq_len, seq_len] + if self.config.use_flash_attention: + attention_mask = None if is_casual_mask(attention_mask) else attention_mask + + # embed positions + hidden_states = inputs_embeds + + # decoder layers + all_hidden_states = () if output_hidden_states else None + all_self_attns = () if output_attentions else None + next_decoder_cache = () if use_cache else None + + for idx, (decoder_layer) in enumerate(self.layers): + if output_hidden_states: + all_hidden_states += (hidden_states,) + + past_key_value = past_key_values[idx] if past_key_values is not None else None + + layer_outputs = decoder_layer( + hidden_states=hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + output_attentions=output_attentions, + past_key_value=past_key_value, + use_cache=use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + ) + + # NOTE: clear outdate cache after it has been used for memory saving + past_key_value = past_key_values[idx] = None + if type(layer_outputs) is tuple: + hidden_states = layer_outputs[0] + else: + hidden_states = layer_outputs + + if use_cache: + next_decoder_cache += (layer_outputs[2 if output_attentions else 1],) + + if output_attentions: + all_self_attns += (layer_outputs[1],) + + hidden_states = self.norm(hidden_states) + + # add hidden states from the last decoder layer + if output_hidden_states: + all_hidden_states += (hidden_states,) + + next_cache = next_decoder_cache if use_cache else None + + if not return_dict: + return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None) + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=next_cache, + hidden_states=all_hidden_states, + attentions=all_self_attns, + ) + + +class DeepseekV2LMHeadAuto(nn.Layer): + def __init__(self, config: DeepseekV2Config): + 
super(DeepseekV2LMHeadAuto, self).__init__() + + self.config = config + + self.weight = self.create_parameter( + shape=[config.hidden_size, config.vocab_size], + dtype=paddle.get_default_dtype(), + default_initializer=nn.initializer.XavierNormal(1.0), + ) + + def forward(self, hidden_states, tensor_parallel_output=None): + if tensor_parallel_output is None: + tensor_parallel_output = self.config.tensor_parallel_output + logits = paddle.matmul(hidden_states, self.weight) + return logits + + +class DeepseekV2ForCausalLMAuto(DeepseekV2PretrainedModelAuto): + _tied_weights_keys = ["lm_head.weight"] + + def __init__(self, config: DeepseekV2Config): + super().__init__(config) + self.config = config + self.deepseek_v2 = DeepseekV2ModelAuto(config) + self.vocab_size = config.vocab_size + self.lm_head = DeepseekV2LMHeadAuto(config) + self.criterion = DeepseekV2PretrainingCriterion(config) + + def get_input_embeddings(self): + return self.deepseek_v2.embed_tokens + + def set_input_embeddings(self, value): + self.deepseek_v2.embed_tokens = value + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def set_decoder(self, decoder): + self.deepseek_v2 = decoder + + def get_decoder(self): + return self.deepseek_v2 + + def forward( + self, + input_ids: paddle.Tensor = None, + position_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + use_cache: Optional[bool] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + attn_mask_startend_row_indices=None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + r""" + Args: + labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. Indices should either be in `[0, transformers., + config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored + (masked), the loss is only computed for the tokens with labels in `[0, transformers., config.vocab_size]`. + + Returns: + + Example: + + ```python + >>> from transformers import AutoTokenizer, DeepseekV2ForCausalLMAuto + + >>> model = DeepseekV2ForCausalLMAuto.from_pretrained(PATH_TO_CONVERTED_WEIGHTS) + >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER) + + >>> prompt = "Hey, are you conscious? Can you talk to me?" + >>> inputs = tokenizer(prompt, return_tensors="pt") + + >>> # Generate + >>> generate_ids = model.generate(inputs.input_ids, max_length=30) + >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." + ```""" + input_ids.stop_gradient = True + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if attn_mask_startend_row_indices is not None and attention_mask is not None: + logger.warning( + "You have provided both attn_mask_startend_row_indices and attention_mask. 
" + "The attn_mask_startend_row_indices will be used." + ) + attention_mask = None + + # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) + outputs = self.deepseek_v2( + input_ids=input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + past_key_values=past_key_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + ) + + hidden_states = outputs[0] + + # if labels is None,means we need full output, instead of tensor_parallel_output + # tensor_parallel_output is together with ParallelCrossEntropy + tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1 + + logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + + return logits + + def prepare_inputs_for_generation( + self, input_ids, use_cache=False, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs + ): + batch_size, seq_length = input_ids.shape + position_ids = kwargs.get("position_ids", paddle.arange(seq_length).expand((batch_size, seq_length))) + if past_key_values: + input_ids = input_ids[:, -1].unsqueeze(axis=-1) + position_ids = position_ids[:, -1].unsqueeze(-1) + + # if `inputs_embeds` are passed, we only want to use them in the 1st generation step + if inputs_embeds is not None and past_key_values is None: + model_inputs = {"inputs_embeds": inputs_embeds} + else: + model_inputs = {"input_ids": input_ids} + + model_inputs.update( + { + "position_ids": position_ids, + "past_key_values": past_key_values, + "use_cache": use_cache, + "attention_mask": attention_mask, + } + ) + return model_inputs + + def _get_model_inputs_spec(self, dtype: str): + return { + "input_ids": paddle.static.InputSpec(shape=[None, None], dtype="int64"), + "attention_mask": paddle.static.InputSpec(shape=[None, None], dtype="int64"), + "position_ids": paddle.static.InputSpec(shape=[None, None], dtype="int64"), + } + + @staticmethod + def update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False): + # update cache + if isinstance(outputs, tuple) and len(outputs) > 1 and not isinstance(outputs[1], paddle.Tensor): + model_kwargs["past_key_values"] = outputs[1] + + if isinstance(outputs, CausalLMOutputWithPast) and "past_key_values" in outputs: + model_kwargs["past_key_values"] = outputs.past_key_values + + # update position_ids + if "position_ids" in model_kwargs and model_kwargs["position_ids"] is not None: + position_ids = model_kwargs["position_ids"] + model_kwargs["position_ids"] = paddle.concat([position_ids, position_ids[..., -1:] + 1], axis=-1) + + if not is_encoder_decoder and "attention_mask" in model_kwargs: + # TODO: support attention mask for other models + attention_mask = model_kwargs["attention_mask"] + if len(attention_mask.shape) == 2: + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype=attention_mask.dtype)], + axis=-1, + ) + elif len(attention_mask.shape) == 4: + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([*attention_mask.shape[:3], 1], dtype=attention_mask.dtype)], + axis=-1, + )[:, :, -1:, :] + + return model_kwargs + + @staticmethod + def _reorder_cache(past_key_values, beam_idx): + reordered_past = () + for layer_past in past_key_values: + reordered_past += (tuple(past_state.index_select(0, 
beam_idx) for past_state in layer_past),) + return reordered_past + + def auto_dist_config(self, prefix=""): + if prefix != "": + assert prefix.endswith(".") + config = { + "dp_config": {"sharding_level": 1, "offload": False, "exclude_layer": None}, + "mp_config": { + "parallelize_plan": { + f"{prefix}deepseek_v2.embed_tokens": dist.ColWiseParallel(gather_output=True), + f"{prefix}deepseek_v2.layers.*.self_attn.q_b_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v2.layers.*.self_attn.q_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v2.layers.*.self_attn.kv_b_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v2.layers.*.self_attn.o_proj": dist.RowWiseParallel(), + f"{prefix}deepseek_v2.layers.*.mlp.gate_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v2.layers.*.mlp.up_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v2.layers.*.mlp.down_proj": dist.RowWiseParallel(), + f"{prefix}deepseek_v2.layers.*.mlp.shared_experts.gate_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v2.layers.*.mlp.shared_experts.up_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v2.layers.*.mlp.shared_experts.down_proj": dist.RowWiseParallel(), + f"{prefix}lm_head.weight": dist.ColWiseParallel(), + } + }, + } + return config diff --git a/paddlenlp/transformers/deepseek_v2/modeling_pp.py b/paddlenlp/transformers/deepseek_v2/modeling_pp.py new file mode 100644 index 000000000000..d6eec969926e --- /dev/null +++ b/paddlenlp/transformers/deepseek_v2/modeling_pp.py @@ -0,0 +1,358 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
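In the `auto_dist_config` parallelize plan above, projections that expand the hidden state (`q_proj`/`q_b_proj`, `kv_b_proj`, `gate_proj`, `up_proj`, the embedding and `lm_head`) are marked `ColWiseParallel`, while the projections that reduce it back (`o_proj`, `down_proj`) are `RowWiseParallel`. The sketch below simulates a 2-way split on a single process with `paddle.split` (no distributed launch, no Paddle `dist` API) just to show why the two placements compose back to the full matmul: column shards are concatenated, row shards are summed, which is what the all-reduce after a row-parallel layer does in practice.

```python
import paddle

hidden, out = 8, 12
x = paddle.randn([4, hidden])
w = paddle.randn([hidden, out])

# ColWiseParallel: each rank holds a slice of the output columns; outputs are concatenated.
w0, w1 = paddle.split(w, 2, axis=1)
col_parallel = paddle.concat([x @ w0, x @ w1], axis=-1)

# RowWiseParallel: each rank holds a slice of the input rows; partial results are summed.
x0, x1 = paddle.split(x, 2, axis=1)
r0, r1 = paddle.split(w, 2, axis=0)
row_parallel = x0 @ r0 + x1 @ r1

assert paddle.allclose(col_parallel, x @ w, atol=1e-5)
assert paddle.allclose(row_parallel, x @ w, atol=1e-5)
```

Pairing a column-parallel layer with the following row-parallel layer is what lets the attention and MLP blocks avoid an extra communication step in between.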
+ + +from typing import OrderedDict + +import paddle +import paddle.distributed.fleet as fleet +import paddle.nn as nn +from paddle.distributed.fleet.meta_parallel import ( + LayerDesc, + PipelineLayer, + SharedLayerDesc, +) +from paddle.distributed.fleet.recompute.recompute import recompute + +from ...utils.tools import get_env_device +from ..model_utils import PipelinePretrainedModel +from .modeling import ( + DeepseekV2Config, + DeepseekV2DecoderLayer, + DeepseekV2LMHead, + DeepseekV2Model, + DeepseekV2PretrainedModel, + DeepseekV2PretrainingCriterion, + DeepseekV2RMSNorm, +) + +__all__ = [ + "DeepseekV2ForCausalLMPipe", +] + + +def parse_args(args): + if isinstance(args, tuple): + if len(args) == 4: + hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = args + elif len(args) == 3: + hidden_states, attention_mask, attn_mask_startend_row_indices = args + position_ids = None + elif len(args) == 2: + hidden_states, attention_mask = args + attn_mask_startend_row_indices, position_ids = None, None + else: + hidden_states = args + attention_mask, attn_mask_startend_row_indices, position_ids = None, None, None + + if position_ids is not None: + position_ids.stop_gradient = True + + if attention_mask is not None: + attention_mask.stop_gradient = True + + if attn_mask_startend_row_indices is not None: + attn_mask_startend_row_indices.stop_gradient = True + + return hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids + + +def return_args(hidden_states, attention_mask=None, attn_mask_startend_row_indices=None, position_ids=None): + ret = (hidden_states,) + + if attention_mask is not None: + ret += (attention_mask.clone(),) + if attn_mask_startend_row_indices is not None: + ret += (attn_mask_startend_row_indices.clone(),) + if position_ids is not None: + ret += (position_ids.clone(),) + if len(ret) == 1: + ret = ret[0] + + return ret + + +def get_attr(layer, name): + if getattr(layer, name, None) is not None: + return getattr(layer, name, None) + else: + return get_attr(layer._layer, name) + + +class DeepseekV2EmbeddingPipe(nn.Layer): + def __init__(self, config: DeepseekV2Config): + super(DeepseekV2EmbeddingPipe, self).__init__() + self.config = config + self.sequence_parallel = config.sequence_parallel + self.hidden_size = config.hidden_size + if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0: + self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding( + config.vocab_size, + config.hidden_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()), + ) + else: + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size) + + @property + def embedding_weight(self): + return get_attr(self.embed_tokens, "weight") + + def forward(self, args): + """_summary_ + + Args: + input (_type_): _description_ + + Returns: + _type_: _description_ + """ + input_ids, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args) + input_embeds = self.embed_tokens(input_ids) + if self.config.sequence_parallel: + from paddlenlp.transformers import ScatterOp + + # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim] + bs, seq_len, hidden_size = input_embeds.shape + input_embeds = paddle.reshape_(input_embeds, [bs * seq_len, hidden_size]) + # [seq_len * bs / n, num_head * head_dim] (n is mp parallelism) + input_embeds = ScatterOp.apply(input_embeds) + + batch_size, seq_length = input_ids.shape + + if attention_mask is not None: + assert ( + 
attn_mask_startend_row_indices is None + ), "attention_mask and attn_mask_startend_row_indices can not be set at same time" + + attention_mask = DeepseekV2Model._prepare_decoder_attention_mask( + attention_mask, (batch_size, seq_length), 0, input_embeds.dtype + ) + attention_mask.stop_gradient = True + if get_env_device() == "npu": + attention_mask = attention_mask.astype("bool") + elif get_env_device() == "npu": + attention_mask = paddle.tril(paddle.ones((seq_length, seq_length), dtype="bool")) + attention_mask.stop_gradient = True + + return return_args(input_embeds, attention_mask, attn_mask_startend_row_indices, position_ids) + + +class DeepseekV2DecoderLayerPipe(DeepseekV2DecoderLayer): + def forward(self, args): + hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args) + + has_gradient = not hidden_states.stop_gradient + + if attention_mask is not None and attention_mask.dtype == paddle.int32: + attention_mask, attn_mask_startend_row_indices, position_ids = ( + None, + attention_mask, + attn_mask_startend_row_indices, + ) + elif attention_mask is not None and attention_mask.dtype == paddle.int64: + attention_mask, attn_mask_startend_row_indices, position_ids = None, None, attention_mask + elif attn_mask_startend_row_indices is not None and attn_mask_startend_row_indices.dtype == paddle.int64: + attn_mask_startend_row_indices, position_ids = None, attn_mask_startend_row_indices + + if self.enable_recompute and self.config.recompute_granularity == "full" and has_gradient: + if attention_mask is not None or attn_mask_startend_row_indices is not None: + hidden_states = recompute( + super().forward, + hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + use_reentrant=False, + ) + else: + # for pretrain + hidden_states = recompute( + super().forward, + hidden_states, + position_ids=position_ids, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + use_reentrant=self.config.recompute_use_reentrant, + ) + else: + hidden_states = super().forward( + hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + ) + + return return_args(hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids) + + +class DeepseekV2RMSNormPipe(nn.Layer): + def __init__(self, config): + super().__init__() + self.norm = DeepseekV2RMSNorm(config) + + def forward(self, args): + hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args) + return self.norm(hidden_states) + + +class DeepseekV2LMHeadPipe(DeepseekV2LMHead): + def __init__(self, config): + super(DeepseekV2LMHeadPipe, self).__init__(config) + + @property + def embedding_weight(self): + return get_attr(self, "weight") + + +class DeepseekV2ForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): + """DeepseekV2ForPretraining adapted for pipeline parallelism. + + The largest change is flattening the DeepseekV2Model class so we can express it as a + sequence of layers including embedding, transformer layers, and output. 
+ """ + + config_class = DeepseekV2Config + _base_model = DeepseekV2PretrainedModel + _get_tensor_parallel_mappings = DeepseekV2PretrainedModel._get_tensor_parallel_mappings + _init_weights = DeepseekV2PretrainedModel._init_weights + _keys_to_ignore_on_load_unexpected = DeepseekV2PretrainedModel._keys_to_ignore_on_load_unexpected + _get_model_flops = DeepseekV2PretrainedModel._get_model_flops + _get_hardware_flops = DeepseekV2PretrainedModel._get_hardware_flops + + _tied_weights_keys = ["lm_head.weight"] + + # DONOT Add base_model_prefix !!!! + + @classmethod + def _prepare_pipeline_inputs_func(cls, inputs): + first_stage_keys = ["input_ids", "attention_mask", "attn_mask_startend_row_indices", "position_ids"] + last_stage_keys = ["labels"] + + def get_expected_keys(inputs, keys): + ret = tuple([inputs.pop(k) if k in inputs else None for k in keys]) + if len(ret) == 1: + ret = ret[0] + return ret + + if type(inputs) is dict or type(inputs) is OrderedDict: + return [ + get_expected_keys(inputs, first_stage_keys), + get_expected_keys(inputs, last_stage_keys), + ] + + keys = list(inputs[0].keys()) + inputs_batch = {key: [data.pop(key) for data in inputs] for key in keys} + return [ + get_expected_keys(inputs_batch, first_stage_keys), + get_expected_keys(inputs_batch, last_stage_keys), + ] + + def __init__(self, config: DeepseekV2Config): + self.config = config + + # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True + # Enable_recompute defaults to False and is controlled by Trainer + self.enable_recompute = False + self.recompute_granularity = self.config.recompute_granularity + self.pp_recompute_interval = self.config.pp_recompute_interval + self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else [] + if self.recompute_granularity == "full": + assert len(self.no_recompute_layers) == 0, "for pp with full recompute, no_recompute_layers is not support" + + virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1) + + def get_hcg(): + return fleet.get_hybrid_communicate_group() + + hcg = get_hcg() + tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1) + tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0) + + # TODO: fix tensor_parallel_degree rewrite in here + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + if config.tie_word_embeddings: + self.add_sequential_layer( + SharedLayerDesc( + "DeepseekV2_shared_weight", + DeepseekV2EmbeddingPipe, + shared_weight_attr="embedding_weight", + config=config, + ), + self._base_model.base_model_prefix, + ) + else: + self.add_sequential_layer( + LayerDesc(DeepseekV2EmbeddingPipe, config=config), self._base_model.base_model_prefix + ) + + for i in range(config.num_hidden_layers): + self.add_sequential_layer( + LayerDesc( + DeepseekV2DecoderLayerPipe, + config=config, + layer_idx=i, + layerwise_recompute=i not in self.no_recompute_layers, + ), + f"{self._base_model.base_model_prefix}.layers.{i}", + ) + self.add_sequential_layer(LayerDesc(DeepseekV2RMSNormPipe, config=config), self._base_model.base_model_prefix) + + if config.tie_word_embeddings: + self.add_sequential_layer( + SharedLayerDesc( + "DeepseekV2_shared_weight", + DeepseekV2LMHeadPipe, + shared_weight_attr="embedding_weight", + config=config, + **{"transpose_y": True}, + ), + "lm_head", + ) + else: + self.add_sequential_layer(LayerDesc(DeepseekV2LMHeadPipe, config=config), "lm_head") + + 
recompute_interval = 0 + if self.enable_recompute and self.recompute_granularity == "full": + assert self.config.pp_recompute_interval <= config.num_hidden_layers // ( + virtual_pp_degree * get_hcg().topology().get_dim_size("pipe") + ), "pp recompute interval should smaller than num layers of each pp chunk" + recompute_interval = self.config.pp_recompute_interval + + seg_method = "layer:DeepseekV2DecoderLayer" + if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0: + seg_method = "uniform" + + PipelineLayer.__init__( + self, + layers=self.get_sequential_layers(), + loss_fn=self.get_loss_fn(config), + topology=get_hcg().topology(), + seg_method=seg_method, + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": get_hcg().get_model_parallel_group(), + "offload": False, + "partition": False, + }, + num_virtual_pipeline_stages=virtual_pp_degree, + ) + # You should call init here, since there is a diamond inheritance problem + self.apply(self._init_weights) + # DON'T init PipelinePretrainedModel + # PipelinePretrainedModel.__init__(self.super(), config=config) + + def get_loss_fn(self, config): + return DeepseekV2PretrainingCriterion(config) diff --git a/paddlenlp/transformers/deepseek_v2/tokenizer_fast.py b/paddlenlp/transformers/deepseek_v2/tokenizer_fast.py index b754699c48e9..5bfc8019ef10 100644 --- a/paddlenlp/transformers/deepseek_v2/tokenizer_fast.py +++ b/paddlenlp/transformers/deepseek_v2/tokenizer_fast.py @@ -15,6 +15,10 @@ from ..llama import LlamaTokenizerFast +__all__ = [ + "DeepseekTokenizerFast", +] + class DeepseekTokenizerFast(LlamaTokenizerFast): def convert_ids_to_tokens( diff --git a/paddlenlp/transformers/deepseek_v3/__init__.py b/paddlenlp/transformers/deepseek_v3/__init__.py new file mode 100644 index 000000000000..a9e40981dc0d --- /dev/null +++ b/paddlenlp/transformers/deepseek_v3/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .configuration import * +from .modeling import * +from .modeling_auto import * +from .modeling_pp import * diff --git a/paddlenlp/transformers/deepseek_v3/configuration.py b/paddlenlp/transformers/deepseek_v3/configuration.py new file mode 100644 index 000000000000..35fc0767a996 --- /dev/null +++ b/paddlenlp/transformers/deepseek_v3/configuration.py @@ -0,0 +1,33 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2023 Mistral AI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" DeepSeekV3 model configuration""" +from ..deepseek_v2.configuration import DeepseekV2Config + +__all__ = [ + "DeepseekV3Config", +] + + +class DeepseekV3Config(DeepseekV2Config): + model_type = "deepseek_v3" + keys_to_ignore_at_inference = ["past_key_values"] + + def __init__( + self, + **kwargs, + ): + super().__init__( + **kwargs, + ) diff --git a/paddlenlp/transformers/deepseek_v3/modeling.py b/paddlenlp/transformers/deepseek_v3/modeling.py new file mode 100644 index 000000000000..8008aa2ce68d --- /dev/null +++ b/paddlenlp/transformers/deepseek_v3/modeling.py @@ -0,0 +1,167 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2023 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved. +# +# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX +# and OPT implementations in this library. It has been modified from its +# original forms to accommodate minor architectural differences compared +# to GPT-NeoX and OPT used by the Meta AI team that trained the model. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Paddle DeepSeek model.""" + +from __future__ import annotations + +from typing import List, Optional, Tuple, Union + +import paddle + +from ..deepseek_v2.modeling import ( + DeepseekV2ForSequenceClassification, + DeepseekV2LMHead, + DeepseekV2Model, + DeepseekV2PretrainedModel, + DeepseekV2PretrainingCriterion, +) +from ..model_outputs import CausalLMOutputWithPast +from ..model_utils import register_base_model +from .configuration import DeepseekV3Config + +__all__ = [ + "DeepseekV3ForCausalLM", + "DeepseekV3ForSequenceClassification", + "DeepseekV3Model", + "DeepseekV3PretrainedModel", +] + + +class DeepseekV3PretrainedModel(DeepseekV2PretrainedModel): + config_class = DeepseekV3Config + base_model_prefix = "deepseek_v3" + _no_split_modules = ["DeepseekV2DecoderLayer"] + + +@register_base_model +class DeepseekV3Model(DeepseekV2Model): + def __init__(self, config: DeepseekV3Config): + super().__init__(config) + + +class DeepseekV3ForCausalLM(DeepseekV3PretrainedModel): + _tied_weights_keys = ["lm_head.weight"] + + def __init__(self, config: DeepseekV3Config): + super().__init__(config) + self.deepseek_v3 = DeepseekV3Model(config) + self.vocab_size = config.vocab_size + self.lm_head = DeepseekV2LMHead(config) + self.criterion = DeepseekV2PretrainingCriterion(config) + + def get_input_embeddings(self): + return self.deepseek_v3.embed_tokens + + def set_input_embeddings(self, value): + self.deepseek_v3.embed_tokens = value + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def set_decoder(self, decoder): + self.deepseek_v3 = decoder + + def get_decoder(self): + return self.deepseek_v3 + + def forward( + self, + input_ids: paddle.Tensor = None, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] 
= None, + past_key_values: Optional[List[paddle.Tensor]] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + r""" + Args: + labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. Indices should either be in `[0, transformers., + config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored + (masked), the loss is only computed for the tokens with labels in `[0, transformers., config.vocab_size]`. + + Returns: + + Example: + + ```python + >>> from transformers import AutoTokenizer, DeepseekV3ForCausalLM + + >>> model = DeepseekV3ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS) + >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER) + + >>> prompt = "Hey, are you conscious? Can you talk to me?" + >>> inputs = tokenizer(prompt, return_tensors="pt") + + >>> # Generate + >>> generate_ids = model.generate(inputs.input_ids, max_length=30) + >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." + ```""" + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) + outputs = self.deepseek_v3( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + hidden_states = outputs[0] + logits = self.lm_head(hidden_states) + + loss = None + # TODO@DrownFish19: shift labels + if labels is not None: + loss = self.criterion(logits, labels) + + if not return_dict: + output = (logits,) + outputs[1:] + return (loss,) + output if loss is not None else output + + return CausalLMOutputWithPast( + loss=loss, + logits=logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +class DeepseekV3ForSequenceClassification(DeepseekV2ForSequenceClassification): + def __init__(self, config): + super().__init__(config) diff --git a/paddlenlp/transformers/deepseek_v3/modeling_auto.py b/paddlenlp/transformers/deepseek_v3/modeling_auto.py new file mode 100644 index 000000000000..1dff442f0f88 --- /dev/null +++ b/paddlenlp/transformers/deepseek_v3/modeling_auto.py @@ -0,0 +1,204 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2023 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved. +# +# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX +# and OPT implementations in this library. It has been modified from its +# original forms to accommodate minor architectural differences compared +# to GPT-NeoX and OPT used by the Meta AI team that trained the model. 
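The docstring above describes `labels` in `[0, vocab_size]` with `-100` marking ignored positions, and the `TODO` notes that label shifting still needs to be wired in before `self.criterion` is applied. The sketch below shows the usual next-token loss with that ignore convention; shifting inside the function is an assumption here, since `DeepseekV2PretrainingCriterion` may instead expect pre-shifted labels from the data pipeline.

```python
import paddle
import paddle.nn.functional as F


def causal_lm_loss(logits, labels, ignore_index=-100):
    """Next-token cross entropy: position t predicts token t+1; -100 labels are ignored."""
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape([-1, shift_logits.shape[-1]]),
        shift_labels.reshape([-1]),
        ignore_index=ignore_index,
    )


logits = paddle.randn([2, 8, 32])          # [batch, seq, vocab]
labels = paddle.randint(0, 32, [2, 8])
labels[:, :3] = -100                        # e.g. prompt tokens contribute nothing to the loss
print(causal_lm_loss(logits, labels))
```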
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Paddle DeepSeek_V3 model.""" + +from __future__ import annotations + +from typing import List, Optional, Tuple, Union + +import paddle + +try: + from paddle.incubate.nn.functional import fused_rotary_position_embedding +except ImportError: + fused_rotary_position_embedding = None + +try: + from paddle.nn.functional.flash_attention import flash_attention +except: + flash_attention = None + +import paddle.distributed as dist + +from ...utils.log import logger +from ..deepseek_v2.modeling_auto import ( + DeepseekV2LMHeadAuto, + DeepseekV2ModelAuto, + DeepseekV2PretrainedModelAuto, + DeepseekV2PretrainingCriterion, +) +from ..model_outputs import CausalLMOutputWithPast +from ..model_utils import register_base_model +from .configuration import DeepseekV2Config + +__all__ = [ + "DeepseekV3LMHeadAuto", + "DeepseekV3ForCausalLMAuto", + "DeepseekV3ModelAuto", + "DeepseekV3PretrainedModelAuto", +] + + +class DeepseekV3PretrainedModelAuto(DeepseekV2PretrainedModelAuto): + config_class = DeepseekV2Config + base_model_prefix = "deepseek_v3" + _no_split_modules = ["DeepseekV2DecoderLayerAuto"] + + +@register_base_model +class DeepseekV3ModelAuto(DeepseekV2ModelAuto): + def __init__(self, config: DeepseekV2Config): + super().__init__(config) + + +class DeepseekV3LMHeadAuto(DeepseekV2LMHeadAuto): + def __init__(self, config: DeepseekV2Config): + super().__init__(config) + + +class DeepseekV3ForCausalLMAuto(DeepseekV3PretrainedModelAuto): + _tied_weights_keys = ["lm_head.weight"] + + def __init__(self, config: DeepseekV2Config): + super().__init__(config) + self.config = config + self.deepseek_v3 = DeepseekV3ModelAuto(config) + self.vocab_size = config.vocab_size + self.lm_head = DeepseekV3LMHeadAuto(config) + self.criterion = DeepseekV2PretrainingCriterion(config) + + def get_input_embeddings(self): + return self.deepseek_v3.embed_tokens + + def set_input_embeddings(self, value): + self.deepseek_v3.embed_tokens = value + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def set_decoder(self, decoder): + self.deepseek_v3 = decoder + + def get_decoder(self): + return self.deepseek_v3 + + def forward( + self, + input_ids: paddle.Tensor = None, + position_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + use_cache: Optional[bool] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + attn_mask_startend_row_indices=None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + r""" + Args: + labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. 
Indices should either be in `[0, transformers., + config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored + (masked), the loss is only computed for the tokens with labels in `[0, transformers., config.vocab_size]`. + + Returns: + + Example: + + ```python + >>> from transformers import AutoTokenizer, DeepseekV3ForCausalLMAuto + + >>> model = DeepseekV3ForCausalLMAuto.from_pretrained(PATH_TO_CONVERTED_WEIGHTS) + >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER) + + >>> prompt = "Hey, are you conscious? Can you talk to me?" + >>> inputs = tokenizer(prompt, return_tensors="pt") + + >>> # Generate + >>> generate_ids = model.generate(inputs.input_ids, max_length=30) + >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." + ```""" + input_ids.stop_gradient = True + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if attn_mask_startend_row_indices is not None and attention_mask is not None: + logger.warning( + "You have provided both attn_mask_startend_row_indices and attention_mask. " + "The attn_mask_startend_row_indices will be used." + ) + attention_mask = None + + # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) + outputs = self.deepseek_v3( + input_ids=input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + past_key_values=past_key_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + ) + + hidden_states = outputs[0] + + # if labels is None,means we need full output, instead of tensor_parallel_output + # tensor_parallel_output is together with ParallelCrossEntropy + tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1 + + logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + + return logits + + def auto_dist_config(self, prefix=""): + if prefix != "": + assert prefix.endswith(".") + config = { + "dp_config": {"sharding_level": 1, "offload": False, "exclude_layer": None}, + "mp_config": { + "parallelize_plan": { + f"{prefix}deepseek_v3.embed_tokens": dist.ColWiseParallel(gather_output=True), + f"{prefix}deepseek_v3.layers.*.self_attn.q_b_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v3.layers.*.self_attn.q_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v3.layers.*.self_attn.kv_b_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v3.layers.*.self_attn.o_proj": dist.RowWiseParallel(), + f"{prefix}deepseek_v3.layers.*.mlp.gate_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v3.layers.*.mlp.up_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v3.layers.*.mlp.down_proj": dist.RowWiseParallel(), + f"{prefix}deepseek_v3.layers.*.mlp.shared_experts.gate_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v3.layers.*.mlp.shared_experts.up_proj": dist.ColWiseParallel(), + f"{prefix}deepseek_v3.layers.*.mlp.shared_experts.down_proj": dist.RowWiseParallel(), + f"{prefix}lm_head.weight": 
dist.ColWiseParallel(), + } + }, + } + return config diff --git a/paddlenlp/transformers/deepseek_v3/modeling_pp.py b/paddlenlp/transformers/deepseek_v3/modeling_pp.py new file mode 100644 index 000000000000..e48a7dabc2d6 --- /dev/null +++ b/paddlenlp/transformers/deepseek_v3/modeling_pp.py @@ -0,0 +1,41 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from ..deepseek_v2.modeling_pp import DeepseekV2ForCausalLMPipe +from .configuration import DeepseekV3Config +from .modeling import DeepseekV3PretrainedModel + +__all__ = [ + "DeepseekV3ForCausalLMPipe", +] + + +class DeepseekV3ForCausalLMPipe(DeepseekV2ForCausalLMPipe): + """DeepseekV2ForPretraining adapted for pipeline parallelism. + + The largest change is flattening the DeepseekV2Model class so we can express it as a + sequence of layers including embedding, transformer layers, and output. + """ + + config_class = DeepseekV3Config + _base_model = DeepseekV3PretrainedModel + _get_tensor_parallel_mappings = DeepseekV3PretrainedModel._get_tensor_parallel_mappings + _init_weights = DeepseekV3PretrainedModel._init_weights + _keys_to_ignore_on_load_unexpected = DeepseekV3PretrainedModel._keys_to_ignore_on_load_unexpected + _get_model_flops = DeepseekV3PretrainedModel._get_model_flops + _get_hardware_flops = DeepseekV3PretrainedModel._get_hardware_flops + _tied_weights_keys = ["lm_head.weight"] + + # DONOT Add base_model_prefix !!!! diff --git a/paddlenlp/transformers/ernie_gen/modeling.py b/paddlenlp/transformers/ernie_gen/modeling.py index 3a0a2f5fa3f4..45029813bfb6 100644 --- a/paddlenlp/transformers/ernie_gen/modeling.py +++ b/paddlenlp/transformers/ernie_gen/modeling.py @@ -21,16 +21,13 @@ from paddle import nn from paddle.nn import functional as F -from paddlenlp.transformers import ( - BertPretrainedModel, - ElectraPretrainedModel, - ErniePretrainedModel, - RobertaPretrainedModel, -) -from paddlenlp.utils.download import resolve_file_path -from paddlenlp.utils.log import logger - +from ...utils.download import resolve_file_path +from ...utils.log import logger from .. import PretrainedModel, register_base_model +from ..bert.modeling import BertPretrainedModel +from ..electra.modeling import ElectraPretrainedModel +from ..ernie.modeling import ErniePretrainedModel +from ..roberta.modeling import RobertaPretrainedModel from ..utils import InitTrackerMeta, fn_args_to_dict __all__ = ["ErnieGenPretrainedModel", "ErnieForGeneration", "ErnieGenModel"] diff --git a/paddlenlp/transformers/gemma/modeling.py b/paddlenlp/transformers/gemma/modeling.py index 1aa75ece7a21..e7cfb6fa6856 100644 --- a/paddlenlp/transformers/gemma/modeling.py +++ b/paddlenlp/transformers/gemma/modeling.py @@ -55,7 +55,7 @@ from .. 
import linear_utils from ..linear_utils import Linear from ..segment_parallel_utils import ReshardLayer -from ..utils import caculate_llm_flops +from ..utils import caculate_llm_per_token_flops from .configuration import ( GEMMA_PRETRAINED_INIT_CONFIGURATION, GEMMA_PRETRAINED_RESOURCE_FILES_MAP, @@ -898,6 +898,37 @@ class GemmaPretrainedModel(PretrainedModel): _keys_to_ignore_on_load_unexpected = [] _keep_in_fp32_modules = ["inv_freq", "rotary_emb", "cos_cached", "sin_cached"] + def _get_model_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=False, + ) + + def _get_hardware_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=self.config.recompute, + recompute_granularity=self.config.recompute_granularity, + ) + @classmethod def _get_name_mappings(cls, config: GemmaConfig) -> List[StateDictNameMapping]: mappings: list[StateDictNameMapping] = [] @@ -1075,39 +1106,6 @@ def __init__(self, config: GemmaConfig): self.gradient_checkpointing = False - def get_model_flops(self, batch_size=1, seq_length=None, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=False, - ) - - def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=recompute, - recompute_granularity=self.config.recompute_granularity, - ) - def get_input_embeddings(self): return self.embed_tokens @@ -1560,11 +1558,30 @@ def forward( # tensor_parallel_output is togather with ParallelCrossEntropy tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1 - logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + if labels is not None and self.config.use_fused_linear_cross_entropy: + from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy + + assert ( + self.config.tensor_parallel_degree <= 1 + ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel " + + masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels) + + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) 
+ else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count + logits = None + else: + logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) - loss = None - if labels is not None: - loss = self.criterion(logits, labels) + loss = None + if labels is not None: + loss = self.criterion(logits, labels) if not return_dict: output = (logits,) + outputs[1:] diff --git a/paddlenlp/transformers/gemma/modeling_pp.py b/paddlenlp/transformers/gemma/modeling_pp.py index 8839248a28c4..66f4a2c200ec 100644 --- a/paddlenlp/transformers/gemma/modeling_pp.py +++ b/paddlenlp/transformers/gemma/modeling_pp.py @@ -237,6 +237,8 @@ class GemmaForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): _get_tensor_parallel_mappings = GemmaPretrainedModel._get_tensor_parallel_mappings _init_weights = GemmaPretrainedModel._init_weights _keys_to_ignore_on_load_unexpected = GemmaPretrainedModel._keys_to_ignore_on_load_unexpected + _get_model_flops = GemmaPretrainedModel._get_model_flops + _get_hardware_flops = GemmaPretrainedModel._get_hardware_flops # DONOT Add base_model_prefix !!!! diff --git a/paddlenlp/transformers/glm/tokenizer.py b/paddlenlp/transformers/glm/tokenizer.py index 6f535ac264b4..e9e314091730 100644 --- a/paddlenlp/transformers/glm/tokenizer.py +++ b/paddlenlp/transformers/glm/tokenizer.py @@ -23,7 +23,8 @@ from scipy.linalg import block_diag from ...utils.log import logger -from .. import BertTokenizer, GPTTokenizer +from ..bert.tokenizer import BertTokenizer +from ..gpt.tokenizer import GPTTokenizer from ..tokenizer_utils import PretrainedTokenizer from ..tokenizer_utils_base import BatchEncoding diff --git a/paddlenlp/transformers/gpt/modeling.py b/paddlenlp/transformers/gpt/modeling.py index 1a5bbd3ed698..16cf34aaa24d 100644 --- a/paddlenlp/transformers/gpt/modeling.py +++ b/paddlenlp/transformers/gpt/modeling.py @@ -53,7 +53,7 @@ TokenClassifierOutput, ) from ..model_utils import dy2st_nocheck_guard_context -from ..utils import caculate_llm_flops +from ..utils import caculate_llm_per_token_flops from .configuration import ( GPT_PRETRAINED_INIT_CONFIGURATION, GPT_PRETRAINED_RESOURCE_FILES_MAP, @@ -805,6 +805,37 @@ class GPTPretrainedModel(PretrainedModel): pretrained_init_configuration = GPT_PRETRAINED_INIT_CONFIGURATION pretrained_resource_files_map = GPT_PRETRAINED_RESOURCE_FILES_MAP + def _get_model_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=False, + ) + + def _get_hardware_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=self.config.recompute, + recompute_granularity=self.config.recompute_granularity, + ) + @classmethod def _get_tensor_parallel_mappings(cls, config, is_split=True): @@ -1106,39 +1137,6 @@ def __init__(self, config: GPTConfig): decoder_layers, ) - def get_model_flops(self, batch_size=1, seq_length=None, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = 
self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=False, - ) - - def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=recompute, - recompute_granularity=self.config.recompute_granularity, - ) - def get_input_embeddings(self): return self.embeddings.word_embeddings diff --git a/paddlenlp/transformers/gpt/modeling_pp.py b/paddlenlp/transformers/gpt/modeling_pp.py index 7734e8a990ed..02ee09151b85 100644 --- a/paddlenlp/transformers/gpt/modeling_pp.py +++ b/paddlenlp/transformers/gpt/modeling_pp.py @@ -167,6 +167,9 @@ class GPTForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): pretrained_init_configuration = GPTPretrainedModel.pretrained_init_configuration pretrained_resource_files_map = GPTPretrainedModel.pretrained_resource_files_map + _get_model_flops = GPTPretrainedModel._get_model_flops + _get_hardware_flops = GPTPretrainedModel._get_hardware_flops + # NO base_model_prefix !!!! def __init__( diff --git a/paddlenlp/transformers/llama/fusion_ops.py b/paddlenlp/transformers/llama/fusion_ops.py index 4cb9101cc730..62f3660a5bfe 100644 --- a/paddlenlp/transformers/llama/fusion_ops.py +++ b/paddlenlp/transformers/llama/fusion_ops.py @@ -177,8 +177,10 @@ def fusion_flash_attention( npu_is_casual=False, skip_recompute=False, ): - bsz, q_len, num_heads, head_dim = query_states.shape - _, kv_seq_len, _, _ = value_states.shape + # Note: + # 1. The head_dim of query_states and key_states should be the same. And the head_dim of value_states should be used for reshape. + bsz, q_len, num_heads, _ = query_states.shape + _, kv_seq_len, _, head_dim = value_states.shape version = paddle.version.full_version if version != "0.0.0" and version <= "2.5.2": if alibi is not None: @@ -206,8 +208,8 @@ def fusion_flash_attention( value_states, None, attention_mask, - [], - [], + None, + None, 0.0, attention_mask is None, True, diff --git a/paddlenlp/transformers/llama/modeling.py b/paddlenlp/transformers/llama/modeling.py index dc3318b621a2..c0308a7b1297 100755 --- a/paddlenlp/transformers/llama/modeling.py +++ b/paddlenlp/transformers/llama/modeling.py @@ -80,7 +80,7 @@ def swiglu(x, y=None): from .. import linear_utils from ..linear_utils import Linear from ..segment_parallel_utils import ReshardLayer -from ..utils import caculate_llm_flops +from ..utils import caculate_llm_per_token_flops from .configuration import ( LLAMA_PRETRAINED_INIT_CONFIGURATION, LLAMA_PRETRAINED_RESOURCE_FILES_MAP, @@ -259,17 +259,20 @@ def scaled_dot_product_attention( key_states = paddle.transpose(key_states, [0, 2, 1, 3]) value_states = paddle.transpose(value_states, [0, 2, 1, 3]) - # matmul and devide by sqrt(head_dim) - if get_env_device() == "intel_hpu": - # optimize div(const) to mul(const) for better performance - attn_weights = paddle.matmul(query_states * (1 / math.sqrt(head_dim)), key_states.transpose([0, 1, 3, 2])) + # Add pre divided factor to fix nan under float16. 
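# [Editor's annotation, not a patch line] The hunk below guards float16 attention against
# overflow: the QK^T logits are scaled down by an extra constant (32) before the matmul and
# the constant is multiplied back inside the float32 softmax, using the identity
# softmax(x) == softmax((x / c) * c). A minimal standalone sketch of that identity with toy
# shapes (assumes a GPU place, since float16 matmul is generally unavailable on CPU):
import math
import paddle

bsz, num_heads, q_len, head_dim = 2, 4, 8, 64
query = paddle.randn([bsz, num_heads, q_len, head_dim]).astype("float16")
key = paddle.randn([bsz, num_heads, q_len, head_dim]).astype("float16")

pre_divided_factor = 32  # matches the float16 value used in the hunk below; 1 otherwise
logits = paddle.matmul(
    query * (1.0 / (math.sqrt(head_dim) * pre_divided_factor)),
    key.transpose([0, 1, 3, 2]),
)
# Undo the pre-division in float32, so the result equals softmax(QK^T / sqrt(head_dim)).
probs = paddle.nn.functional.softmax(logits.astype("float32") * pre_divided_factor, axis=-1)
probs = probs.astype(query.dtype)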
+ if paddle.in_dynamic_mode() and query_states.dtype == paddle.float16: + pre_divided_factor = 32 else: - attn_weights = paddle.matmul(query_states / math.sqrt(head_dim), key_states.transpose([0, 1, 3, 2])) + pre_divided_factor = 1 + + attn_weights = paddle.matmul( + query_states * (1 / (math.sqrt(head_dim) * pre_divided_factor)), key_states.transpose([0, 1, 3, 2]) + ) # then add alibi bias if alibi is not None: alibi = alibi.reshape([bsz, num_heads, 1, -1]) - attn_weights = attn_weights + alibi + attn_weights = attn_weights + alibi / pre_divided_factor if paddle.in_dynamic_mode() and attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]: raise ValueError( @@ -297,7 +300,9 @@ def scaled_dot_product_attention( attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) else: with paddle.amp.auto_cast(False): - attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) + attn_weights = F.softmax(attn_weights.astype("float32") * pre_divided_factor, axis=-1).astype( + query_states.dtype + ) attn_output = paddle.matmul(attn_weights, value_states) attn_output = attn_output.transpose([0, 2, 1, 3]) @@ -1294,6 +1299,37 @@ class LlamaPretrainedModel(PretrainedModel): pretrained_resource_files_map = LLAMA_PRETRAINED_RESOURCE_FILES_MAP _keys_to_ignore_on_load_unexpected = [r"self_attn.rotary_emb.inv_freq"] + def _get_model_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=False, + ) + + def _get_hardware_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=self.config.recompute, + recompute_granularity=self.config.recompute_granularity, + ) + @classmethod def _get_name_mappings(cls, config: LlamaConfig) -> list[StateDictNameMapping]: mappings: list[StateDictNameMapping] = [] @@ -1536,39 +1572,6 @@ def __init__(self, config: LlamaConfig): self.gradient_checkpointing = False - def get_model_flops(self, batch_size=1, seq_length=None, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=False, - ) - - def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=recompute, - recompute_granularity=self.config.recompute_granularity, - ) - def get_input_embeddings(self): return self.embed_tokens @@ -2123,11 
+2126,30 @@ def forward( hidden_states = outputs[0] # [bs, seq_len, dim] - logits = self.lm_head(hidden_states) + if labels is not None and self.config.use_fused_linear_cross_entropy: + from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy + + assert ( + self.config.tensor_parallel_degree <= 1 + ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel " + + masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels) + + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) + else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count + logits = None + else: + logits = self.lm_head(hidden_states) - loss = None - if labels is not None: - loss = self.criterion(logits, labels) + loss = None + if labels is not None: + loss = self.criterion(logits, labels) if not return_dict: output = (logits,) + outputs[1:] diff --git a/paddlenlp/transformers/llama/modeling_auto.py b/paddlenlp/transformers/llama/modeling_auto.py index 8793cd106828..3edc22d601fd 100644 --- a/paddlenlp/transformers/llama/modeling_auto.py +++ b/paddlenlp/transformers/llama/modeling_auto.py @@ -52,7 +52,9 @@ def swiglu(x, y=None): CausalLMOutputWithCrossAttentions, ) from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model +from paddlenlp.utils.tools import get_env_device +from . import fusion_ops from .configuration import ( LLAMA_PRETRAINED_INIT_CONFIGURATION, LLAMA_PRETRAINED_RESOURCE_FILES_MAP, @@ -69,7 +71,6 @@ def swiglu(x, y=None): build_alibi_tensor, get_triangle_upper_mask, repeat_kv, - rms_norm_fused, ) try: @@ -218,7 +219,9 @@ def __init__(self, config, ipp): def forward(self, hidden_states): if self.config.use_fused_rms_norm: - return rms_norm_fused(hidden_states, self.weight, self.variance_epsilon) + return fusion_ops.fusion_rms_norm( + hidden_states, self.weight, self.variance_epsilon, self.config.use_fast_layer_norm + ) with paddle.amp.auto_cast(False): variance = hidden_states.astype("float32").pow(2).mean(-1, keepdim=True) @@ -308,7 +311,7 @@ def __init__(self, config: LlamaConfig, layerwise_recompute: bool = False, ipp: self.ipp = ipp self.use_fused_rope = config.use_fused_rope - if self.use_fused_rope: + if self.use_fused_rope and get_env_device() not in ["npu", "mlu", "xpu", "gcu", "intel_hpu"]: if "gpu" not in paddle.device.get_device() or fused_rotary_position_embedding is None: warnings.warn( "Enable fuse rope in the config, but fuse rope is not available. 
" @@ -519,7 +522,7 @@ def forward( # repeat k/v heads if n_kv_heads < n_heads # paddle version > 2.6 or develop support flash-attn with gqa/mqa paddle_version = float(paddle.__version__[:3]) - if (paddle_version != 0.0) and (paddle_version <= 2.6): + if not self.config.use_flash_attention or (paddle_version != 0.0) and (paddle_version <= 2.6): key_states = repeat_kv(key_states, self.num_key_value_groups) value_states = repeat_kv(value_states, self.num_key_value_groups) @@ -935,7 +938,22 @@ def _prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values else: expanded_attn_mask = _make_causal_mask(input_shape, past_key_values_length=past_key_values_length) # Convert bool attention_mask to float attention mask, which will be added to attention_scores later - expanded_attn_mask = paddle.where(expanded_attn_mask, 0.0, paddle.finfo(dtype).min).astype(dtype) + if get_env_device() in ["npu", "mlu", "intel_hpu"]: + x = paddle.to_tensor(0.0, dtype="float32") + y = paddle.to_tensor(paddle.finfo(dtype).min, dtype="float32") + expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), x, y).astype(dtype) + elif get_env_device() == "xpu": + x = paddle.to_tensor(0.0, dtype="float32") + y = paddle.to_tensor(-1.7005809656952787e38, dtype="float32") + expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), x, y) + elif get_env_device() == "gcu": + min_val = paddle.finfo(dtype).min + x = paddle.to_tensor(0.0, dtype=dtype) + y = paddle.to_tensor(min_val, dtype=dtype) + expanded_attn_mask = paddle.where(expanded_attn_mask.cast("bool"), x, y).astype(dtype) + else: + expanded_attn_mask = paddle.where(expanded_attn_mask, 0.0, paddle.finfo(dtype).min) + expanded_attn_mask = expanded_attn_mask.astype(dtype) return expanded_attn_mask def forward( @@ -1166,8 +1184,27 @@ def forward(self, prediction_scores, masked_lm_labels): masked_lm_labels.unsqueeze(2), ) - masked_lm_loss = paddle.masked_select(masked_lm_loss, masked_lm_loss > 0).astype("float32") - loss = paddle.mean(masked_lm_loss) + # XPU dose not support allgather mask with bool dtype, so we use LocalLayer here. 
+ if get_env_device() == "xpu": + + class LocalLossLayer(paddle.distributed.LocalLayer): + def __init__(self, out_dist_attrs): + super().__init__(out_dist_attrs) + + def forward(self, x, mask): + masked_lm_loss = paddle.masked_select(x, mask).astype("float32") + loss = paddle.mean(masked_lm_loss) + return loss + + out_dist_attrs = [ + (masked_lm_loss.process_mesh, [dist.Partial(dist.ReduceType.kRedSum), dist.Replicate()]), + ] + loss_func = LocalLossLayer(out_dist_attrs) + loss = loss_func(masked_lm_loss, masked_lm_loss > 0) + else: + masked_lm_loss = paddle.masked_select(masked_lm_loss, masked_lm_loss > 0).astype("float32") + loss = paddle.mean(masked_lm_loss) + return loss @@ -1175,6 +1212,7 @@ class LlamaLMHeadAuto(nn.Layer): def __init__(self, config: LlamaConfig): super(LlamaLMHeadAuto, self).__init__() self.config = config + vocab_size = config.vocab_size self.weight = self.create_parameter( shape=[config.hidden_size, vocab_size], diff --git a/paddlenlp/transformers/llama/modeling_network.py b/paddlenlp/transformers/llama/modeling_network.py index ed309016c882..414896c79c0b 100644 --- a/paddlenlp/transformers/llama/modeling_network.py +++ b/paddlenlp/transformers/llama/modeling_network.py @@ -57,7 +57,7 @@ def swiglu(x, y=None): LlamaDynamicNTKScalingRotaryEmbedding, LlamaLinearScalingRotaryEmbedding, LlamaNTKScalingRotaryEmbedding, - LlamaRotaryEmbedding, + Llama3RotaryEmbedding, _expand_2d_mask, _make_causal_mask, apply_rotary_pos_emb, @@ -296,7 +296,22 @@ def __init__(self, config: LlamaConfig, layerwise_recompute: bool = False): self.config = config def _init_rope(self): - if self.config.rope_scaling_type is None: + if ( + hasattr(self.config, "rope_scaling") + and self.config.rope_scaling is not None + and self.config.rope_scaling.get("rope_type", None) == "llama3" + ): + self.rotary_emb = Llama3RotaryEmbedding( + self.head_dim, + max_position_embeddings=self.max_position_embeddings, + base=self.config.rope_theta, + factor=self.config.rope_scaling["factor"], + high_freq_factor=self.config.rope_scaling["high_freq_factor"], + low_freq_factor=self.config.rope_scaling["low_freq_factor"], + original_max_position_embeddings=self.config.rope_scaling["original_max_position_embeddings"], + ) + + elif self.config.rope_scaling_type is None: self.rotary_emb = LlamaRotaryEmbedding( self.head_dim, max_position_embeddings=self.max_position_embeddings, @@ -422,7 +437,7 @@ def forward( # repeat k/v heads if n_kv_heads < n_heads # paddle version > 2.6 or develop support flash-attn with gqa/mqa paddle_version = float(paddle.__version__[:3]) - if (paddle_version != 0.0) and (paddle_version <= 2.6): + if not self.config.use_flash_attention or (paddle_version != 0.0) and (paddle_version <= 2.6): key_states = repeat_kv(key_states, self.num_key_value_groups) value_states = repeat_kv(value_states, self.num_key_value_groups) @@ -656,6 +671,10 @@ class LlamaPretrainedModelNet(PretrainedModel): pretrained_resource_files_map = LLAMA_PRETRAINED_RESOURCE_FILES_MAP _keys_to_ignore_on_load_unexpected = [r"self_attn.rotary_emb.inv_freq"] + # TODO(): wa that loading weight first, then parallelize. 
+ @classmethod + def _get_tensor_parallel_mappings(cls, config, is_split): + return {} @register_base_model class LlamaModelNet(LlamaPretrainedModelNet): @@ -1053,7 +1072,7 @@ def auto_dist_config(self, prefix=""): }, "mp_config": { "parallelize_plan": { - f"{prefix}llama.embed_tokens": dist.ColWiseParallel(gather_output=True), + f"{prefix}llama.embed_tokens": dist.RowWiseParallel(), f"{prefix}llama.layers.*.self_attn.qkv_proj": dist.ColWiseParallel(), f"{prefix}llama.layers.*.self_attn.q_proj": dist.ColWiseParallel(), f"{prefix}llama.layers.*.self_attn.k_proj": dist.ColWiseParallel(), diff --git a/paddlenlp/transformers/llama/modeling_pp.py b/paddlenlp/transformers/llama/modeling_pp.py index 3392a460c573..a3f21326626c 100644 --- a/paddlenlp/transformers/llama/modeling_pp.py +++ b/paddlenlp/transformers/llama/modeling_pp.py @@ -147,6 +147,41 @@ def forward(self, args): _type_: _description_ """ input_ids, attention_mask, attn_mask_startend_row_indices, position_ids, alibi = parse_args(args) + + # we can't distinguish + if self.config.alibi and alibi is None and position_ids is None and attn_mask_startend_row_indices is not None: + # input_ids, attention_mask, alibi + alibi = attn_mask_startend_row_indices + position_ids = None + attn_mask_startend_row_indices = None + elif ( + self.config.alibi + and alibi is None + and position_ids is not None + and attn_mask_startend_row_indices is not None + ): + # input_ids, attention_mask, position_ids, alibi + alibi = position_ids + position_ids = attn_mask_startend_row_indices + attn_mask_startend_row_indices = None + elif not self.config.alibi: + if get_env_device() in ["gpu"]: + if attention_mask is not None and attention_mask.dtype == paddle.int32: + attention_mask, attn_mask_startend_row_indices, position_ids = ( + None, + attention_mask, + attn_mask_startend_row_indices, + ) + elif attention_mask is not None and attention_mask.dtype == paddle.int64: + attention_mask, attn_mask_startend_row_indices, position_ids = None, None, attention_mask + elif ( + attn_mask_startend_row_indices is not None and attn_mask_startend_row_indices.dtype == paddle.int64 + ): + attn_mask_startend_row_indices, position_ids = None, attn_mask_startend_row_indices + elif position_ids is None and attn_mask_startend_row_indices is not None: + position_ids = attn_mask_startend_row_indices + attn_mask_startend_row_indices = None + input_embeds = self.embed_tokens(input_ids) if self.sequence_parallel: from paddlenlp.transformers import ScatterOp @@ -315,6 +350,9 @@ class LlamaForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): _get_fuse_or_split_param_mappings = LlamaPretrainedModel._get_fuse_or_split_param_mappings _init_weights = LlamaPretrainedModel._init_weights _keys_to_ignore_on_load_unexpected = LlamaPretrainedModel._keys_to_ignore_on_load_unexpected + _get_model_flops = LlamaPretrainedModel._get_model_flops + _get_hardware_flops = LlamaPretrainedModel._get_hardware_flops + _tied_weights_keys = ["lm_head.weight"] # DONOT Add base_model_prefix !!!! diff --git a/paddlenlp/transformers/llm_embed/__init__.py b/paddlenlp/transformers/llm_embed/__init__.py new file mode 100644 index 000000000000..0f0d00141b52 --- /dev/null +++ b/paddlenlp/transformers/llm_embed/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .modeling import * diff --git a/paddlenlp/transformers/llm_embed/modeling.py b/paddlenlp/transformers/llm_embed/modeling.py new file mode 100644 index 000000000000..b50128e5c8f2 --- /dev/null +++ b/paddlenlp/transformers/llm_embed/modeling.py @@ -0,0 +1,298 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass +from typing import Dict, List, Optional, Union + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from tqdm import tqdm + +from ...utils.log import logger +from .. import AutoConfig, AutoModel, PretrainedModel +from ..model_outputs import ModelOutput + + +@dataclass +class EncoderOutput(ModelOutput): + q_reps: Optional[paddle.Tensor] = None + p_reps: Optional[paddle.Tensor] = None + loss: Optional[paddle.Tensor] = None + scores: Optional[paddle.Tensor] = None + + +__all__ = ["BiEncoderModel"] + + +class BiEncoderModel(PretrainedModel): + def __init__( + self, + model_name_or_path: str = None, + dtype: str = "float16", + normalized: bool = False, + sentence_pooling_method: str = "cls", + negatives_cross_device: bool = False, + temperature: float = 1.0, + use_inbatch_neg: bool = True, + margin: float = 0.3, + matryoshka_dims: Optional[List[int]] = None, + matryoshka_loss_weights: Optional[List[float]] = None, + query_instruction: Optional[str] = None, + document_instruction: Optional[str] = None, + eval_batch_size: int = 8, + tokenizer=None, + max_seq_length: int = 4096, + ): + super().__init__() + self.model = AutoModel.from_pretrained(model_name_or_path, dtype=dtype, convert_from_torch=True) + self.model_config = AutoConfig.from_pretrained(model_name_or_path) + self.cross_entropy = nn.CrossEntropyLoss(reduction="mean") + + self.normalized = normalized + self.sentence_pooling_method = sentence_pooling_method + self.temperature = temperature + self.use_inbatch_neg = use_inbatch_neg + self.config = self.model_config + self.margin = margin + self.matryoshka_dims = matryoshka_dims + + self.query_instruction = query_instruction + self.document_instruction = document_instruction + self.eval_batch_size = eval_batch_size + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + + if self.matryoshka_dims: + self.matryoshka_loss_weights = ( + matryoshka_loss_weights if matryoshka_loss_weights else [1] * len(self.matryoshka_dims) + ) + else: + self.matryoshka_loss_weights = None + + if not normalized: + self.temperature = 1.0 + logger.info("reset temperature = 1.0 due to using inner product to compute similarity") + + 
self.negatives_cross_device = negatives_cross_device + if self.negatives_cross_device: + if not dist.is_initialized(): + raise ValueError("Distributed training has not been initialized for representation all gather.") + self.process_rank = dist.get_rank() + self.world_size = dist.get_world_size() + + def sentence_embedding(self, hidden_state, mask): + if self.sentence_pooling_method == "mean": + s = paddle.sum(hidden_state * mask.unsqueeze(-1).float(), axis=1) + d = mask.sum(axis=1, keepdim=True).float() + return s / d + elif self.sentence_pooling_method == "cls": + return hidden_state[:, 0] + elif self.sentence_pooling_method == "last": + # return hidden_state[:, -1] # this is for padding side is left + sequence_lengths = mask.sum(axis=1) + last_token_indices = sequence_lengths - 1 + embeddings = hidden_state[paddle.arange(hidden_state.shape[0]), last_token_indices] + return embeddings + else: + raise ValueError(f"Invalid sentence pooling method: {self.sentence_pooling_method}") + + def get_model_config( + self, + ): + return self.model_config.to_dict() + + def encode(self, features): + psg_out = self.model(**features, return_dict=True, output_hidden_states=True) + p_reps = self.sentence_embedding(psg_out.hidden_states[-1], features["attention_mask"]) + return p_reps + + def compute_similarity(self, q_reps, p_reps): + # q_reps [batch_size, embedding_dim] + # p_reps [batch_size, embedding_dim] + return paddle.matmul(q_reps, p_reps.transpose([1, 0])) + + def hard_negative_loss(self, q_reps, p_reps): + scores = self.compute_similarity(q_reps, p_reps) + scores = scores / self.temperature + scores = scores.reshape([q_reps.shape[0], -1]) + + target = paddle.arange(scores.shape[0], dtype="int64") + target = target * (p_reps.shape[0] // q_reps.shape[0]) + loss = self.compute_loss(scores, target) + return scores, loss + + def in_batch_negative_loss(self, q_reps, p_reps): + # In batch negatives + scores = self.compute_similarity(q_reps, p_reps) + # Substract margin from all positive samples cosine_sim() + margin_diag = paddle.full(shape=[q_reps.shape[0]], fill_value=self.margin, dtype=q_reps.dtype) + scores = scores - paddle.diag(margin_diag) + # Scale cosine to ease training converge + scores = scores / self.temperature + target = paddle.arange(0, q_reps.shape[0], dtype="int64") + loss = self.compute_loss(scores, target) + return scores, loss + + def forward( + self, + query: Dict[str, paddle.Tensor] = None, + passage: Dict[str, paddle.Tensor] = None, + teacher_score: paddle.Tensor = None, + ): + q_reps = self.encode(query) + p_reps = self.encode(passage) + + # For non-matryoshka loss, we normalize the representations + if not self.matryoshka_dims: + if self.normalized: + q_reps = paddle.nn.functional.normalize(q_reps, axis=-1) + p_reps = paddle.nn.functional.normalize(p_reps, axis=-1) + + if self.training: + # Cross device negatives + if self.negatives_cross_device: + q_reps = self._dist_gather_tensor(q_reps) + p_reps = self._dist_gather_tensor(p_reps) + + if self.matryoshka_dims: + loss = 0.0 + scores = 0.0 + for loss_weight, dim in zip(self.matryoshka_loss_weights, self.matryoshka_dims): + reduced_q = q_reps[:, :dim] + reduced_d = p_reps[:, :dim] + if self.normalized: + reduced_q = paddle.nn.functional.normalize(reduced_q, axis=-1) + reduced_d = paddle.nn.functional.normalize(reduced_d, axis=-1) + + if self.use_inbatch_neg: + dim_score, dim_loss = self.in_batch_negative_loss(reduced_q, reduced_d) + else: + dim_score, dim_loss = self.hard_negative_loss(reduced_q, reduced_d) + scores += dim_score 
+ loss += loss_weight * dim_loss + + elif self.use_inbatch_neg: + scores, loss = self.in_batch_negative_loss(q_reps, p_reps) + else: + scores, loss = self.hard_negative_loss(q_reps, p_reps) + + else: + scores = self.compute_similarity(q_reps, p_reps) + loss = None + return EncoderOutput( + loss=loss, + scores=scores, + q_reps=q_reps, + p_reps=p_reps, + ) + + def compute_loss(self, scores, target): + return self.cross_entropy(scores, target) + + def _dist_gather_tensor(self, t: Optional[paddle.Tensor]): + if t is None: + return None + + all_tensors = [paddle.empty_like(t) for _ in range(self.world_size)] + dist.all_gather(all_tensors, t) + + all_tensors[self.process_rank] = t + all_tensors = paddle.concat(all_tensors, axis=0) + + return all_tensors + + def save_pretrained(self, output_dir: str, **kwargs): + state_dict = self.model.state_dict() + state_dict = type(state_dict)({k: v.clone().cpu() for k, v in state_dict.items()}) + self.model.save_pretrained(output_dir, state_dict=state_dict) + + @paddle.no_grad() + def encode_sentences(self, sentences: List[str], **kwargs) -> np.ndarray: + self.model.eval() + all_embeddings = [] + for start_index in tqdm(range(0, len(sentences), self.eval_batch_size), desc="Batches"): + sentences_batch = sentences[start_index : start_index + self.eval_batch_size] + + inputs = self.tokenizer( + sentences_batch, + padding=True, + truncation=True, + return_tensors="pd", + max_length=self.max_seq_length, + return_attention_mask=True, + ) + outputs = self.model( + input_ids=inputs.input_ids, + attention_mask=inputs.attention_mask, + return_dict=True, + output_hidden_states=True, + ) + last_hidden_state = outputs.hidden_states[-1] + + if self.sentence_pooling_method == "last": + if self.tokenizer.padding_side == "right": + sequence_lengths = inputs.attention_mask.sum(axis=1) + last_token_indices = sequence_lengths - 1 + embeddings = last_hidden_state[paddle.arange(last_hidden_state.shape[0]), last_token_indices] + elif self.tokenizer.padding_side == "left": + embeddings = last_hidden_state[:, -1] + else: + raise NotImplementedError(f"Padding side {self.tokenizer.padding_side} not supported.") + elif self.sentence_pooling_method == "cls": + embeddings = last_hidden_state[:, 1] + elif self.sentence_pooling_method == "mean": + s = paddle.sum(last_hidden_state * inputs.attention_mask.unsqueeze(-1), axis=1) + d = inputs.attention_mask.sum(axis=1, keepdim=True) + embeddings = s / d + else: + raise NotImplementedError(f"Pooling method {self.pooling_method} not supported.") + + embeddings = paddle.nn.functional.normalize(embeddings, p=2, axis=-1) + + all_embeddings.append(embeddings.cpu().numpy().astype("float32")) + + return np.concatenate(all_embeddings, axis=0) + + def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray: + """ + This function will be used to encode queries for retrieval task + if there is a instruction for queries, we will add it to the query text + """ + if self.query_instruction is not None: + input_texts = [f"{self.query_instruction}{query}" for query in queries] + else: + input_texts = queries + return self.encode_sentences(input_texts) + + def encode_corpus(self, corpus: List[Union[Dict[str, str], str]], **kwargs) -> np.ndarray: + """ + This function will be used to encode corpus for retrieval task + if there is a instruction for docs, we will add it to the doc text + """ + if isinstance(corpus[0], dict): + if self.document_instruction is not None: + input_texts = [ + "{}{} {}".format(self.document_instruction, doc.get("title", ""), 
doc["text"]).strip() + for doc in corpus + ] + else: + input_texts = ["{} {}".format(doc.get("title", ""), doc["text"]).strip() for doc in corpus] + else: + if self.document_instruction is not None: + input_texts = [f"{self.document_instruction}{doc}" for doc in corpus] + else: + input_texts = corpus + return self.encode_sentences(input_texts) diff --git a/paddlenlp/transformers/luke/tokenizer.py b/paddlenlp/transformers/luke/tokenizer.py index 56f8242b87ce..3607879d34e8 100644 --- a/paddlenlp/transformers/luke/tokenizer.py +++ b/paddlenlp/transformers/luke/tokenizer.py @@ -27,7 +27,7 @@ import warnings from itertools import repeat -from .. import RobertaBPETokenizer +from ..roberta.tokenizer import RobertaBPETokenizer try: from functools import lru_cache diff --git a/paddlenlp/transformers/model_outputs.py b/paddlenlp/transformers/model_outputs.py index 5700522746ab..82de522ef534 100644 --- a/paddlenlp/transformers/model_outputs.py +++ b/paddlenlp/transformers/model_outputs.py @@ -662,6 +662,10 @@ class BaseModelOutputWithPastAndCrossAttentions(ModelOutput): Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads. + cum_offsets (`tuple(paddle.Tensor)`, *optional*, needed when `return_full_hidden_states=True`: + Tuple of `paddle.Tensor` (one for each layer) of shape `(batch_size, 1)`. + + Offset of the current batch. """ last_hidden_state: paddle.Tensor = None @@ -669,6 +673,7 @@ class BaseModelOutputWithPastAndCrossAttentions(ModelOutput): hidden_states: Optional[Tuple[paddle.Tensor]] = None attentions: Optional[Tuple[paddle.Tensor]] = None cross_attentions: Optional[Tuple[paddle.Tensor]] = None + cum_offsets: Optional[Tuple[paddle.Tensor]] = None @dataclass diff --git a/paddlenlp/transformers/model_utils.py b/paddlenlp/transformers/model_utils.py index 9c8f5f896ac6..5249da1d878a 100644 --- a/paddlenlp/transformers/model_utils.py +++ b/paddlenlp/transformers/model_utils.py @@ -424,7 +424,7 @@ def load_state_dict( with safe_open(checkpoint_file, framework="np") as f: metadata = f.metadata() if metadata is None: - metadata = {"format", "np"} + metadata = {"format": "np"} if metadata.get("format", "np") not in ["pd", "np"]: raise OSError( @@ -1161,7 +1161,7 @@ def set_inference_config(cls, config, predictor_args, **kwargs): tensor_parallel_degree = kwargs.pop("tensor_parallel_degree", 1) tensor_parallel_rank = kwargs.pop("tensor_parallel_rank", 0) - if predictor_args.mode == "dynamic": + if predictor_args.mode == "dynamic" or predictor_args.speculate_method in ["eagle", "mtp"]: config.tensor_parallel_degree = tensor_parallel_degree config.tensor_parallel_rank = tensor_parallel_rank config.model_name_or_path = predictor_args.model_name_or_path @@ -1203,7 +1203,12 @@ def set_inference_config(cls, config, predictor_args, **kwargs): config.speculate_max_ngram_size = predictor_args.speculate_max_ngram_size config.speculate_verify_window = predictor_args.speculate_verify_window config.speculate_max_candidate_len = predictor_args.speculate_max_candidate_len - config.decode_strategy = "speculate_decoding" + if predictor_args.speculate_method is not None: + if config.get("speculate_model_type", "None") in ["eagle", "mtp"]: + config.decode_strategy = "draft_model_sample" + else: + config.decode_strategy = "speculate_decoding" + config.return_full_hidden_states = predictor_args.return_full_hidden_states @classmethod def confirm_inference_model(cls, predictor_args, **kwargs): @@ -1287,18 +1292,16 @@ def 
get_memory_footprint(self, return_buffers=True): return mem def get_model_flops(self, *args, **kwargs): - base_model = getattr(self, self.base_model_prefix, self) - if base_model is not self: - return base_model.get_model_flops() + if hasattr(self, "_get_model_flops"): + return self._get_model_flops() - raise NotImplementedError(f"model of {type(base_model)} has not implemented the `get_model_flops`") + raise NotImplementedError(f"model of {type(self)} has not implemented the `_get_model_flops`") def get_hardware_flops(self, *args, **kwargs): - base_model = getattr(self, self.base_model_prefix, self) - if base_model is not self: - return base_model.get_hardware_flops() + if hasattr(self, "_get_hardware_flops"): + return self._get_hardware_flops() - raise NotImplementedError(f"model of {type(base_model)} has not implemented the `get_hardware_flops`") + raise NotImplementedError(f"model of {type(self)} has not implemented the `_get_hardware_flops`") def get_input_embeddings(self) -> nn.Embedding: """get input embedding of model diff --git a/paddlenlp/transformers/moe_gate.py b/paddlenlp/transformers/moe_gate.py index 8118ba60f7ac..995226de893b 100644 --- a/paddlenlp/transformers/moe_gate.py +++ b/paddlenlp/transformers/moe_gate.py @@ -69,7 +69,11 @@ def _one_hot_to_int64(self, x, num_classes): @paddle.no_grad() def _capacity( - self, gates: paddle.Tensor, capacity_factor: float, max_capacity: int, min_capacity: int + self, + gates: paddle.Tensor, + capacity_factor: float, + max_capacity: int, + min_capacity: int, ) -> paddle.Tensor: """Calculate the capacity for each expert based on the gates and capacity factor. @@ -107,6 +111,7 @@ def _cal_aux_loss(self, gates, mask): paddle.Tensor: The value of auxiliary loss. """ + # TODO: @DrownFish19 update aux_loss for Qwen2MoE and DeepSeekV2&V3 me = paddle.mean(gates, axis=0) ce = paddle.mean(mask.cast("float32"), axis=0) if self.global_aux_loss: @@ -131,7 +136,7 @@ def _cal_z_loss(self, logits) -> paddle.Tensor: Returns: paddle.Tensor: The z loss value. 
""" - l_zloss = logits.exp().sum(1).log().square().mean() + l_zloss = paddle.logsumexp(logits, axis=1).square().mean() return l_zloss def _cal_orthogonal_loss(self) -> paddle.Tensor: @@ -175,8 +180,14 @@ def __init__(self, config, num_experts, expert_hidden_size, **kwargs): self.top2_2nd_expert_sampling = kwargs.pop("top2_2nd_expert_sampling", True) self.drop_policy = kwargs.pop("drop_policy", "probs") + # Qwen2MoE: greedy + # DeepSeekV2&V3: group_limited_greedy for training, and noaux_tc for inference + self.topk_method = kwargs.pop("topk_method", "greedy") self.top_k = kwargs.pop("top_k", 2) + self.n_group = kwargs.pop("n_group", 1) # for group_limited_greedy + self.topk_group = kwargs.pop("topk_group", 1) # for group_limited_greedy self.norm_topk_prob = kwargs.pop("norm_topk_prob", False) + self.routed_scaling_factor = kwargs.pop("routed_scaling_factor", 1.0) def _priority(self, topk_idx: paddle.Tensor, capacity: int) -> paddle.Tensor: """_summary_ @@ -228,7 +239,7 @@ def _priority(self, topk_idx: paddle.Tensor, capacity: int) -> paddle.Tensor: return dispatch_mask - def topk_naive(self, scores: paddle.Tensor, k: int) -> Tuple[paddle.Tensor, paddle.Tensor]: + def _topk_greedy(self, scores: paddle.Tensor, k: int) -> Tuple[paddle.Tensor, paddle.Tensor]: """_summary_ Args: @@ -240,10 +251,10 @@ def topk_naive(self, scores: paddle.Tensor, k: int) -> Tuple[paddle.Tensor, padd topk_weight: [bsz*seq_len, k] topk_idx: [bsz*seq_len, k] """ - topk_weight, topk_idx = paddle.topk(scores, k=k, axis=-1, sorted=False) + topk_weight, topk_idx = paddle.topk(scores, k=k, axis=-1, sorted=True) return topk_weight, topk_idx - def topk_group( + def _topk_group_limited_greedy( self, scores: paddle.Tensor, k: int, n_group: int, topk_group: int ) -> Tuple[paddle.Tensor, paddle.Tensor]: """_summary_ @@ -275,6 +286,43 @@ def topk_group( return topk_weight, topk_idx + def _topk_noaux_tc( + self, scores: paddle.Tensor, k: int, n_group: int, topk_group: int + ) -> Tuple[paddle.Tensor, paddle.Tensor]: + """_summary_ + + Args: + scores (paddle.Tensor): [bsz*seq_len, n_experts] + k (int): select the top k experts in each group + n_groups (int): the number of groups for all experts + topk_group (int): the number of groups selected + + Returns: + Tuple[paddle.Tensor, paddle.Tensor]: topk_weight, topk_idx + topk_weight: [bsz*seq_len, k] + topk_idx: [bsz*seq_len, k] + + Note: the group size is normal greater than the number of k + """ + bsz_seq_len, n_experts = scores.shape + assert n_experts % n_group == 0, "n_experts must be divisible by n_groups" + + assert self.e_score_correction_bias is not None, "e_score_correction_bias is None" + scores_for_choice = scores.reshape([bsz_seq_len, -1]) + self.e_score_correction_bias.unsqueeze(0) + group_scores = ( + scores_for_choice.reshape([bsz_seq_len, self.n_group, -1]).topk(2, axis=-1)[0].sum(axis=-1) + ) # fmt:skip [n, n_group] + group_idx = paddle.topk(group_scores, k=topk_group, axis=-1, sorted=False)[1] # [n, top_k_group] + group_mask = paddle.zeros_like(group_scores).put_along_axis(group_idx, paddle.to_tensor(1.0), axis=-1) # fmt:skip + score_mask = ( + group_mask.unsqueeze(-1).expand([bsz_seq_len, n_group, n_experts // n_group]).reshape([bsz_seq_len, -1]) + ) # [n, e] + tmp_scores = scores_for_choice * score_mask # [n, e] + topk_weight, topk_idx = paddle.topk(tmp_scores, k=k, axis=-1, sorted=False) + topk_weight = scores.take_along_axis(topk_idx, axis=1) if not self.training else topk_weight + + return topk_weight, topk_idx + def top1gating( self, logits: paddle.Tensor, @@ 
-432,7 +480,22 @@ def topkgating( l_zloss = self._cal_z_loss(gates) # get topk gates - top_gate, top_idx = paddle.topk(gates, k=self.top_k, axis=1) + if self.topk_method == "greedy": + top_gate, top_idx = self._topk_greedy(gates, k=self.top_k) + elif self.topk_method == "group_limited_greedy": + top_gate, top_idx = self._topk_group_limited_greedy( + gates, k=self.top_k, n_group=self.n_group, topk_group=self.topk_group + ) + elif self.topk_method == "noaux_tc": + top_gate, top_idx = self._topk_noaux_tc( + gates, k=self.top_k, n_group=self.n_group, topk_group=self.topk_group + ) + # norm gate to sum 1 + if self.top_k > 1 and self.norm_topk_prob: + denominator = top_gate.sum(axis=-1, keepdim=True) + 1e-20 + top_gate = top_gate / denominator + top_gate = top_gate * self.routed_scaling_factor + # get topk mask mask = paddle.zeros_like(gates).put_along_axis(top_idx, paddle.to_tensor(1.0), axis=1) l_aux = self._cal_aux_loss(gates, mask) @@ -441,7 +504,12 @@ def topkgating( if self.drop_tokens: # Calculate configured capacity and remove locations outside capacity from mask - capacity = self._capacity(gates, self.capacity_factor * self.top_k, self.max_capacity, self.min_capacity) + capacity = self._capacity( + gates, + self.capacity_factor * self.top_k, + self.max_capacity, + self.min_capacity, + ) # update mask and locations by capacity if self.drop_policy == "probs": @@ -462,13 +530,21 @@ def topkgating( token_priority = self._priority(top_idx, capacity) # normalize gates - gates_masked = gates * mask - gates_s = paddle.sum(gates_masked, axis=-1, keepdim=True) - denom_s = paddle.clip(gates_s, min=paddle.finfo(gates_masked.dtype).eps) - if self.norm_topk_prob: - gates_masked = gates_masked / denom_s + if self.training: + gates_masked = gates * mask + gates_s = paddle.sum(gates_masked, axis=-1, keepdim=True) + denom_s = paddle.clip(gates_s, min=paddle.finfo(gates_masked.dtype).eps) + if self.norm_topk_prob: + gates_masked = gates_masked / denom_s + combine_weights = paddle.einsum( + "se,sec->sec", gates_masked, token_priority.cast(paddle.get_default_dtype()) + ) + else: + topk_masked_gates = paddle.zeros_like(gates).put_along_axis(top_idx, top_gate, axis=1) + combine_weights = paddle.einsum( + "se,sec->sec", topk_masked_gates, token_priority.cast(paddle.get_default_dtype()) + ) - combine_weights = paddle.einsum("se,sec->sec", gates_masked, token_priority.cast(paddle.get_default_dtype())) dispatch_mask = combine_weights.cast(paddle.bool) return capacity, combine_weights, dispatch_mask, exp_counts, l_aux, l_zloss diff --git a/paddlenlp/transformers/moe_layer.py b/paddlenlp/transformers/moe_layer.py index 56369c6c3b92..90d4feae6c72 100644 --- a/paddlenlp/transformers/moe_layer.py +++ b/paddlenlp/transformers/moe_layer.py @@ -162,12 +162,14 @@ def __init__( self.moe_num_experts_per_device = self._parse_moe_expert_parallel( self.moe_num_experts, self.expert_parallel_degree ) + self.is_dummy_moe = False if self.expert_parallel_degree > 1 else True else: # when moe_group is dummy, we don't need to use all_to_all self.moe_group = None self.moe_rank = 0 self.expert_parallel_degree = 1 self.moe_num_experts_per_device = self.moe_num_experts + self.is_dummy_moe = True self.all_to_all_dropout = all_to_all_dropout self.enable_recompute = False @@ -175,12 +177,13 @@ def __init__( self.experts = nn.LayerList([]) for i in range(self.moe_num_experts): if i // self.moe_num_experts_per_device == self.moe_rank: - self.experts.append(expert_class(expert_kwargs)) + self.experts.append(expert_class(**expert_kwargs)) else: 
self.experts.append(None) self.gate = gate self.gate.group = self.moe_group + self._post_init() def _parse_moe_expert_parallel(self, moe_num_experts, expert_parallel_degree): assert ( diff --git a/paddlenlp/transformers/nv_embed/__init__.py b/paddlenlp/transformers/nv_embed/__init__.py new file mode 100644 index 000000000000..0f0d00141b52 --- /dev/null +++ b/paddlenlp/transformers/nv_embed/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .modeling import * diff --git a/paddlenlp/transformers/nv_embed/modeling.py b/paddlenlp/transformers/nv_embed/modeling.py new file mode 100644 index 000000000000..98004ac9428c --- /dev/null +++ b/paddlenlp/transformers/nv_embed/modeling.py @@ -0,0 +1,530 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass +from typing import Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +import tqdm +from paddle.distributed.fleet.utils import recompute + +from ...utils.log import logger +from .. 
import AutoTokenizer, MistralModel, PretrainedConfig, PretrainedModel +from ..model_outputs import BaseModelOutputWithPast, ModelOutput + +__all__ = ["NVEncodeModel"] + + +@dataclass +class EncoderOutput(ModelOutput): + q_reps: Optional[paddle.Tensor] = None + p_reps: Optional[paddle.Tensor] = None + loss: Optional[paddle.Tensor] = None + scores: Optional[paddle.Tensor] = None + + +def scaled_dot_product_attention(q, k, v): # [bs, len, num_heads, dim] + matmul_qk = paddle.matmul(q.transpose([0, 2, 1, 3]), k.transpose([0, 2, 3, 1])) + dk = paddle.to_tensor(k.shape[-1], dtype=paddle.float32) + scaled_attention_logits = matmul_qk / paddle.sqrt(dk) + attention_weights = paddle.nn.functional.softmax(scaled_attention_logits, axis=-1) # [bs, num_heads, q_len, k_len] + output = paddle.matmul(attention_weights, v.transpose([0, 2, 1, 3])) # [bs, num_heads, q_len, dim] + output = output.transpose([0, 2, 1, 3]) # [bs, q_len, num_heads, dim] + return output + + +def _make_bidirection_mask( + input_ids_shape: paddle.shape, + dtype: paddle.dtype, + past_key_values_length: int = 0, +): + """ + Make bidirection mask used for sliding window attention + """ + bsz, tgt_len = input_ids_shape + + tensor = paddle.full( + (tgt_len, tgt_len), + fill_value=1, + ) + mask = paddle.tril(tensor, diagonal=0) + mask = paddle.ones_like(mask) # here is for bidirection attention + mask = paddle.log(mask).astype(dtype) + + if past_key_values_length > 0: + mask = paddle.concat([paddle.zeros([tgt_len, past_key_values_length], dtype=dtype), mask], axis=-1) + return mask[None, None, :, :].expand([bsz, 1, tgt_len, tgt_len + past_key_values_length]) + + +def _expand_mask(mask: paddle.Tensor, dtype: paddle.dtype, tgt_len): + expanded_mask = mask + if len(mask.shape) == 2: + """ + Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. + """ + bsz, src_len = mask.shape + tgt_len = tgt_len if tgt_len is not None else src_len + + expanded_mask = mask[:, None, None, :].expand([bsz, 1, tgt_len, src_len]).astype(dtype) + elif len(mask.shape) == 3: + """ + Expands attention_mask from `[bsz, tgt_seq_len, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. 
+ """ + expanded_mask = mask.unsqueeze(1).astype(dtype) + + inverted_mask = 1.0 - expanded_mask + + return paddle.where(inverted_mask > 0.5, paddle.full_like(inverted_mask, paddle.finfo(dtype).min), inverted_mask) + + +class LatentModel(PretrainedModel): + config_class = PretrainedConfig + + def __init__(self, config): + super().__init__(config) + + self.cross_attend_blocks_0_fn_to_kv = paddle.nn.Linear( + in_features=config.hidden_size, out_features=2 * config.max_position_embeddings, bias_attr=False + ) + self.cross_attend_blocks_0_fn_to_out = paddle.nn.Linear( + in_features=config.max_position_embeddings, out_features=config.hidden_size, bias_attr=False + ) + self.cross_attend_blocks_0_fn_to_q = paddle.nn.Linear( + in_features=config.hidden_size, out_features=config.max_position_embeddings, bias_attr=False + ) + self.cross_attend_blocks_0_norm = paddle.nn.LayerNorm(config.hidden_size) + self.cross_attend_blocks_0_norm_context = paddle.nn.LayerNorm(config.hidden_size) + + self.cross_attend_blocks_1_fn_net_0 = paddle.nn.Linear( + in_features=config.hidden_size, out_features=config.max_position_embeddings + ) + self.cross_attend_blocks_1_fn_net_2 = paddle.nn.Linear( + in_features=config.max_position_embeddings // 2, out_features=config.hidden_size + ) + self.cross_attend_blocks_1_norm = paddle.nn.LayerNorm(config.hidden_size) + + self.latents = paddle.nn.Linear(in_features=config.hidden_size, out_features=512, bias_attr=False) + + def forward(self, last_hidden_states, pool_mask): + one = paddle.eye( + num_rows=self.config.hidden_size, + num_columns=self.config.hidden_size, + dtype=self.latents.weight.dtype, + ) + self_latents_weight_T = self.latents(one).T + # latents = repeat(self_latents_weight_T, "d h -> b d h", b=last_hidden_states.shape[0]) # from einops import repeat + latents = paddle.tile(self_latents_weight_T, repeat_times=last_hidden_states.shape[0]).reshape( + [self_latents_weight_T.shape[0], last_hidden_states.shape[0], self_latents_weight_T.shape[1]] + ) + latents = latents.transpose([1, 0, 2]) + + normed_x = self.cross_attend_blocks_0_norm(last_hidden_states) + normed_context = self.cross_attend_blocks_0_norm_context(latents) + + q = self.cross_attend_blocks_0_fn_to_q(normed_x) + kv = self.cross_attend_blocks_0_fn_to_kv(normed_context) + k = kv[:, :, : self.config.max_position_embeddings] + v = kv[:, :, self.config.max_position_embeddings :] + + # q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b n h d", h=self.config.num_key_value_heads), (q, k, v)) # from einops import rearrange + q = q.reshape( + [q.shape[0], q.shape[1], self.config.num_key_value_heads, q.shape[2] // self.config.num_key_value_heads] + ) + k = k.reshape( + [k.shape[0], k.shape[1], self.config.num_key_value_heads, k.shape[2] // self.config.num_key_value_heads] + ) + v = v.reshape( + [v.shape[0], v.shape[1], self.config.num_key_value_heads, v.shape[2] // self.config.num_key_value_heads] + ) + + # k.stop_gradient = False + # v.stop_gradient = False + # out = paddle.nn.functional.scaled_dot_product_attention(q, k, v) # if use this, must set k and v stop_gradient to False + out = scaled_dot_product_attention(q, k, v) # if use this, no need to manually set k and v + # out = rearrange(out, "b n h d -> b n (h d)", h=self.config.num_key_value_heads) # from einops import rearrange + out = out.reshape([out.shape[0], out.shape[1], out.shape[2] * out.shape[3]]) + + out_of_layer1 = self.cross_attend_blocks_0_fn_to_out(out) + last_hidden_states + + normed_x = self.cross_attend_blocks_1_norm(out_of_layer1) + + 
before_geglu = self.cross_attend_blocks_1_fn_net_0(normed_x) + + x_in_gegle = before_geglu[:, :, : self.config.max_position_embeddings // 2] + gate_in_geglu = before_geglu[:, :, self.config.max_position_embeddings // 2 :] + x_after_geglu = x_in_gegle * paddle.nn.functional.gelu(gate_in_geglu) + + after_geglu = self.cross_attend_blocks_1_fn_net_2(x_after_geglu) + + out_of_layer2 = after_geglu + out_of_layer1 + + pool_mask = pool_mask.astype(out_of_layer2.dtype) + s = paddle.sum( + out_of_layer2 * pool_mask.unsqueeze(-1), + axis=1, + dtype=str(self.cross_attend_blocks_1_fn_net_2.weight.dtype).split(".")[-1], + ) + d = paddle.sum( + pool_mask, axis=1, keepdim=True, dtype=str(self.cross_attend_blocks_1_fn_net_2.weight.dtype).split(".")[-1] + ) + hiddens = s / d + hiddens = paddle.nn.functional.normalize(hiddens, p=2, axis=-1) + + return hiddens + + +class NVEncodeModel(MistralModel): + def __init__( + self, + config, + tokenizer_path, + query_instruction, + document_instruction, + eval_batch_size=999, + normalized=True, + negatives_cross_device=False, + temperature_=1, + margin=0.01, + use_inbatch_neg=True, + matryoshka_dims=None, + matryoshka_loss_weights=None, + ): + super().__init__(config) # get mistral model structure + + self.latent_model = LatentModel(config=config) # get latent model structure + + self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, padding_side="right") + if self.tokenizer.pad_token is None: + self.tokenizer.pad_token = self.tokenizer.eos_token + + self.query_instruction = query_instruction + self.document_instruction = document_instruction + + self.eval_batch_size = eval_batch_size + + self.normalized = normalized + self.negatives_cross_device = negatives_cross_device + if self.negatives_cross_device: + if not dist.is_initialized(): + raise ValueError("Distributed training has not been initialized for representation all gather.") + self.process_rank = dist.get_rank() + self.world_size = dist.get_world_size() + self.temperature = temperature_ + self.margin = margin + self.use_inbatch_neg = use_inbatch_neg + self.matryoshka_dims = matryoshka_dims + self.matryoshka_loss_weights = matryoshka_loss_weights + + self.cross_entropy = nn.CrossEntropyLoss(reduction="mean") + + def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): + + combined_attention_mask = _make_bidirection_mask( + input_shape, + inputs_embeds.dtype, + past_key_values_length=past_key_values_length, + ) + + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]) + combined_attention_mask = ( + expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask + ) + + return combined_attention_mask + + def get_model_config( + self, + ): + return self.model_config.to_dict() + + def encode(self, features, instruction_len): + last_hidden_states = self.m_forward(**features)[0] # get bs*len*4096 + pool_mask = features["attention_mask"] + pool_mask[:, :instruction_len] = 0 + embeddings = self.latent_model.forward(last_hidden_states, pool_mask) + embeddings = paddle.nn.functional.normalize(embeddings, p=2, axis=1) + return embeddings + + def compute_similarity(self, q_reps, p_reps): + # q_reps [batch_size, embedding_dim] + # p_reps [batch_size, embedding_dim] + return paddle.matmul(q_reps, p_reps.transpose([1, 0])) + + def hard_negative_loss(self, q_reps, p_reps): + scores = 
self.compute_similarity(q_reps, p_reps) + scores = scores / self.temperature + scores = scores.reshape([q_reps.shape[0], -1]) + + target = paddle.arange(scores.shape[0], dtype="int64") + target = target * (p_reps.shape[0] // q_reps.shape[0]) + loss = self.compute_loss(scores, target) + return scores, loss + + def in_batch_negative_loss(self, q_reps, p_reps): + # In batch negatives + scores = self.compute_similarity(q_reps, p_reps) + # Substract margin from all positive samples cosine_sim() + margin_diag = paddle.full(shape=[q_reps.shape[0]], fill_value=self.margin, dtype=q_reps.dtype) + scores = scores - paddle.diag(margin_diag) + # Scale cosine to ease training converge + scores = scores / self.temperature + target = paddle.arange(0, q_reps.shape[0], dtype="int64") + loss = self.compute_loss(scores, target) + return scores, loss + + def forward( + self, + query: Dict[str, paddle.Tensor] = None, + passage: Dict[str, paddle.Tensor] = None, + teacher_score: paddle.Tensor = None, + ): + instruction_len = len(self.tokenizer.encode(self.query_instruction, add_special_tokens=False)["input_ids"]) + q_reps = self.encode(query, instruction_len) + instruction_len = len(self.tokenizer.encode(self.document_instruction, add_special_tokens=False)["input_ids"]) + p_reps = self.encode(passage, instruction_len) + + # For non-matryoshka loss, we normalize the representations + if not self.matryoshka_dims: + if self.normalized: + q_reps = paddle.nn.functional.normalize(q_reps, axis=-1) + p_reps = paddle.nn.functional.normalize(p_reps, axis=-1) + + if self.training: + # Cross device negatives + if self.negatives_cross_device: + q_reps = self._dist_gather_tensor(q_reps) + p_reps = self._dist_gather_tensor(p_reps) + + if self.matryoshka_dims: + loss = 0.0 + scores = 0.0 + for loss_weight, dim in zip(self.matryoshka_loss_weights, self.matryoshka_dims): + reduced_q = q_reps[:, :dim] + reduced_d = p_reps[:, :dim] + if self.normalized: + reduced_q = paddle.nn.functional.normalize(reduced_q, axis=-1) + reduced_d = paddle.nn.functional.normalize(reduced_d, axis=-1) + + if self.use_inbatch_neg: + dim_score, dim_loss = self.in_batch_negative_loss(reduced_q, reduced_d) + else: + dim_score, dim_loss = self.hard_negative_loss(reduced_q, reduced_d) + scores += dim_score + loss += loss_weight * dim_loss + + elif self.use_inbatch_neg: + scores, loss = self.in_batch_negative_loss(q_reps, p_reps) + else: + scores, loss = self.hard_negative_loss(q_reps, p_reps) + + else: + scores = self.compute_similarity(q_reps, p_reps) + loss = None + return EncoderOutput( + loss=loss, + scores=scores, + q_reps=q_reps, + p_reps=p_reps, + ) + + def compute_loss(self, scores, target): + return self.cross_entropy(scores, target) + + def _dist_gather_tensor(self, t: Optional[paddle.Tensor]): + if t is None: + return None + + all_tensors = [paddle.empty_like(t) for _ in range(self.world_size)] + dist.all_gather(all_tensors, t) + + all_tensors[self.process_rank] = t + all_tensors = paddle.concat(all_tensors, axis=0) + + return all_tensors + + def save_pretrained(self, output_dir: str, **kwargs): + state_dict = self.model.state_dict() + state_dict = type(state_dict)({k: v.clone().cpu() for k, v in state_dict.items()}) + self.model.save_pretrained(output_dir, state_dict=state_dict) + + def m_forward( + self, + input_ids: paddle.Tensor = None, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + 
use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # retrieve input_ids and inputs_embeds + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time") + elif input_ids is not None: + batch_size, seq_length = input_ids.shape + elif inputs_embeds is not None: + batch_size, seq_length, _ = inputs_embeds.shape + else: + raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds") + + seq_length_with_past = seq_length + past_key_values_length = 0 + + if past_key_values is not None: + past_key_values_length = past_key_values[0][0].shape[2] + seq_length_with_past = seq_length_with_past + past_key_values_length + + if position_ids is None: + position_ids = paddle.arange( + past_key_values_length, seq_length + past_key_values_length, dtype=paddle.int64 + ) + position_ids = position_ids.unsqueeze(0).expand((batch_size, seq_length)) + else: + position_ids = position_ids.reshape([-1, seq_length]).astype("int64") + + if inputs_embeds is None: + inputs_embeds = self.embed_tokens(input_ids) + + attention_mask = self._prepare_decoder_attention_mask( + attention_mask, + (batch_size, seq_length), + inputs_embeds, + past_key_values_length, + ) + + hidden_states = inputs_embeds + + if self.enable_recompute and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False + + # decoder layers + all_hidden_states = () if output_hidden_states else None + all_self_attns = () if output_attentions else None + next_decoder_cache = () if use_cache else None + + for idx, decoder_layer in enumerate(self.layers): + if output_hidden_states: + all_hidden_states += (hidden_states,) + + past_key_value = past_key_values[idx] if past_key_values is not None else None + + has_gradient = not hidden_states.stop_gradient + if self.enable_recompute and has_gradient: + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, past_key_value, output_attentions) + + return custom_forward + + layer_outputs = recompute( + create_custom_forward(decoder_layer), + hidden_states, + attention_mask, + position_ids, + ) + else: + layer_outputs = decoder_layer( + hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=past_key_value, + output_attentions=output_attentions, + use_cache=use_cache, + ) + + hidden_states = layer_outputs[0] + + if use_cache: + next_decoder_cache += (layer_outputs[2 if output_attentions else 1],) + + if output_attentions: + all_self_attns += (layer_outputs[1],) + + hidden_states = self.norm(hidden_states) + + # add hidden states from the last decoder layer + if output_hidden_states: + all_hidden_states += (hidden_states,) + + next_cache = next_decoder_cache if use_cache else None + if not return_dict: + return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None) + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=next_cache, + hidden_states=all_hidden_states, + attentions=all_self_attns, + ) + + @paddle.no_grad() + def encode_sentences(self, sentences: List[str], instruction_len, **kwargs) -> np.ndarray: + all_embeddings = [] + for start_index in tqdm.tqdm(list(range(0, len(sentences), self.eval_batch_size)), desc="Batches"): + + sentences_batch = sentences[start_index : start_index + self.eval_batch_size] + inputs = self.tokenizer( + sentences_batch, + max_length=4096, + padding=True, + return_attention_mask=True, + return_token_type_ids=False, + return_tensors="pd", + truncation=True, + ) + last_hidden_states = self.m_forward(**inputs)[0] # get bs*len*4096 + pool_mask = inputs["attention_mask"] + pool_mask[:, :instruction_len] = 0 + + embeddings = self.latent_model.forward(last_hidden_states, pool_mask) + embeddings = paddle.nn.functional.normalize(embeddings, p=2, axis=1) + + all_embeddings.append(embeddings.cpu().numpy().astype("float32")) + + return np.concatenate(all_embeddings, axis=0) + + def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray: + input_texts = [self.query_instruction + q + self.tokenizer.eos_token for q in queries] + instruction_len = len(self.tokenizer.encode(self.query_instruction, add_special_tokens=False)["input_ids"]) + return self.encode_sentences(input_texts, instruction_len) + + def encode_corpus(self, corpus: List[Union[Dict[str, str], str]], **kwargs) -> np.ndarray: + if isinstance(corpus[0], dict): + input_texts = ["{} {}".format(doc.get("title", ""), doc["text"]).strip() for doc in corpus] + else: + input_texts = corpus + + input_texts = [self.document_instruction + doc + self.tokenizer.eos_token for doc in input_texts] + instruction_len = len(self.tokenizer.encode(self.document_instruction, add_special_tokens=False)["input_ids"]) + return self.encode_sentences(input_texts, instruction_len) diff --git 
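
Before the Qwen-side changes that follow, a quick reference for the retrieval loss wired into `NVEncodeModel` above: with in-batch negatives, every other passage in the batch acts as a negative, the diagonal of the similarity matrix holds the positives, and a small margin plus a temperature rescaling shape the softmax. A minimal, self-contained sketch under those assumptions (the function name and default values are illustrative, not part of the patch):

```python
import paddle
import paddle.nn as nn

def in_batch_negative_loss(q_reps, p_reps, margin=0.01, temperature=1.0):
    # q_reps, p_reps: L2-normalized [batch_size, dim] query / passage embeddings,
    # where p_reps[i] is the positive passage for q_reps[i].
    scores = paddle.matmul(q_reps, p_reps.transpose([1, 0]))  # cosine similarities
    # Subtract a margin from the positive (diagonal) scores only.
    scores = scores - paddle.diag(
        paddle.full([q_reps.shape[0]], margin, dtype=q_reps.dtype)
    )
    # Scale by temperature to sharpen the distribution and ease convergence.
    scores = scores / temperature
    # The positive for row i sits at column i.
    target = paddle.arange(0, q_reps.shape[0], dtype="int64")
    return nn.CrossEntropyLoss(reduction="mean")(scores, target)
```
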
a/paddlenlp/transformers/qwen/modeling.py b/paddlenlp/transformers/qwen/modeling.py index 2f465e9c3d8c..6f44737bc45a 100755 --- a/paddlenlp/transformers/qwen/modeling.py +++ b/paddlenlp/transformers/qwen/modeling.py @@ -59,7 +59,7 @@ def swiglu(x, y=None): from .. import linear_utils from ..linear_utils import Linear from ..model_outputs import ModelOutput -from ..utils import caculate_llm_flops +from ..utils import caculate_llm_per_token_flops from .configuration import QWenConfig try: @@ -281,7 +281,15 @@ def _attn(self, query, key, value, attention_mask=None): # [bz, sql, nh, hid] ==> [bz, nh, sql hdim] value = value.transpose([0, 2, 1, 3]) - attn_weights = paddle.matmul(query / math.sqrt(head_dim), key.transpose([0, 1, 3, 2])) + # Add pre divided factor to fix nan under float16. + if paddle.in_dynamic_mode() and query.dtype == paddle.float16: + pre_divided_factor = 32 + else: + pre_divided_factor = 1 + + attn_weights = paddle.matmul( + query / (math.sqrt(head_dim) * pre_divided_factor), key.transpose([0, 1, 3, 2]) + ) if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]: raise ValueError( @@ -292,7 +300,7 @@ def _attn(self, query, key, value, attention_mask=None): if attention_mask is None: attention_mask = get_triangle_upper_mask(attn_weights) attn_weights = attn_weights + attention_mask - attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(value.dtype) + attn_weights = F.softmax(attn_weights.astype("float32") * pre_divided_factor, axis=-1).astype(value.dtype) attn_weights = self.attn_dropout(attn_weights) attn_output = paddle.matmul(attn_weights, value) @@ -555,6 +563,37 @@ class QWenPretrainedModel(PretrainedModel): def __init__(self, *inputs, **kwargs): super().__init__(*inputs, **kwargs) + def _get_model_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=False, + ) + + def _get_hardware_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=self.config.recompute, + recompute_granularity=self.config.recompute_granularity, + ) + @classmethod def _get_tensor_parallel_mappings(cls, config, is_split=True): @@ -744,39 +783,6 @@ def __init__(self, config): ) self.ln_f = QWenRMSNorm(config) - def get_model_flops(self, batch_size=1, seq_length=None, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=False, - ) - - def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - 
intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=recompute, - recompute_granularity=self.config.recompute_granularity, - ) - def get_input_embeddings(self): return self.wte @@ -1167,26 +1173,38 @@ def forward( ) hidden_states = transformer_outputs[0] - lm_logits = self.lm_head(hidden_states) + if labels is not None and self.config.use_fused_linear_cross_entropy: + from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy + + assert ( + self.config.tensor_parallel_degree <= 1 + ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel " - loss = None - if labels is not None: - loss = self.criterion(lm_logits, labels) + masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels) - # lm_logits = self.lm_head(hidden_states) + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) + else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count + logits = None + else: + logits = self.lm_head(hidden_states) - # loss = None - # if labels is not None: - # loss_fct = nn.CrossEntropyLoss() - # loss = loss_fct(lm_logits, labels) + loss = None + if labels is not None: + loss = self.criterion(logits, labels) if not return_dict: - output = (lm_logits,) + transformer_outputs[1:] + output = (logits,) + transformer_outputs[1:] return ((loss,) + output) if loss is not None else output return CausalLMOutputWithPast( loss=loss, - logits=lm_logits, + logits=logits, past_key_values=transformer_outputs.past_key_values, hidden_states=transformer_outputs.hidden_states, attentions=transformer_outputs.attentions, diff --git a/paddlenlp/transformers/qwen/modeling_pp.py b/paddlenlp/transformers/qwen/modeling_pp.py index 0f3d285ce465..613f197f825b 100644 --- a/paddlenlp/transformers/qwen/modeling_pp.py +++ b/paddlenlp/transformers/qwen/modeling_pp.py @@ -143,6 +143,8 @@ class QWenForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): _get_tensor_parallel_mappings = QWenPretrainedModel._get_tensor_parallel_mappings _init_weights = QWenPretrainedModel._init_weights _keys_to_ignore_on_load_unexpected = QWenPretrainedModel._keys_to_ignore_on_load_unexpected + _get_model_flops = QWenPretrainedModel._get_model_flops + _get_hardware_flops = QWenPretrainedModel._get_hardware_flops # DONOT Add base_model_prefix !!!! diff --git a/paddlenlp/transformers/qwen2/modeling.py b/paddlenlp/transformers/qwen2/modeling.py index 4f6be646afc1..71a1d2abf321 100644 --- a/paddlenlp/transformers/qwen2/modeling.py +++ b/paddlenlp/transformers/qwen2/modeling.py @@ -59,7 +59,7 @@ TokenClassifierOutput, ) from ..model_utils import PretrainedModel, register_base_model -from ..utils import caculate_llm_flops, logger +from ..utils import caculate_llm_per_token_flops, logger from .configuration import Qwen2Config try: @@ -202,8 +202,15 @@ def scaled_dot_product_attention( key_states = paddle.transpose(key_states, [0, 2, 1, 3]) value_states = paddle.transpose(value_states, [0, 2, 1, 3]) - # matmul and divide by sqrt(head_dim) - attn_weights = paddle.matmul(query_states / math.sqrt(head_dim), key_states.transpose([0, 1, 3, 2])) + # Add pre divided factor to fix nan under float16. 
+ if paddle.in_dynamic_mode() and query_states.dtype == paddle.float16: + pre_divided_factor = 32 + else: + pre_divided_factor = 1 + + attn_weights = paddle.matmul( + query_states / (math.sqrt(head_dim) * pre_divided_factor), key_states.transpose([0, 1, 3, 2]) + ) if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]: raise ValueError( @@ -213,6 +220,7 @@ def scaled_dot_product_attention( if attention_mask is None: attention_mask = get_triangle_upper_mask(attn_weights) + attention_mask = attention_mask.reshape([bsz, 1, q_len, kv_seq_len]) if attention_mask.shape != [bsz, 1, q_len, kv_seq_len]: raise ValueError( @@ -220,11 +228,16 @@ def scaled_dot_product_attention( ) attn_weights = attn_weights + attention_mask + if not paddle.in_dynamic_mode(): - attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) + attn_weights = F.softmax(attn_weights * pre_divided_factor, axis=-1, dtype="float32").astype( + query_states.dtype + ) else: with paddle.amp.auto_cast(False): - attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) + attn_weights = F.softmax( + attn_weights.astype("float32") * pre_divided_factor, axis=-1, dtype="float32" + ).astype(query_states.dtype) attn_weights = F.dropout(attn_weights, p=config.attention_dropout, training=training) @@ -1013,6 +1026,37 @@ def _get_fuse_or_split_param_mappings(cls, config: Qwen2Config, is_fuse=False): final_actions[keys] = partial(fn, split_nums=2) return final_actions + def _get_model_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=False, + ) + + def _get_hardware_flops(self): + if hasattr(self.config, "seq_length"): + seq_length = self.config.seq_length + else: + seq_length = 2048 + + return caculate_llm_per_token_flops( + hidden_size=self.config.hidden_size, + intermediate_size=self.config.intermediate_size, + layer_num=self.config.num_hidden_layers, + vocab_size=self.config.vocab_size, + seq_length=seq_length, + recompute=self.config.recompute, + recompute_granularity=self.config.recompute_granularity, + ) + def _init_weights(self, layer): """Initialization hook""" if self.config.tensor_parallel_degree > 1: @@ -1113,39 +1157,6 @@ def __init__(self, config: Qwen2Config): ) self.norm = Qwen2RMSNorm(config) - def get_model_flops(self, batch_size=1, seq_length=None, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=False, - ) - - def get_hardware_flops(self, batch_size=1, seq_length=None, recompute=False, **kwargs): - if seq_length is None: - if hasattr(self.config, "seq_length"): - seq_length = self.config.seq_length - else: - seq_length = 2048 - - return caculate_llm_flops( - hidden_size=self.config.hidden_size, - intermediate_size=self.config.intermediate_size, - layer_num=self.config.num_hidden_layers, - vocab_size=self.config.vocab_size, - seq_length=seq_length, - recompute=recompute, - 
recompute_granularity=self.config.recompute_granularity, - ) - def get_input_embeddings(self): return self.embed_tokens @@ -1493,7 +1504,6 @@ def prepare_inputs_for_generation( ): batch_size, seq_length = input_ids.shape position_ids = kwargs.get("position_ids", paddle.arange(seq_length).expand((batch_size, seq_length))) - attention_mask = kwargs.get("attention_mask", None) if past_key_values: input_ids = input_ids[:, -1].unsqueeze(axis=-1) position_ids = position_ids[:, -1].unsqueeze(-1) @@ -1624,11 +1634,30 @@ def forward( # tensor_parallel_output is together with ParallelCrossEntropy tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1 - logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + if labels is not None and self.config.use_fused_linear_cross_entropy: + from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy - loss = None - if labels is not None: - loss = self.criterion(logits, labels) + assert ( + self.config.tensor_parallel_degree <= 1 + ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel " + + masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels) + + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) + else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count + logits = None + else: + logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + + loss = None + if labels is not None: + loss = self.criterion(logits, labels) if not return_dict: output = (logits,) + outputs[1:] diff --git a/paddlenlp/transformers/qwen2/modeling_pp.py b/paddlenlp/transformers/qwen2/modeling_pp.py index bab8c25e7965..a60a4db257ad 100644 --- a/paddlenlp/transformers/qwen2/modeling_pp.py +++ b/paddlenlp/transformers/qwen2/modeling_pp.py @@ -234,6 +234,9 @@ class Qwen2ForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): _get_tensor_parallel_mappings = Qwen2PretrainedModel._get_tensor_parallel_mappings _init_weights = Qwen2PretrainedModel._init_weights _keys_to_ignore_on_load_unexpected = Qwen2PretrainedModel._keys_to_ignore_on_load_unexpected + _get_model_flops = Qwen2PretrainedModel._get_model_flops + _get_hardware_flops = Qwen2PretrainedModel._get_hardware_flops + _tied_weights_keys = ["lm_head.weight"] # DONOT Add base_model_prefix !!!! diff --git a/paddlenlp/transformers/qwen2_moe/__init__.py b/paddlenlp/transformers/qwen2_moe/__init__.py index 2f2acfa9b339..d68171b98ec8 100644 --- a/paddlenlp/transformers/qwen2_moe/__init__.py +++ b/paddlenlp/transformers/qwen2_moe/__init__.py @@ -15,3 +15,4 @@ from ..qwen2.tokenizer import * from .configuration import * from .modeling import * +from .modeling_pp import * diff --git a/paddlenlp/transformers/qwen2_moe/modeling.py b/paddlenlp/transformers/qwen2_moe/modeling.py index 501e79673c8e..76d84d26f0cb 100644 --- a/paddlenlp/transformers/qwen2_moe/modeling.py +++ b/paddlenlp/transformers/qwen2_moe/modeling.py @@ -12,13 +12,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
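
A side note on the `pre_divided_factor` change repeated in the QWen and Qwen2 attention paths above: the query is divided by an extra constant before the float16 matmul so the raw logits stay well inside the float16 range, and the same constant is multiplied back inside a float32 softmax, which leaves the resulting probabilities mathematically unchanged. A minimal sketch under those assumptions (the helper name is illustrative):

```python
import math
import paddle
import paddle.nn.functional as F

def fp16_safe_attention_probs(query, key, pre_divided_factor=32):
    # query/key: [bsz, num_heads, seq_len, head_dim], possibly float16.
    head_dim = query.shape[-1]
    # Pre-divide so the float16 matmul cannot overflow to inf/nan.
    attn_weights = paddle.matmul(
        query / (math.sqrt(head_dim) * pre_divided_factor),
        key.transpose([0, 1, 3, 2]),
    )
    # Undo the extra division in float32 before the softmax, so the
    # probabilities match the unscaled computation.
    probs = F.softmax(attn_weights.astype("float32") * pre_divided_factor, axis=-1)
    return probs.astype(query.dtype)
```
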
-""" Paddle Qwen2Moe model.""" +"""Paddle Qwen2Moe model.""" + from __future__ import annotations import math import warnings from functools import partial -from typing import Optional, Tuple +from typing import List, Optional, Tuple, Union import paddle import paddle.distributed.fleet.meta_parallel as mpu @@ -28,14 +29,19 @@ from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker from paddle.distributed.fleet.utils import recompute -from ...utils.log import logger +from paddlenlp.utils.tools import get_env_device + from .. import linear_utils from ..activations import ACT2FN from ..conversion_utils import StateDictNameMapping, init_name_mappings +from ..linear_utils import Linear +from ..llama import fusion_ops +from ..llama.modeling import get_use_casual_mask from ..model_outputs import MoECausalLMOutputWithPast, MoEModelOutputWithPast from ..model_utils import PretrainedModel, register_base_model from ..moe_gate import PretrainedMoEGate from ..moe_layer import MoELayer +from ..utils import logger from .configuration import Qwen2MoeConfig try: @@ -219,8 +225,10 @@ def scaled_dot_product_attention( value_states, attention_mask, output_attentions, + attn_mask_startend_row_indices=None, training=True, sequence_parallel=False, + skip_recompute=False, ): bsz, q_len, num_heads, head_dim = query_states.shape _, kv_seq_len, _, _ = value_states.shape @@ -229,40 +237,25 @@ def scaled_dot_product_attention( # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim] # Torch Flash Attention input [ bz, nhead, seqlen, head_dim] - version = paddle.version.full_version - if version != "0.0.0" and version <= "2.5.2": - attn_output, attn_weights = flash_attention( - query_states, - key_states, - value_states, - causal=True, - return_softmax=output_attentions, - ) - else: - attn_output = F.scaled_dot_product_attention( - query_states, - key_states, - value_states, - attn_mask=attention_mask, - is_causal=attention_mask is None, - dropout_p=config.attention_dropout if training else 0.0, - training=training, - ) - attn_weights = None - - if sequence_parallel: - attn_output = attn_output.reshape([bsz * q_len, head_dim * num_heads]) - else: - attn_output = attn_output.reshape([bsz, q_len, head_dim * num_heads]) - return (attn_output, attn_weights) if output_attentions else attn_output + return fusion_ops.fusion_flash_attention( + query_states, + config, + key_states, + value_states, + attention_mask, + output_attentions, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + sequence_parallel=sequence_parallel, + skip_recompute=skip_recompute, + ) else: # [ bz, seqlen, nhead, head_dim] -> [bs, nhead, seq_len, head_dim] query_states = paddle.transpose(query_states, [0, 2, 1, 3]) - # merge with the next tranpose + # merge with the next transpose key_states = paddle.transpose(key_states, [0, 2, 1, 3]) value_states = paddle.transpose(value_states, [0, 2, 1, 3]) - # matmul and devide by sqrt(head_dim) + # matmul and divide by sqrt(head_dim) attn_weights = paddle.matmul(query_states / math.sqrt(head_dim), key_states.transpose([0, 1, 3, 2])) if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]: @@ -356,14 +349,15 @@ def __init__(self, config: Qwen2MoeConfig): mark_as_sequence_parallel_parameter(self.weight) def forward(self, hidden_states): + if self.config.use_fused_rms_norm: + return fusion_ops.fusion_rms_norm(hidden_states, self.weight, self.variance_epsilon, False) + if paddle.in_dynamic_mode(): with paddle.amp.auto_cast(False): - hidden_states = 
hidden_states.astype("float32") - variance = hidden_states.pow(2).mean(-1, keepdim=True) + variance = hidden_states.astype("float32").pow(2).mean(-1, keepdim=True) hidden_states = paddle.rsqrt(variance + self.variance_epsilon) * hidden_states else: - hidden_states = hidden_states.astype("float32") - variance = hidden_states.pow(2).mean(-1, keepdim=True) + variance = hidden_states.astype("float32").pow(2).mean(-1, keepdim=True) hidden_states = paddle.rsqrt(variance + self.variance_epsilon) * hidden_states if self.weight.dtype in [paddle.float16, paddle.bfloat16]: @@ -436,6 +430,8 @@ def __init__(self, config: Qwen2MoeConfig, is_shared=False): self.intermediate_size = ( config.moe_intermediate_size if not is_shared else config.shared_expert_intermediate_size ) + self.fuse_attention_ffn = config.fuse_attention_ffn + self.tensor_parallel_degree = config.tensor_parallel_degree if config.sequence_parallel: @@ -446,18 +442,26 @@ def __init__(self, config: Qwen2MoeConfig, is_shared=False): RowParallelLinear = linear_utils.RowParallelLinear if config.tensor_parallel_degree > 1: - self.gate_proj = ColumnParallelLinear( - self.hidden_size, - self.intermediate_size, - gather_output=False, - has_bias=False, - ) - self.up_proj = ColumnParallelLinear( - self.hidden_size, - self.intermediate_size, - gather_output=False, - has_bias=False, - ) + if self.fuse_attention_ffn: + self.gate_up_fused_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size * 2, + gather_output=False, + has_bias=False, + ) + else: + self.gate_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + gather_output=False, + has_bias=False, + ) + self.up_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + gather_output=False, + has_bias=False, + ) self.down_proj = RowParallelLinear( self.intermediate_size, self.hidden_size, @@ -465,14 +469,36 @@ def __init__(self, config: Qwen2MoeConfig, is_shared=False): has_bias=False, ) else: - self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=False) # w1 - self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=False) # w3 - self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias_attr=False) # w2 + if self.fuse_attention_ffn: + self.gate_up_fused_proj = Linear(self.hidden_size, self.intermediate_size * 2, bias_attr=False) + else: + self.gate_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False) # w1 + self.up_proj = Linear(self.hidden_size, self.intermediate_size, bias_attr=False) # w3 + self.down_proj = Linear(self.intermediate_size, self.hidden_size, bias_attr=False) # w2 - self.act_fn = ACT2FN[config.hidden_act] + if config.hidden_act == "silu": + self.act_fn = fusion_ops.swiglu + self.fuse_swiglu = True + else: + self.act_fn = ACT2FN[config.hidden_act] + self.fuse_swiglu = False def forward(self, x): - return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) + if self.fuse_attention_ffn: + x = self.gate_up_fused_proj(x) + if self.fuse_swiglu: + y = None + else: + x, y = x.chunk(2, axis=-1) + else: + x, y = self.gate_proj(x), self.up_proj(x) + + if self.fuse_swiglu: + x = self.act_fn(x, y) + else: + x = self.act_fn(x) * y + + return self.down_proj(x) def repeat_kv(hidden_states: paddle.Tensor, n_rep: int) -> paddle.Tensor: @@ -515,6 +541,8 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True): self.seq_length = config.seq_length self.sequence_parallel = config.sequence_parallel + self.fuse_attention_qkv = 
config.fuse_attention_qkv + # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True # Enable_recompute defaults to False and is controlled by Trainer self.enable_recompute = False @@ -533,7 +561,7 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True): self.use_fused_rope = config.use_fused_rope if self.use_fused_rope: - if "gpu" not in paddle.device.get_device() or fused_rotary_position_embedding is None: + if get_env_device() not in ["gpu", "xpu"] or fused_rotary_position_embedding is None: warnings.warn( "Enable fuse rope in the config, but fuse rope is not available. " "Will disable fuse rope. Try using latest gpu version of Paddle." @@ -548,19 +576,30 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True): RowParallelLinear = linear_utils.RowParallelLinear if config.tensor_parallel_degree > 1: - self.q_proj = ColumnParallelLinear(self.hidden_size, self.hidden_size, has_bias=True, gather_output=False) - self.k_proj = ColumnParallelLinear( - self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False - ) - self.v_proj = ColumnParallelLinear( - self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False - ) + if self.fuse_attention_qkv: + self.qkv_proj = ColumnParallelLinear( + self.hidden_size, + self.hidden_size + 2 * self.config.num_key_value_heads * self.head_dim, + has_bias=True, + gather_output=False, + ) + else: + self.q_proj = ColumnParallelLinear( + self.hidden_size, self.hidden_size, has_bias=True, gather_output=False + ) + self.k_proj = ColumnParallelLinear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False) # fmt:skip + self.v_proj = ColumnParallelLinear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, has_bias=True, gather_output=False) # fmt:skip self.o_proj = RowParallelLinear(self.hidden_size, self.hidden_size, has_bias=False, input_is_parallel=True) else: - self.q_proj = nn.Linear(self.hidden_size, self.hidden_size, bias_attr=True) - self.k_proj = nn.Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True) - self.v_proj = nn.Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True) - self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias_attr=False) + if self.fuse_attention_qkv: + self.qkv_proj = Linear( + self.hidden_size, self.hidden_size + 2 * self.config.num_key_value_heads * self.head_dim + ) + else: + self.q_proj = Linear(self.hidden_size, self.hidden_size, bias_attr=True) + self.k_proj = Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True) + self.v_proj = Linear(self.hidden_size, self.config.num_key_value_heads * self.head_dim, bias_attr=True) + self.o_proj = Linear(self.hidden_size, self.hidden_size, bias_attr=False) self.rotary_emb = Qwen2MoeRotaryEmbedding( self.head_dim, @@ -568,6 +607,8 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = True): base=self.rope_theta, ) + self.attn_func = scaled_dot_product_attention + def forward( self, hidden_states, @@ -576,26 +617,45 @@ def forward( attention_mask: Optional[paddle.Tensor] = None, output_attentions: bool = False, use_cache: bool = False, + attn_mask_startend_row_indices: Optional[paddle.Tensor] = None, **kwargs, ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]: """Input shape: Batch 
x Time x Channel""" # [bs, seq_len, num_head * head_dim] -> [seq_len / n, bs, num_head * head_dim] (n is model parallelism) - batch_size, seq_len, _ = hidden_states.shape - - query_states = self.q_proj(hidden_states) - key_states = self.k_proj(hidden_states) - value_states = self.v_proj(hidden_states) - - if self.sequence_parallel: - target_query_shape = [-1, self.seq_length, self.num_heads, self.head_dim] - target_key_value_shape = [-1, self.seq_length, self.num_key_value_heads, self.head_dim] + if self.fuse_attention_qkv: + mix_layer = self.qkv_proj(hidden_states) + if self.sequence_parallel: + target_shape = [ + -1, + self.seq_length, + self.num_key_value_heads, + (self.num_key_value_groups + 2) * self.head_dim, + ] + else: + target_shape = [0, 0, self.num_key_value_heads, (self.num_key_value_groups + 2) * self.head_dim] + mix_layer = paddle.reshape_(mix_layer, target_shape) + query_states, key_states, value_states = paddle.split( + mix_layer, + num_or_sections=[self.num_key_value_groups * self.head_dim, self.head_dim, self.head_dim], + axis=-1, + ) + if self.gqa_or_mqa: + query_states = paddle.reshape_(query_states, [0, 0, self.num_heads, self.head_dim]) else: - target_query_shape = [0, 0, self.num_heads, self.head_dim] - target_key_value_shape = [0, 0, self.num_key_value_heads, self.head_dim] - query_states = query_states.reshape(shape=target_query_shape) - key_states = key_states.reshape(shape=target_key_value_shape) - value_states = value_states.reshape(shape=target_key_value_shape) + query_states = self.q_proj(hidden_states) + key_states = self.k_proj(hidden_states) + value_states = self.v_proj(hidden_states) + + if self.sequence_parallel: + target_query_shape = [-1, self.seq_length, self.num_heads, self.head_dim] + target_key_value_shape = [-1, self.seq_length, self.num_key_value_heads, self.head_dim] + else: + target_query_shape = [0, 0, self.num_heads, self.head_dim] + target_key_value_shape = [0, 0, self.num_key_value_heads, self.head_dim] + query_states = query_states.reshape(shape=target_query_shape) + key_states = key_states.reshape(shape=target_key_value_shape) + value_states = value_states.reshape(shape=target_key_value_shape) kv_seq_len = key_states.shape[-3] @@ -626,8 +686,10 @@ def forward( # TODO(wj-Mcat): use broadcast strategy when n_kv_heads = 1 # repeat k/v heads if n_kv_heads < n_heads - key_states = repeat_kv(key_states, self.num_key_value_groups) - value_states = repeat_kv(value_states, self.num_key_value_groups) + paddle_version = float(paddle.__version__[:3]) + if not self.config.use_flash_attention or ((paddle_version != 0.0) and (paddle_version <= 2.6)): + key_states = repeat_kv(key_states, self.num_key_value_groups) + value_states = repeat_kv(value_states, self.num_key_value_groups) has_gradient = not (query_states.stop_gradient and key_states.stop_gradient and value_states.stop_gradient) if ( @@ -637,27 +699,29 @@ def forward( and self.recompute_granularity == "core_attn" ): outputs = recompute( - scaled_dot_product_attention, + self.attn_func, query_states, self.config, key_states, value_states, attention_mask, output_attentions, - self.training, - self.sequence_parallel, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + training=self.training, + sequence_parallel=self.sequence_parallel, use_reentrant=self.config.recompute_use_reentrant, ) else: - outputs = scaled_dot_product_attention( + outputs = self.attn_func( query_states, self.config, key_states, value_states, attention_mask, output_attentions, - self.training, - 
self.sequence_parallel, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + training=self.training, + sequence_parallel=self.sequence_parallel, ) if output_attentions: attn_output, attn_weights = outputs @@ -729,7 +793,7 @@ def __init__(self, config: Qwen2MoeConfig): config, moe_num_experts=config.num_experts, expert_class=Qwen2MoeMLP, - expert_kwargs=config, + expert_kwargs={"config": config}, gate=gate, capacity=2.0, ) @@ -776,12 +840,13 @@ def __init__(self, config: Qwen2MoeConfig, layerwise_recompute: bool = False): def forward( self, hidden_states: paddle.Tensor, - position_ids: Optional[Tuple[paddle.Tensor]] = None, + position_ids: Optional[paddle.Tensor] = None, attention_mask: Optional[paddle.Tensor] = None, output_attentions: Optional[bool] = False, output_router_logits: Optional[bool] = False, past_key_value: Optional[Tuple[paddle.Tensor]] = None, use_cache: Optional[bool] = False, + attn_mask_startend_row_indices: Optional[paddle.Tensor] = None, **kwargs, ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]: """ @@ -822,6 +887,7 @@ def forward( attention_mask, output_attentions, use_cache, + attn_mask_startend_row_indices, use_reentrant=self.config.recompute_use_reentrant, ) else: @@ -832,6 +898,7 @@ def forward( attention_mask, output_attentions, use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, ) if type(outputs) is tuple: @@ -999,6 +1066,66 @@ def get_tensor_parallel_split_mappings(num_layers, num_experts): return mappings + @classmethod + def _get_fuse_or_split_param_mappings(cls, config: Qwen2MoeConfig, is_fuse=False): + # return parameter fuse utils + from paddlenlp.transformers.conversion_utils import split_or_fuse_func + + fn = split_or_fuse_func(is_fuse=is_fuse) + + # last key is fused key, other keys are to be fused. 
+ fuse_qkv_keys = [ + ( + "layers.0.self_attn.q_proj.weight", + "layers.0.self_attn.k_proj.weight", + "layers.0.self_attn.v_proj.weight", + "layers.0.self_attn.qkv_proj.weight", + ), + ( + "layers.0.self_attn.q_proj.bias", + "layers.0.self_attn.k_proj.bias", + "layers.0.self_attn.v_proj.bias", + "layers.0.self_attn.qkv_proj.bias", + ), + ] + + fuse_gate_up_keys = ( + "layers.0.mlp.gate_proj.weight", + "layers.0.mlp.up_proj.weight", + "layers.0.mlp.gate_up_fused_proj.weight", + ) + num_heads = config.num_attention_heads + num_key_value_heads = getattr(config, "num_key_value_heads", num_heads) + fuse_attention_qkv = getattr(config, "fuse_attention_qkv", False) + fuse_attention_ffn = getattr(config, "fuse_attention_ffn", False) + + final_actions = {} + if is_fuse: + if fuse_attention_qkv: + for i in range(config.num_hidden_layers): + for fuse_keys in fuse_qkv_keys: + keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_keys]) + final_actions[keys] = partial( + fn, is_qkv=True, num_heads=num_heads, num_key_value_heads=num_key_value_heads + ) + if fuse_attention_ffn: + for i in range(config.num_hidden_layers): + keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_gate_up_keys]) + final_actions[keys] = fn + else: + if not fuse_attention_qkv: + for i in range(config.num_hidden_layers): + for fuse_keys in fuse_qkv_keys: + keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_keys]) + final_actions[keys] = partial( + fn, split_nums=3, is_qkv=True, num_heads=num_heads, num_key_value_heads=num_key_value_heads + ) + if not fuse_attention_ffn: + for i in range(config.num_hidden_layers): + keys = tuple([key.replace("layers.0.", f"layers.{i}.") for key in fuse_gate_up_keys]) + final_actions[keys] = partial(fn, split_nums=2) + return final_actions + def _init_weights(self, layer): """Initialization hook""" if self.config.tensor_parallel_degree > 1: @@ -1009,11 +1136,11 @@ def _init_weights(self, layer): nn.Linear, nn.Embedding, mpu.VocabParallelEmbedding, - mpu.ColumnParallelLinear, mpu.RowParallelLinear, - Qwen2MoeLMHead, - linear_utils.ColumnSequenceParallelLinear, + mpu.ColumnParallelLinear, linear_utils.RowSequenceParallelLinear, + linear_utils.ColumnSequenceParallelLinear, + Qwen2MoeLMHead, ), ): # In the dygraph mode, use the `set_value` to reset the parameter directly, @@ -1087,7 +1214,10 @@ def __init__(self, config: Qwen2MoeConfig): self.layers = nn.LayerList( [ - Qwen2MoeDecoderLayer(config, layerwise_recompute=layer_idx not in self.no_recompute_layers) + Qwen2MoeDecoderLayer( + config=config, + layerwise_recompute=layer_idx not in self.no_recompute_layers, + ) for layer_idx in range(config.num_hidden_layers) ] ) @@ -1138,6 +1268,7 @@ def recompute_training_full( output_router_logits: bool, past_key_value: Tensor, use_cache: bool, + attn_mask_startend_row_indices=None, ): def create_custom_forward(module): def custom_forward(*inputs): @@ -1154,6 +1285,7 @@ def custom_forward(*inputs): output_router_logits, past_key_value, use_cache, + attn_mask_startend_row_indices, use_reentrant=self.config.recompute_use_reentrant, ) @@ -1161,21 +1293,19 @@ def custom_forward(*inputs): def forward( self, - input_ids=None, - position_ids=None, - attention_mask=None, - inputs_embeds=None, - use_cache=None, - past_key_values=None, - output_attentions=False, - output_hidden_states=None, + input_ids: paddle.Tensor = None, + position_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = 
None, + use_cache: Optional[bool] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, output_router_logits: Optional[bool] = None, - return_dict=False, + return_dict: Optional[bool] = None, + attn_mask_startend_row_indices=None, **kwargs, - ): - if self.sequence_parallel and use_cache: - raise ValueError("We currently only support sequence parallel without cache.") - + ) -> Union[Tuple, MoEModelOutputWithPast]: output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_router_logits = ( @@ -1185,7 +1315,6 @@ def forward( output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states ) use_cache = use_cache if use_cache is not None else self.config.use_cache - return_dict = return_dict if return_dict is not None else self.config.use_return_dict # retrieve input_ids and inputs_embeds @@ -1209,6 +1338,7 @@ def forward( cache_length = past_key_values[0][0].shape[1] seq_length_with_past += cache_length if inputs_embeds is None: + # [bs, seq_len, dim] inputs_embeds = self.embed_tokens(input_ids) if self.sequence_parallel: @@ -1219,20 +1349,24 @@ def forward( inputs_embeds = ScatterOp.apply(inputs_embeds) # embed positions - if attention_mask is None: + if attn_mask_startend_row_indices is not None or get_use_casual_mask(): + attention_mask = None + else: # [bs, seq_len] - attention_mask = paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool) + attention_mask = ( + paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool) + if attention_mask is None + else attention_mask + ) + attention_mask = self._prepare_decoder_attention_mask( + attention_mask, (batch_size, seq_length), cache_length, inputs_embeds.dtype + ) # [bs, 1, seq_len, seq_len] + if self.config.use_flash_attention: + attention_mask = None if is_casual_mask(attention_mask) else attention_mask if position_ids is None: position_ids = paddle.arange(seq_length, dtype="int64").expand((batch_size, seq_length)) - attention_mask = self._prepare_decoder_attention_mask( - attention_mask, (batch_size, seq_length), cache_length, inputs_embeds.dtype - ) # [bs, 1, seq_len, seq_len] - if self.config.use_flash_attention: - is_casual = is_casual_mask(attention_mask) - if is_casual: - attention_mask = None hidden_states = inputs_embeds # decoder layers @@ -1262,6 +1396,7 @@ def forward( output_router_logits, past_key_value, use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, ) else: layer_outputs = decoder_layer( @@ -1272,6 +1407,7 @@ def forward( output_router_logits, past_key_value, use_cache, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, ) # NOTE: clear outdate cache after it has been used for memory saving @@ -1334,7 +1470,7 @@ def forward(self, prediction_scores, masked_lm_labels): if self.enable_parallel_cross_entropy: if prediction_scores.shape[-1] == self.config.vocab_size: warnings.warn( - f"enable_parallel_cross_entropy, the vocab_size should be splited: {prediction_scores.shape[-1]}, {self.config.vocab_size}" + f"enable_parallel_cross_entropy, the vocab_size should be splitted: {prediction_scores.shape[-1]}, {self.config.vocab_size}" ) self.loss_func = paddle.nn.CrossEntropyLoss(reduction="none", ignore_index=self.ignore_index) @@ -1342,8 +1478,16 @@ def forward(self, prediction_scores, masked_lm_labels): masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), 
masked_lm_labels.unsqueeze(2)) # skip ignore_index which loss == 0 - masked_lm_loss = masked_lm_loss[masked_lm_loss > 0] - loss = paddle.mean(masked_lm_loss) + # masked_lm_loss = masked_lm_loss[masked_lm_loss > 0] + # loss = paddle.mean(masked_lm_loss) + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) + else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count return loss @@ -1429,7 +1573,6 @@ def prepare_inputs_for_generation( ): batch_size, seq_length = input_ids.shape position_ids = kwargs.get("position_ids", paddle.arange(seq_length).expand((batch_size, seq_length))) - attention_mask = kwargs.get("attention_mask", None) if past_key_values: input_ids = input_ids[:, -1].unsqueeze(axis=-1) position_ids = position_ids[:, -1].unsqueeze(-1) @@ -1473,26 +1616,35 @@ def update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder model_kwargs["position_ids"] = paddle.concat([position_ids, position_ids[..., -1:] + 1], axis=-1) if not is_encoder_decoder and "attention_mask" in model_kwargs: + # TODO: support attention mask for other models attention_mask = model_kwargs["attention_mask"] - model_kwargs["attention_mask"] = paddle.concat( - [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype=attention_mask.dtype)], axis=-1 - ) + if len(attention_mask.shape) == 2: + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype=attention_mask.dtype)], + axis=-1, + ) + elif len(attention_mask.shape) == 4: + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([*attention_mask.shape[:3], 1], dtype=attention_mask.dtype)], + axis=-1, + )[:, :, -1:, :] return model_kwargs def forward( self, - input_ids=None, - position_ids=None, - attention_mask=None, - inputs_embeds=None, - labels=None, - use_cache=False, - past_key_values=None, - output_attentions=None, - output_hidden_states=None, + input_ids: paddle.Tensor = None, + position_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + use_cache: Optional[bool] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, output_router_logits: Optional[bool] = None, - return_dict=None, + return_dict: Optional[bool] = None, + attn_mask_startend_row_indices=None, ): output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( @@ -1503,6 +1655,13 @@ def forward( ) return_dict = return_dict if return_dict is not None else self.config.use_return_dict + if attn_mask_startend_row_indices is not None and attention_mask is not None: + logger.warning( + "You have provided both attn_mask_startend_row_indices and attention_mask. " + "The attn_mask_startend_row_indices will be used." 
+ ) + attention_mask = None + # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) outputs = self.qwen2_moe( input_ids=input_ids, # [bs, seq_len] @@ -1515,19 +1674,39 @@ def forward( output_hidden_states=output_hidden_states, output_router_logits=output_router_logits, return_dict=return_dict, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, ) hidden_states = outputs[0] # [bs, seq_len, dim] # if labels is None,means we need full output, instead of tensor_parallel_output - # tensor_parallel_output is togather with ParallelCrossEntropy + # tensor_parallel_output is together with ParallelCrossEntropy tensor_parallel_output = self.config.tensor_parallel_output and self.config.tensor_parallel_degree > 1 - logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + if labels is not None and self.config.use_fused_linear_cross_entropy: + from paddlenlp_kernel.triton.cut_cross_entropy import linear_cross_entropy + + assert ( + self.config.tensor_parallel_degree <= 1 + ), "The argument `use_fused_linear_cross_entropy` is imcompatiable with tensor parallel " + + masked_lm_loss = linear_cross_entropy(hidden_states, self.lm_head.weight, targets=labels) - loss = None - if labels is not None: - loss = self.criterion(logits, labels) + binary_sequence = paddle.where( + masked_lm_loss > 0, paddle.ones_like(masked_lm_loss), paddle.zeros_like(masked_lm_loss) + ) + count = paddle.sum(binary_sequence) + if count == 0: + loss = paddle.sum(masked_lm_loss * binary_sequence) + else: + loss = paddle.sum(masked_lm_loss * binary_sequence) / count + logits = None + else: + logits = self.lm_head(hidden_states, tensor_parallel_output=tensor_parallel_output) + + loss = None + if labels is not None: + loss = self.criterion(logits, labels) aux_loss = None if output_router_logits: diff --git a/paddlenlp/transformers/qwen2_moe/modeling_pp.py b/paddlenlp/transformers/qwen2_moe/modeling_pp.py new file mode 100644 index 000000000000..a4194a9d0c69 --- /dev/null +++ b/paddlenlp/transformers/qwen2_moe/modeling_pp.py @@ -0,0 +1,354 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
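
One more reference sketch for the loss reduction this patch swaps in at several call sites above (the QWen, Qwen2 and Qwen2Moe heads): instead of boolean-indexing the non-zero losses, it builds a 0/1 mask and divides by the valid-token count, which keeps tensor shapes static and avoids a 0/0 when every position is ignored. Assuming `masked_lm_loss` is a per-token loss where ignored positions are exactly zero (the function name is illustrative):

```python
import paddle

def mean_over_unmasked_tokens(masked_lm_loss):
    # 1 for tokens that contribute to the loss, 0 for ignored (zero-loss) tokens.
    binary_sequence = paddle.where(
        masked_lm_loss > 0,
        paddle.ones_like(masked_lm_loss),
        paddle.zeros_like(masked_lm_loss),
    )
    count = paddle.sum(binary_sequence)
    if count == 0:
        # All tokens ignored: the masked sum is 0, so no 0/0 division occurs.
        return paddle.sum(masked_lm_loss * binary_sequence)
    return paddle.sum(masked_lm_loss * binary_sequence) / count
```
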
+ + +from typing import OrderedDict + +import paddle +import paddle.distributed.fleet as fleet +import paddle.nn as nn +from paddle.distributed.fleet.meta_parallel import ( + LayerDesc, + PipelineLayer, + SharedLayerDesc, +) +from paddle.distributed.fleet.recompute.recompute import recompute + +from ...utils.tools import get_env_device +from ..model_utils import PipelinePretrainedModel +from .modeling import ( + Qwen2MoeConfig, + Qwen2MoeDecoderLayer, + Qwen2MoeLMHead, + Qwen2MoeModel, + Qwen2MoePretrainedModel, + Qwen2MoePretrainingCriterion, + Qwen2MoeRMSNorm, +) + +__all__ = [ + "Qwen2MoeForCausalLMPipe", +] + + +def parse_args(args): + if isinstance(args, tuple): + if len(args) == 4: + hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = args + elif len(args) == 3: + hidden_states, attention_mask, attn_mask_startend_row_indices = args + position_ids = None + elif len(args) == 2: + hidden_states, attention_mask = args + attn_mask_startend_row_indices, position_ids = None, None + else: + hidden_states = args + attention_mask, attn_mask_startend_row_indices, position_ids = None, None, None + + if position_ids is not None: + position_ids.stop_gradient = True + + if attention_mask is not None: + attention_mask.stop_gradient = True + + if attn_mask_startend_row_indices is not None: + attn_mask_startend_row_indices.stop_gradient = True + + return hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids + + +def return_args(hidden_states, attention_mask=None, attn_mask_startend_row_indices=None, position_ids=None): + ret = (hidden_states,) + + if attention_mask is not None: + ret += (attention_mask.clone(),) + if attn_mask_startend_row_indices is not None: + ret += (attn_mask_startend_row_indices.clone(),) + if position_ids is not None: + ret += (position_ids.clone(),) + if len(ret) == 1: + ret = ret[0] + + return ret + + +def get_attr(layer, name): + if getattr(layer, name, None) is not None: + return getattr(layer, name, None) + else: + return get_attr(layer._layer, name) + + +class Qwen2MoeEmbeddingPipe(nn.Layer): + """Extends QWenEmbeddings to forward attention_mask through the pipeline.""" + + def __init__(self, config: Qwen2MoeConfig): + super(Qwen2MoeEmbeddingPipe, self).__init__() + self.config = config + self.sequence_parallel = config.sequence_parallel + self.hidden_size = config.hidden_size + if config.tensor_parallel_degree > 1 and config.vocab_size % config.tensor_parallel_degree == 0: + self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding( + config.vocab_size, + config.hidden_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()), + ) + else: + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size) + + @property + def embedding_weight(self): + return get_attr(self.embed_tokens, "weight") + + def forward(self, args): + """_summary_ + + Args: + input (_type_): _description_ + + Returns: + _type_: _description_ + """ + input_ids, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args) + input_embeds = self.embed_tokens(input_ids) + if self.config.sequence_parallel: + from paddlenlp.transformers import ScatterOp + + # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim] + bs, seq_len, hidden_size = input_embeds.shape + input_embeds = paddle.reshape_(input_embeds, [bs * seq_len, hidden_size]) + # [seq_len * bs / n, num_head * head_dim] (n is mp parallelism) + input_embeds = ScatterOp.apply(input_embeds) + + batch_size, seq_length = 
input_ids.shape + + if attention_mask is not None: + assert ( + attn_mask_startend_row_indices is None + ), "attention_mask and attn_mask_startend_row_indices can not be set at same time" + + attention_mask = Qwen2MoeModel._prepare_decoder_attention_mask( + attention_mask, (batch_size, seq_length), 0, input_embeds.dtype + ) + attention_mask.stop_gradient = True + if get_env_device() == "npu": + attention_mask = attention_mask.astype("bool") + elif get_env_device() == "npu": + attention_mask = paddle.tril(paddle.ones((seq_length, seq_length), dtype="bool")) + attention_mask.stop_gradient = True + + return return_args(input_embeds, attention_mask, attn_mask_startend_row_indices, position_ids) + + +class Qwen2MoeDecoderLayerPipe(Qwen2MoeDecoderLayer): + def forward(self, args): + hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args) + + has_gradient = not hidden_states.stop_gradient + + if attention_mask is not None and attention_mask.dtype == paddle.int32: + attention_mask, attn_mask_startend_row_indices, position_ids = ( + None, + attention_mask, + attn_mask_startend_row_indices, + ) + elif attention_mask is not None and attention_mask.dtype == paddle.int64: + attention_mask, attn_mask_startend_row_indices, position_ids = None, None, attention_mask + elif attn_mask_startend_row_indices is not None and attn_mask_startend_row_indices.dtype == paddle.int64: + attn_mask_startend_row_indices, position_ids = None, attn_mask_startend_row_indices + + if self.enable_recompute and self.config.recompute_granularity == "full" and has_gradient: + if attention_mask is not None or attn_mask_startend_row_indices is not None: + hidden_states = recompute( + super().forward, + hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + use_reentrant=False, + ) + else: + # for pretrain + hidden_states = recompute( + super().forward, + hidden_states, + position_ids=position_ids, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + use_reentrant=self.config.recompute_use_reentrant, + ) + else: + hidden_states = super().forward( + hidden_states, + position_ids=position_ids, + attention_mask=attention_mask, + attn_mask_startend_row_indices=attn_mask_startend_row_indices, + ) + + return return_args(hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids) + + +class Qwen2MoeRMSNormPipe(nn.Layer): + def __init__(self, config): + super().__init__() + self.norm = Qwen2MoeRMSNorm(config) + + def forward(self, args): + hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args) + return self.norm(hidden_states) + + +class Qwen2MoeLMHeadPipe(Qwen2MoeLMHead): + def __init__(self, config, transpose_y=False): + super(Qwen2MoeLMHeadPipe, self).__init__(config) + + @property + def embedding_weight(self): + return get_attr(self, "weight") + + +class Qwen2MoeForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): + """QWenForPretraining adapted for pipeline parallelism. + + The largest change is flattening the QWenModel class so we can express it as a + sequence of layers including embedding, transformer layers, and output. 
+ """ + + config_class = Qwen2MoeConfig + + _get_tensor_parallel_mappings = Qwen2MoePretrainedModel._get_tensor_parallel_mappings + _init_weights = Qwen2MoePretrainedModel._init_weights + _keys_to_ignore_on_load_unexpected = Qwen2MoePretrainedModel._keys_to_ignore_on_load_unexpected + _tied_weights_keys = ["lm_head.weight"] + + # DONOT Add base_model_prefix !!!! + + @classmethod + def _prepare_pipeline_inputs_func(cls, inputs): + first_stage_keys = ["input_ids", "attention_mask", "attn_mask_startend_row_indices", "position_ids"] + last_stage_keys = ["labels"] + + def get_expected_keys(inputs, keys): + ret = tuple([inputs.pop(k) if k in inputs else None for k in keys]) + if len(ret) == 1: + ret = ret[0] + return ret + + if type(inputs) is dict or type(inputs) is OrderedDict: + return [ + get_expected_keys(inputs, first_stage_keys), + get_expected_keys(inputs, last_stage_keys), + ] + + keys = list(inputs[0].keys()) + inputs_batch = {key: [data.pop(key) for data in inputs] for key in keys} + return [ + get_expected_keys(inputs_batch, first_stage_keys), + get_expected_keys(inputs_batch, last_stage_keys), + ] + + def __init__(self, config: Qwen2MoeConfig): + self.config = config + + # Note that we will actually perform a recompute only if both enable_recompute and layerwise_recompute are set to True + # Enable_recompute defaults to False and is controlled by Trainer + self.enable_recompute = False + self.recompute_granularity = self.config.recompute_granularity + self.pp_recompute_interval = self.config.pp_recompute_interval + self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else [] + if self.recompute_granularity == "full": + assert len(self.no_recompute_layers) == 0, "for pp with full recompute, no_recompute_layers is not support" + + virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1) + + def get_hcg(): + return fleet.get_hybrid_communicate_group() + + hcg = get_hcg() + tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1) + tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0) + + # TODO: fix tensor_parallel_degree rewrite in here + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + if config.tie_word_embeddings: + self.add_sequential_layer( + SharedLayerDesc( + "qwen2moe_shared_weight", + Qwen2MoeEmbeddingPipe, + shared_weight_attr="embedding_weight", + config=config, + ), + "qwen2_moe", + ) + else: + self.add_sequential_layer(LayerDesc(Qwen2MoeEmbeddingPipe, config=config), "qwen2_moe") + + for i in range(config.num_hidden_layers): + self.add_sequential_layer( + LayerDesc( + Qwen2MoeDecoderLayerPipe, + config=config, + layerwise_recompute=i not in self.no_recompute_layers, + ), + f"qwen2_moe.layers.{i}", + ) + self.add_sequential_layer(LayerDesc(Qwen2MoeRMSNormPipe, config=config), "qwen2_moe") + + if config.tie_word_embeddings: + self.add_sequential_layer( + SharedLayerDesc( + "qwen2moe_shared_weight", + Qwen2MoeLMHeadPipe, + shared_weight_attr="embedding_weight", + config=config, + **{"transpose_y": True}, + ), + "lm_head", + ) + else: + self.add_sequential_layer(LayerDesc(Qwen2MoeLMHeadPipe, config=config), "lm_head") + + recompute_interval = 0 + if self.enable_recompute and self.recompute_granularity == "full": + assert self.config.pp_recompute_interval <= config.num_hidden_layers // ( + virtual_pp_degree * get_hcg().topology().get_dim_size("pipe") + ), "pp recompute interval should smaller than num layers of each pp chunk" + recompute_interval 
= self.config.pp_recompute_interval + + seg_method = "layer:Qwen2MoeDecoderLayer" + if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0: + seg_method = "uniform" + + PipelineLayer.__init__( + self, + layers=self.get_sequential_layers(), + loss_fn=self.get_loss_fn(config), + topology=get_hcg().topology(), + seg_method=seg_method, + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": get_hcg().get_model_parallel_group(), + "offload": False, + "partition": False, + }, + num_virtual_pipeline_stages=virtual_pp_degree, + ) + # You should call init here, since there is a diamond inheritance problem + self.apply(self._init_weights) + # DON'T init PipelinePretrainedModel + # PipelinePretrainedModel.__init__(self.super(), config=config) + + def get_loss_fn(self, config): + return Qwen2MoePretrainingCriterion(config) diff --git a/paddlenlp/transformers/roberta/tokenizer.py b/paddlenlp/transformers/roberta/tokenizer.py index 4ef53d5c6fa8..c399630f01cf 100644 --- a/paddlenlp/transformers/roberta/tokenizer.py +++ b/paddlenlp/transformers/roberta/tokenizer.py @@ -21,14 +21,9 @@ from paddlenlp.utils.download import resolve_file_path -from .. import ( - AddedToken, - BasicTokenizer, - GPTTokenizer, - PretrainedTokenizer, - WordpieceTokenizer, -) -from ..gpt.tokenizer import bytes_to_unicode +from ..bert.tokenizer import BasicTokenizer, WordpieceTokenizer +from ..gpt.tokenizer import GPTTokenizer, bytes_to_unicode +from ..tokenizer_utils import AddedToken, PretrainedTokenizer __all__ = ["RobertaTokenizer", "RobertaChineseTokenizer", "RobertaBPETokenizer"] diff --git a/paddlenlp/transformers/semantic_search/modeling.py b/paddlenlp/transformers/semantic_search/modeling.py index c16808e21770..0ba34bd94641 100644 --- a/paddlenlp/transformers/semantic_search/modeling.py +++ b/paddlenlp/transformers/semantic_search/modeling.py @@ -282,8 +282,9 @@ def matching_v2(self, input_ids, token_type_ids=None, position_ids=None, attenti input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask ) pooled_output = self.ernie.dropout(sequence_output[:, 0]) - probs = self.ernie.classifier(pooled_output) - return probs + cls_embedding = self.ernie.classifier(pooled_output) + probs = F.softmax(cls_embedding, axis=1) + return probs[:, 1] def matching_v3(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): """Use the pooled_output as the feature for listwise prediction, eg. ERNIE-Search""" diff --git a/paddlenlp/transformers/tokenizer_utils.py b/paddlenlp/transformers/tokenizer_utils.py index cd9c16e2f280..bdccbeae8529 100644 --- a/paddlenlp/transformers/tokenizer_utils.py +++ b/paddlenlp/transformers/tokenizer_utils.py @@ -788,7 +788,9 @@ def _encode_chat_inputs( ans.append(ans_roundi) non_learnable_parts = self._extract_non_learnable_parts(origin_msg, ans) - assert len(non_learnable_parts) == len(ans) + assert len(non_learnable_parts) == len( + ans + ), f"Get non_learnable_parts len: {len(non_learnable_parts)}, but ans len: {len(ans)}." conversation_ids = [] for i in range(len(non_learnable_parts)): @@ -1879,33 +1881,6 @@ def _decode( else: return text - def decode_token( - self, - all_input_ids: List[int], - prefix_offset: int = 0, - read_offset: int = 0, - ) -> Tuple[str, int, int]: - """tokenizer decoding for the streaming generation use case. 
This method can be overrided for tokenizer that doesn't follow this API""" - # The prefix text is necessary only to defeat cleanup algorithms in the decode - # which decide to add a space or not depending on the surrounding ids. - prefix_text = self.decode( - all_input_ids[prefix_offset:read_offset], skip_special_tokens=False, clean_up_tokenization_spaces=False - ) - new_text = self.decode( - all_input_ids[prefix_offset:], skip_special_tokens=False, clean_up_tokenization_spaces=False - ) - - if len(new_text) > len(prefix_text) and not prefix_text.endswith("�") and not new_text.endswith("�"): - # utf-8 char at the end means it's a potential unfinished byte sequence - # from byte fallback tokenization. - # If it's in the middle, it's probably a real invalid id generated - # by the model - prefix_index = new_text.index(prefix_text) - new_text = new_text[prefix_index + len(prefix_text) :] - return new_text, read_offset, len(all_input_ids) - else: - return "", prefix_offset, read_offset - class BPETokenizer(PretrainedTokenizer): """ diff --git a/paddlenlp/transformers/tokenizer_utils_base.py b/paddlenlp/transformers/tokenizer_utils_base.py index cf1d3391b5c1..79f1d490988c 100644 --- a/paddlenlp/transformers/tokenizer_utils_base.py +++ b/paddlenlp/transformers/tokenizer_utils_base.py @@ -967,6 +967,11 @@ def add_tokens( return self._add_tokens(new_tokens, special_tokens=special_tokens) + @classmethod + def _add_extra_special_tokens(cls, extra_sp_token: Union[str, AddedToken]): + if extra_sp_token not in cls.SPECIAL_TOKENS_ATTRIBUTES: + cls.SPECIAL_TOKENS_ATTRIBUTES.append(extra_sp_token) + def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int: raise NotImplementedError @@ -1213,7 +1218,13 @@ def special_tokens_map(self) -> Dict[str, Union[str, List[str]]]: """ set_attr = {} for attr in self.SPECIAL_TOKENS_ATTRIBUTES: - attr_value = getattr(self, "_" + attr) + try: + attr_value = getattr(self, "_" + attr) + except: + try: + attr_value = getattr(self, attr) + except: + continue if attr_value: set_attr[attr] = ( type(attr_value)(str(attr_value_sub) for attr_value_sub in attr_value) @@ -1233,7 +1244,13 @@ def special_tokens_map_extended(self) -> Dict[str, Union[str, AddedToken, List[U """ set_attr = {} for attr in self.SPECIAL_TOKENS_ATTRIBUTES: - attr_value = getattr(self, "_" + attr) + try: + attr_value = getattr(self, "_" + attr) + except: + try: + attr_value = getattr(self, attr) + except: + continue if attr_value: set_attr[attr] = attr_value return set_attr @@ -1744,6 +1761,7 @@ def convert_added_tokens(obj): elif isinstance(value, list): value = [AddedToken(**token) if isinstance(token, dict) else token for token in value] setattr(tokenizer, key, value) + cls._add_extra_special_tokens(key) # Add supplementary tokens. 
special_tokens = tokenizer.all_special_tokens @@ -1858,8 +1876,8 @@ def convert_added_tokens(obj: Union[AddedToken, Any], add_type_field=True): # Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained tokenizer_class = self.__class__.__name__ # Remove the Fast at the end unless we have a special `PreTrainedTokenizerFast` - if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast": - tokenizer_class = tokenizer_class[:-4] + # if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast": + # tokenizer_class = tokenizer_class[:-4] tokenizer_config["tokenizer_class"] = tokenizer_class with io.open(tokenizer_config_file, "w", encoding="utf-8") as f: @@ -3426,6 +3444,33 @@ def convert_tokens_to_string(self, tokens: List[str]) -> str: """ raise NotImplementedError + def decode_token( + self, + all_input_ids: List[int], + prefix_offset: int = 0, + read_offset: int = 0, + ) -> Tuple[str, int, int]: + """tokenizer decoding for the streaming generation use case. This method can be overrided for tokenizer that doesn't follow this API""" + # The prefix text is necessary only to defeat cleanup algorithms in the decode + # which decide to add a space or not depending on the surrounding ids. + prefix_text = self.decode( + all_input_ids[prefix_offset:read_offset], skip_special_tokens=False, clean_up_tokenization_spaces=False + ) + new_text = self.decode( + all_input_ids[prefix_offset:], skip_special_tokens=False, clean_up_tokenization_spaces=False + ) + + if len(new_text) > len(prefix_text) and not prefix_text.endswith("�") and not new_text.endswith("�"): + # utf-8 char at the end means it's a potential unfinished byte sequence + # from byte fallback tokenization. + # If it's in the middle, it's probably a real invalid id generated + # by the model + prefix_index = new_text.index(prefix_text) + new_text = new_text[prefix_index + len(prefix_text) :] + return new_text, read_offset, len(all_input_ids) + else: + return "", prefix_offset, read_offset + def batch_decode( self, sequences: Union[List[int], List[List[int]], "np.ndarray", "paddle.Tensor"], diff --git a/paddlenlp/transformers/utils.py b/paddlenlp/transformers/utils.py index cac920da240b..7970ba752d67 100644 --- a/paddlenlp/transformers/utils.py +++ b/paddlenlp/transformers/utils.py @@ -962,12 +962,11 @@ def __repr__(self): return msg -def caculate_llm_flops( +def caculate_llm_per_token_flops( hidden_size, intermediate_size, layer_num, vocab_size, - batch_size=1, seq_length=None, recompute=False, recompute_granularity=None, @@ -1002,4 +1001,4 @@ def caculate_llm_flops( # 2 for mul + add in matmul # 1 for forward, 2 for backwards since we caluate gradients for input_x and input_y - return 2 * batch_size * (layer_num * (flops_per_transformer * 3 + flops_recompute_transformer) + 3 * flops_loggits) + return 2 * (layer_num * (flops_per_transformer * 3 + flops_recompute_transformer) + 3 * flops_loggits) / seq_length diff --git a/paddlenlp/transformers/xlm_roberta/__init__.py b/paddlenlp/transformers/xlm_roberta/__init__.py new file mode 100644 index 000000000000..4c08fc6b9a63 --- /dev/null +++ b/paddlenlp/transformers/xlm_roberta/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .configuration import * +from .modeling import * +from .tokenizer import * diff --git a/paddlenlp/transformers/xlm_roberta/configuration.py b/paddlenlp/transformers/xlm_roberta/configuration.py new file mode 100644 index 000000000000..dcbf46079bac --- /dev/null +++ b/paddlenlp/transformers/xlm_roberta/configuration.py @@ -0,0 +1,160 @@ +# coding=utf-8 +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" XLM-RoBERTa configuration""" + +from ..model_utils import PretrainedConfig + +__all__ = ["PRETRAINED_INIT_CONFIGURATION", "XLMRobertaConfig"] + +PRETRAINED_INIT_CONFIGURATION = { + "hf-internal-testing/tiny-random-onnx-xlm-roberta": { + "attention_probs_dropout_prob": 0.1, + "bos_token_id": 0, + "classifier_dropout": None, + "eos_token_id": 2, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 4, + "initializer_range": 0.02, + "intermediate_size": 37, + "layer_norm_eps": 1e-05, + "max_position_embeddings": 514, + "model_type": "xlm-roberta", + "num_attention_heads": 4, + "num_hidden_layers": 5, + "output_past": True, + "pad_token_id": 1, + "position_embedding_type": "absolute", + "dtype": "float32", + "type_vocab_size": 1, + "use_cache": True, + "vocab_size": 250002, + }, +} + + +class XLMRobertaConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`XLMRobertaModel`] or a [`TFXLMRobertaModel`]. It + is used to instantiate a XLM-RoBERTa model according to the specified arguments, defining the model architecture. + Instantiating a configuration with the defaults will yield a similar configuration to that of the XLMRoBERTa + [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + + Args: + vocab_size (`int`, *optional*, defaults to 30522): + Vocabulary size of the XLM-RoBERTa model. Defines the number of different tokens that can be represented by + the `inputs_ids` passed when calling [`XLMRobertaModel`] or [`TFXLMRobertaModel`]. + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. 
+ num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. + hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"silu"` and `"gelu_new"` are supported. + hidden_dropout_prob (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): + The dropout ratio for the attention probabilities. + max_position_embeddings (`int`, *optional*, defaults to 512): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + type_vocab_size (`int`, *optional*, defaults to 2): + The vocabulary size of the `token_type_ids` passed when calling [`XLMRobertaModel`] or + [`TFXLMRobertaModel`]. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (`float`, *optional*, defaults to 1e-12): + The epsilon used by the layer normalization layers. + position_embedding_type (`str`, *optional*, defaults to `"absolute"`): + Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For + positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to + [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). + For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models + with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658). + is_decoder (`bool`, *optional*, defaults to `False`): + Whether the model is used as a decoder or not. If `False`, the model is used as an encoder. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if `config.is_decoder=True`. + classifier_dropout (`float`, *optional*): + The dropout ratio for the classification head. 
+ + Examples: + + ```python + >>> from paddlenlp.transformers import XLMRobertaConfig, XLMRobertaModel + + >>> # Initializing a XLM-RoBERTa xlm-roberta-base style configuration + >>> configuration = XLMRobertaConfig() + + >>> # Initializing a model (with random weights) from the xlm-roberta-base style configuration + >>> model = XLMRobertaModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "xlm-roberta" + + def __init__( + self, + vocab_size=30522, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=1, + bos_token_id=0, + eos_token_id=2, + position_embedding_type="absolute", + use_cache=True, + classifier_dropout=None, + **kwargs, + ): + kwargs["return_dict"] = kwargs.pop("return_dict", False) + super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) + + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.position_embedding_type = position_embedding_type + self.use_cache = use_cache + self.classifier_dropout = classifier_dropout diff --git a/paddlenlp/transformers/xlm_roberta/modeling.py b/paddlenlp/transformers/xlm_roberta/modeling.py new file mode 100644 index 000000000000..31feb37785ef --- /dev/null +++ b/paddlenlp/transformers/xlm_roberta/modeling.py @@ -0,0 +1,1618 @@ +# coding=utf-8 +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Paddle XLM-RoBERTa model.""" + +import math +from typing import List, Optional, Tuple, Union + +import paddle +from paddle import nn +from paddle.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss + +from ...utils import logger +from ...utils.converter import StateDictNameMapping +from ..activations import ACT2FN +from ..model_outputs import ( + BaseModelOutputWithPastAndCrossAttentions, + BaseModelOutputWithPoolingAndCrossAttentions, + CausalLMOutputWithCrossAttentions, + MaskedLMOutput, + MultipleChoiceModelOutput, + QuestionAnsweringModelOutput, + SequenceClassifierOutput, + TokenClassifierOutput, +) +from ..model_utils import ( + PretrainedModel, + apply_chunking_to_forward, + register_base_model, +) +from .configuration import PRETRAINED_INIT_CONFIGURATION, XLMRobertaConfig + +__all__ = [ + "XLMRobertaModel", + "XLMRobertaPretrainedModel", + "XLMRobertaForSequenceClassification", + "XLMRobertaForTokenClassification", + "XLMRobertaForQuestionAnswering", + "XLMRobertaForMaskedLM", + "XLMRobertaForMultipleChoice", + "XLMRobertaForCausalLM", +] + + +class XLMRobertaEmbeddings(nn.Layer): + """ + Same as BertEmbeddings with a tiny tweak for positional embeddings indexing. + """ + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding( + config.vocab_size, config.hidden_size + ) # padding_idx=config.pad_token_id NOTE, donot set padding_idx + self.word_embeddings.padding_idx = config.pad_token_id + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + self.register_buffer( + "position_ids", + paddle.arange(config.max_position_embeddings, dtype=paddle.int64).expand((1, -1)), + persistable=False, + ) + self.register_buffer( + "token_type_ids", paddle.zeros(self.position_ids.shape, dtype=paddle.int64), persistable=False + ) + + # End copy + self.padding_idx = config.pad_token_id + self.position_embeddings = nn.Embedding( + config.max_position_embeddings, + config.hidden_size, # padding_idx=self.padding_idx + ) + self.position_embeddings.padding_idx = config.pad_token_id + + def forward( + self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0 + ): + if position_ids is None: + if input_ids is not None: + # Create the position ids from the input token ids. Any padded tokens remain padded. 
+ position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx, past_key_values_length) + else: + position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds) + + if input_ids is not None: + input_shape = input_ids.shape + else: + input_shape = inputs_embeds.shape[:-1] + + seq_length = input_shape[1] + + # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs + # when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves + # issue #5664 + if token_type_ids is None: + if hasattr(self, "token_type_ids"): + buffered_token_type_ids = self.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand([input_shape[0], seq_length]) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = paddle.zeros(input_shape, dtype=paddle.int64) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = inputs_embeds + token_type_embeddings + if self.position_embedding_type == "absolute": + position_embeddings = self.position_embeddings(position_ids) + embeddings += position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + def create_position_ids_from_inputs_embeds(self, inputs_embeds): + """ + We are provided embeddings directly. We cannot infer which are padded so just generate sequential position ids. + + Args: + inputs_embeds: paddle.Tensor + + Returns: paddle.Tensor + """ + input_shape = inputs_embeds.shape[:-1] + sequence_length = input_shape[1] + + position_ids = paddle.arange( + self.padding_idx + 1, + sequence_length + self.padding_idx + 1, + dtype=paddle.int64, + ) + return position_ids.unsqueeze(0).expand(input_shape) + + +class XLMRobertaSelfAttention(nn.Layer): + def __init__(self, config, position_embedding_type=None): + super().__init__() + if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"): + raise ValueError( + f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention " + f"heads ({config.num_attention_heads})" + ) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.position_embedding_type = position_embedding_type or getattr( + config, "position_embedding_type", "absolute" + ) + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + self.max_position_embeddings = config.max_position_embeddings + self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size) + + self.is_decoder = config.is_decoder + self.scale = math.sqrt(self.attention_head_size) + + def transpose_for_scores(self, x: paddle.Tensor) -> paddle.Tensor: + new_x_shape = x.shape[:-1] + [self.num_attention_heads, self.attention_head_size] + x = x.reshape(new_x_shape) + return x.transpose([0, 2, 1, 3]) + + def forward( + self, + hidden_states: paddle.Tensor, + 
attention_mask: Optional[paddle.Tensor] = None, + encoder_hidden_states: Optional[paddle.Tensor] = None, + encoder_attention_mask: Optional[paddle.Tensor] = None, + past_key_value: Optional[Tuple[Tuple[paddle.Tensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + mixed_query_layer = self.query(hidden_states) + + # If this is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. + is_cross_attention = encoder_hidden_states is not None + + if is_cross_attention and past_key_value is not None: + # reuse k,v, cross_attentions + key_layer = past_key_value[0] + value_layer = past_key_value[1] + attention_mask = encoder_attention_mask + elif is_cross_attention: + key_layer = self.transpose_for_scores(self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores(self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + elif past_key_value is not None: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + key_layer = paddle.concat([past_key_value[0], key_layer], axis=2) + value_layer = paddle.concat([past_key_value[1], value_layer], axis=2) + else: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + query_layer = self.transpose_for_scores(mixed_query_layer) + + use_cache = past_key_value is not None + if self.is_decoder: + # if cross_attention save Tuple(paddle.Tensor, paddle.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(paddle.Tensor, paddle.Tensor) of + # all previous decoder key/value_states. Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. 
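+        # attention_scores has shape [batch_size, num_attention_heads, query_length, key_length]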
+ attention_scores = paddle.matmul(query_layer, key_layer, transpose_y=True) + + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + query_length, key_length = query_layer.shape[2], key_layer.shape[2] + if use_cache: + position_ids_l = paddle.to_tensor( + key_length - 1, + dtype=paddle.int64, + ).reshape([-1, 1]) + else: + position_ids_l = paddle.arange( + query_length, + dtype=paddle.int64, + ).reshape([-1, 1]) + position_ids_r = paddle.arange( + key_length, + dtype=paddle.int64, + ).reshape([1, -1]) + distance = position_ids_l - position_ids_r + + positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1) + positional_embedding = positional_embedding.cast(dtype=query_layer.dtype) # fp16 compatibility + + if self.position_embedding_type == "relative_key": + relative_position_scores = paddle.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores + elif self.position_embedding_type == "relative_key_query": + relative_position_scores_query = paddle.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + relative_position_scores_key = paddle.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key + + attention_scores = attention_scores / self.scale + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in XLMRobertaModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.functional.softmax(attention_scores, axis=-1) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. 
+ attention_probs = self.dropout(attention_probs) + + context_layer = paddle.matmul(attention_probs, value_layer) + + context_layer = context_layer.transpose([0, 2, 1, 3]) + new_context_layer_shape = context_layer.shape[:-2] + [ + self.all_head_size, + ] + context_layer = context_layer.reshape(new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + if self.is_decoder: + outputs = outputs + (past_key_value,) + return outputs + + +class XLMRobertaSelfOutput(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: paddle.Tensor, input_tensor: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class XLMRobertaAttention(nn.Layer): + def __init__(self, config, position_embedding_type=None): + super().__init__() + self.self = XLMRobertaSelfAttention(config, position_embedding_type=position_embedding_type) + self.output = XLMRobertaSelfOutput(config) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + encoder_hidden_states: Optional[paddle.Tensor] = None, + encoder_attention_mask: Optional[paddle.Tensor] = None, + past_key_value: Optional[Tuple[Tuple[paddle.Tensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + self_outputs = self.self( + hidden_states, + attention_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +class XLMRobertaIntermediate(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class XLMRobertaOutput(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: paddle.Tensor, input_tensor: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class XLMRobertaLayer(nn.Layer): + def __init__(self, config): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + self.attention = XLMRobertaAttention(config) + self.is_decoder = config.is_decoder + self.add_cross_attention = config.add_cross_attention + if self.add_cross_attention: + if not self.is_decoder: + raise ValueError(f"{self} should be used as a decoder model if cross attention 
is added") + self.crossattention = XLMRobertaAttention(config, position_embedding_type="absolute") + self.intermediate = XLMRobertaIntermediate(config) + self.output = XLMRobertaOutput(config) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + encoder_hidden_states: Optional[paddle.Tensor] = None, + encoder_attention_mask: Optional[paddle.Tensor] = None, + past_key_value: Optional[Tuple[Tuple[paddle.Tensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + # decoder uni-directional self-attention cached key/values tuple is at positions 1,2 + self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + output_attentions=output_attentions, + past_key_value=self_attn_past_key_value, + ) + attention_output = self_attention_outputs[0] + + # if decoder, the last output is tuple of self-attn cache + if self.is_decoder: + outputs = self_attention_outputs[1:-1] + present_key_value = self_attention_outputs[-1] + else: + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + cross_attn_present_key_value = None + if self.is_decoder and encoder_hidden_states is not None: + if not hasattr(self, "crossattention"): + raise ValueError( + f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers" + " by setting `config.add_cross_attention=True`" + ) + + # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple + cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None + cross_attention_outputs = self.crossattention( + attention_output, + attention_mask, + encoder_hidden_states, + encoder_attention_mask, + cross_attn_past_key_value, + output_attentions, + ) + attention_output = cross_attention_outputs[0] + outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights + + # add cross-attn cache to positions 3,4 of present_key_value tuple + cross_attn_present_key_value = cross_attention_outputs[-1] + present_key_value = present_key_value + cross_attn_present_key_value + + layer_output = apply_chunking_to_forward( + self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output + ) + outputs = (layer_output,) + outputs + + # if decoder, return the attn key/values as the last output + if self.is_decoder: + outputs = outputs + (present_key_value,) + + return outputs + + def feed_forward_chunk(self, attention_output): + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + return layer_output + + +class XLMRobertaEncoder(nn.Layer): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.LayerList([XLMRobertaLayer(config) for _ in range(config.num_hidden_layers)]) + self.enable_recompute = False + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + encoder_hidden_states: Optional[paddle.Tensor] = None, + encoder_attention_mask: Optional[paddle.Tensor] = None, + past_key_values: Optional[Tuple[Tuple[paddle.Tensor]]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + return_dict: Optional[bool] = True, + ) -> Union[Tuple[paddle.Tensor], 
BaseModelOutputWithPastAndCrossAttentions]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + + if self.enable_recompute and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + + next_decoder_cache = () if use_cache else None + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + past_key_value = past_key_values[i] if past_key_values is not None else None + + if self.enable_recompute and not hidden_states.stop_gradient: + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, + hidden_states, + attention_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + + hidden_states = layer_outputs[0] + if use_cache: + next_decoder_cache += (layer_outputs[-1],) + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + if self.config.add_cross_attention: + all_cross_attentions = all_cross_attentions + (layer_outputs[2],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [ + hidden_states, + next_decoder_cache, + all_hidden_states, + all_self_attentions, + all_cross_attentions, + ] + if v is not None + ) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=next_decoder_cache, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + cross_attentions=all_cross_attentions, + ) + + +class XLMRobertaPooler(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + pooler_act = getattr(config, "pooler_act", "tanh") + self.activation = ACT2FN[pooler_act] + + def forward(self, hidden_states: paddle.Tensor) -> paddle.Tensor: + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class XLMRobertaPretrainedModel(PretrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = XLMRobertaConfig + pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION + pretrained_resource_files_map = { + "model_state": { + "hf-internal-testing/tiny-random-onnx-xlm-roberta": "https://bj.bcebos.com/paddlenlp/models/community/hf-internal-testing/tiny-random-onnx-xlm-roberta/model.safetensors", + } + } + base_model_prefix = "roberta" + supports_gradient_checkpointing = True + _no_split_modules = ["XLMRobertaEmbeddings", "XLMRobertaSelfAttention"] + + def can_generate(self) -> bool: + return False + + @classmethod + def _get_name_mappings(cls, config): + architectures = config.architectures + [cls.__name__] + mappings = [] + model_mappings = [ + ["embeddings.word_embeddings.weight", "embeddings.word_embeddings.weight"], + ["embeddings.position_ids", "embeddings.position_ids"], + ["embeddings.position_embeddings.weight", "embeddings.position_embeddings.weight"], + ["embeddings.token_type_embeddings.weight", "embeddings.token_type_embeddings.weight"], + ["embeddings.LayerNorm.weight", "embeddings.LayerNorm.weight"], + ["embeddings.LayerNorm.bias", "embeddings.LayerNorm.bias"], + ["pooler.dense.weight", "pooler.dense.weight", "transpose"], + ["pooler.dense.bias", "pooler.dense.bias"], + # for TokenClassification + ] + for layer_index in range(config.num_hidden_layers): + for name in [ + "attention.self.query", + "attention.self.key", + "attention.self.value", + "attention.output.dense", + "attention.output.LayerNorm", + "intermediate.dense", + "output.dense", + "output.LayerNorm", + ]: + action = None if "LayerNorm" in name else "transpose" + model_mappings.extend( + [ + [ + f"encoder.layer.{layer_index}.{name}.weight", + f"encoder.layer.{layer_index}.{name}.weight", + action, + ], + [ + f"encoder.layer.{layer_index}.{name}.bias", + f"encoder.layer.{layer_index}.{name}.bias", + ], + ] + ) + + # base-model prefix "XLMRobertaModel" + torch_prefix = "" + paddle_prefix = "" + if "XLMRobertaModel" not in config.architectures: + torch_prefix = "roberta." + if "XLMRobertaModel" not in [cls.__name__]: + paddle_prefix = "roberta." 
+ + # add prefix + for mapping in model_mappings: + mapping[0] = torch_prefix + mapping[0] + mapping[1] = paddle_prefix + mapping[1] + + if "XLMRobertaForCausalLM" in architectures: + model_mappings.extend( + [ + ["lm_head.dense.weight", "lm_head.dense.weight", "transpose"], + ["lm_head.dense.bias", "lm_head.dense.bias"], + ["lm_head.layer_norm.weight", "lm_head.layer_norm.weight"], + ["lm_head.layer_norm.bias", "lm_head.layer_norm.bias"], + ["lm_head.bias", "lm_head.bias"], + ] + ) + + # downstream mappings + if "XLMRobertaForQuestionAnswering" in architectures: + model_mappings.extend( + [ + ["qa_outputs.weight", "qa_outputs.weight", "transpose"], + ["qa_outputs.bias", "qa_outputs.bias"], + ] + ) + if "XLMRobertaForSequenceClassification" in architectures: + model_mappings.extend( + [ + ["classifier.dense.weight", "classifier.dense.weight", "transpose"], + ["classifier.dense.bias", "classifier.dense.bias"], + ["classifier.out_proj.weight", "classifier.out_proj.weight", "transpose"], + ["classifier.out_proj.bias", "classifier.out_proj.bias"], + ] + ) + + if "XLMRobertaForMultipleChoice" in architectures or "XLMRobertaForTokenClassification" in architectures: + model_mappings.extend( + [ + ["classifier.weight", "classifier.weight", "transpose"], + ["classifier.bias", "classifier.bias"], + ] + ) + + mappings = [StateDictNameMapping(*mapping, index=index) for index, mapping in enumerate(model_mappings)] + return mappings + + @paddle.no_grad() + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if hasattr(module, "padding_idx") and module.padding_idx is not None: + module.weight[module.padding_idx] = 0 + elif isinstance(module, nn.LayerNorm): + module.bias.zero_() + module.weight.fill_(1.0) + + +@register_base_model +class XLMRobertaModel(XLMRobertaPretrainedModel): + """ + + The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of + cross-attention is added between the self-attention layers, following the architecture described in *Attention is + all you need*_ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz + Kaiser and Illia Polosukhin. + + To behave as an decoder the model needs to be initialized with the `is_decoder` argument of the configuration set + to `True`. To be used in a Seq2Seq model, the model needs to initialized with both `is_decoder` argument and + `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass. + + .. 
_*Attention is all you need*: https://arxiv.org/abs/1706.03762 + + """ + + def __init__(self, config, add_pooling_layer=True): + super().__init__(config) + self.config = config + + self.embeddings = XLMRobertaEmbeddings(config) + self.encoder = XLMRobertaEncoder(config) + + self.pooler = XLMRobertaPooler(config) if add_pooling_layer else None + + # Initialize weights and apply final processing + self._post_init(self.__init__) + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + @property + def dtype(self) -> paddle.dtype: + """ + `paddle.dtype`: The dtype of the module (assuming that all the module parameters have the same dtype). + """ + try: + return next(self.named_parameters())[1].dtype + except StopIteration: + try: + return next(self.named_buffers())[1].dtype + except StopIteration: + return self._dtype + + def invert_attention_mask(self, encoder_attention_mask: paddle.Tensor) -> paddle.Tensor: + """ + Invert an attention mask (e.g., switches 0. and 1.). + + Args: + encoder_attention_mask (`paddle.Tensor`): An attention mask. + + Returns: + `paddle.Tensor`: The inverted attention mask. + """ + if encoder_attention_mask.ndim == 3: + encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :] + if encoder_attention_mask.ndim == 2: + encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :] + # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition + # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow + # /transformer/transformer_layers.py#L270 + # encoder_extended_attention_mask = (encoder_extended_attention_mask == + # encoder_extended_attention_mask.transpose(-1, -2)) + encoder_extended_attention_mask = encoder_extended_attention_mask.cast(dtype=self.dtype) # fp16 compatibility + encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * paddle.finfo(self.dtype).min + + return encoder_extended_attention_mask + + @staticmethod + def create_extended_attention_mask_for_decoder(input_shape, attention_mask): + batch_size, seq_length = input_shape + seq_ids = paddle.arange(seq_length) + causal_mask = seq_ids[None, None, :].tile([batch_size, seq_length, 1]) <= seq_ids[None, :, None] + # in case past_key_values are used we need to add a prefix ones mask to the causal mask + # causal and attention masks must have same type with pytorch version < 1.3 + causal_mask = causal_mask.cast(dtype=attention_mask.dtype) + + if causal_mask.shape[1] < attention_mask.shape[1]: + prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1] + causal_mask = paddle.concat( + [ + paddle.ones((batch_size, seq_length, prefix_seq_len), dtype=causal_mask.dtype), + causal_mask, + ], + axis=-1, + ) + + extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :] + return extended_attention_mask + + def get_extended_attention_mask( + self, attention_mask: paddle.Tensor, input_shape: Tuple[int], dtype: paddle.dtype = None + ) -> paddle.Tensor: + """ + Makes broadcastable attention and causal masks so that future and masked tokens are ignored. + + Arguments: + attention_mask (`paddle.Tensor`): + Mask with ones indicating tokens to attend to, zeros for tokens to ignore. + input_shape (`Tuple[int]`): + The shape of the input to the model. + + Returns: + `paddle.Tensor` The extended attention mask, with a the same dtype as `attention_mask.dtype`. 
+ """ + if dtype is None: + dtype = self.dtype + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + if attention_mask.ndim == 3: + extended_attention_mask = attention_mask[:, None, :, :] + elif attention_mask.ndim == 2: + # Provided a padding mask of dimensions [batch_size, seq_length] + # - if the model is a decoder, apply a causal mask in addition to the padding mask + # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length] + if self.config.is_decoder: + extended_attention_mask = XLMRobertaModel.create_extended_attention_mask_for_decoder( + input_shape, attention_mask + ) + else: + extended_attention_mask = attention_mask[:, None, None, :] + else: + raise ValueError( + f"Wrong shape for input_ids (shape {input_shape}) or attention_mask (shape {attention_mask.shape})" + ) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and the dtype's smallest value for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + extended_attention_mask = extended_attention_mask.cast(dtype=dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * paddle.finfo(dtype).min + return extended_attention_mask + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + encoder_hidden_states: Optional[paddle.Tensor] = None, + encoder_attention_mask: Optional[paddle.Tensor] = None, + past_key_values: Optional[List[paddle.Tensor]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[paddle.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]: + r""" + encoder_hidden_states (`paddle.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + past_key_values (`tuple(tuple(paddle.Tensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that + don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all + `decoder_input_ids` of shape `(batch_size, sequence_length)`. 
+ use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`). + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if self.config.is_decoder: + use_cache = use_cache if use_cache is not None else self.config.use_cache + else: + use_cache = False + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.shape + elif inputs_embeds is not None: + input_shape = inputs_embeds.shape[:-1] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + batch_size, seq_length = input_shape + + # past_key_values_length + past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0 + + if attention_mask is None: + attention_mask = paddle.ones( + ((batch_size, seq_length + past_key_values_length)), + ) + + if token_type_ids is None: + if hasattr(self.embeddings, "token_type_ids"): + buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand([batch_size, seq_length]) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = paddle.zeros( + input_shape, + dtype=paddle.int64, + ) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. 
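+        # The extended mask is additive: 0.0 where attention is allowed, the dtype's minimum value at masked positions (see get_extended_attention_mask above).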
+ extended_attention_mask: paddle.Tensor = self.get_extended_attention_mask(attention_mask, input_shape) + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if self.config.is_decoder and encoder_hidden_states is not None: + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.shape + encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) + if encoder_attention_mask is None: + encoder_attention_mask = paddle.ones(encoder_hidden_shape) + encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + inputs_embeds=inputs_embeds, + past_key_values_length=past_key_values_length, + ) + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPoolingAndCrossAttentions( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + past_key_values=encoder_outputs.past_key_values, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + cross_attentions=encoder_outputs.cross_attentions, + ) + + +class XLMRobertaForCausalLM(XLMRobertaPretrainedModel): + _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"] + + def __init__(self, config): + super().__init__(config) + + if not config.is_decoder: + logger.warning("If you want to use `XLMRobertaLMHeadModel` as a standalone, add `is_decoder=True.`") + + self.roberta = XLMRobertaModel(config, add_pooling_layer=False) + + if config.tie_word_embeddings: + input_embeddings = self.roberta.embeddings.word_embeddings.weight + else: + input_embeddings = None + self.lm_head = XLMRobertaLMHead(config, input_embeddings=input_embeddings) + + # Initialize weights and apply final processing + self._post_init(self.__init__) + + def get_output_embeddings(self): + if self.config.tie_word_embeddings: + return None + else: + return self.lm_head.decoder + + def set_output_embeddings(self, new_embeddings): + if self.config.tie_word_embeddings: + logger.warning( + "`set_output_embeddings` method is called when `config.tie_word_embeddings=True`. This is not expected. We will do nothing!" 
+ ) + else: + self.lm_head.decoder = new_embeddings + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + encoder_hidden_states: Optional[paddle.Tensor] = None, + encoder_attention_mask: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + past_key_values: Tuple[Tuple[paddle.Tensor]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[paddle.Tensor], CausalLMOutputWithCrossAttentions]: + r""" + encoder_hidden_states (`paddle.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in + `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are + ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` + past_key_values (`tuple(tuple(paddle.Tensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that + don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`). 
+ + Returns: + + Example: + + ```python + >>> from paddlenlp.transformers import AutoTokenizer, XLMRobertaForCausalLM, AutoConfig + >>> import paddle + + >>> tokenizer = AutoTokenizer.from_pretrained("roberta-base") + >>> config = AutoConfig.from_pretrained("roberta-base") + >>> config.is_decoder = True + >>> model = XLMRobertaForCausalLM.from_pretrained("roberta-base", config=config) + + >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pd") + >>> outputs = model(**inputs) + + >>> prediction_logits = outputs.logits + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + if labels is not None: + use_cache = False + + outputs = self.roberta( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + prediction_scores = self.lm_head(sequence_output) + + lm_loss = None + if labels is not None: + # move labels to correct device to enable model parallelism + # we are doing next-token prediction; shift prediction scores and input ids by one + shifted_prediction_scores = prediction_scores[:, :-1, :] + labels = labels[:, 1:] + loss_fct = CrossEntropyLoss() + lm_loss = loss_fct( + shifted_prediction_scores.reshape([-1, self.config.vocab_size]), + labels.reshape( + [ + -1, + ] + ), + ) + + if not return_dict: + output = (prediction_scores,) + outputs[2:] + return ((lm_loss,) + output) if lm_loss is not None else output + + return CausalLMOutputWithCrossAttentions( + loss=lm_loss, + logits=prediction_scores, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + cross_attentions=outputs.cross_attentions, + ) + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, **model_kwargs): + input_shape = input_ids.shape + # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly + if attention_mask is None: + attention_mask = paddle.ones(input_shape, dtype=input_ids.dtype) + + # cut decoder_input_ids if past_key_values is used + if past_key_values is not None: + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] + + return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values} + + def _reorder_cache(self, past_key_values, beam_idx): + reordered_past = () + for layer_past in past_key_values: + reordered_past += (tuple(past_state.index_select(axis=0, index=beam_idx) for past_state in layer_past),) + return reordered_past + + +class XLMRobertaForMaskedLM(XLMRobertaPretrainedModel): + _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"] + + def __init__(self, config): + super().__init__(config) + + if config.is_decoder: + logger.warning( + "If you want to use `XLMRobertaForMaskedLM` make sure `config.is_decoder=False` for " + "bi-directional self-attention." 
+ ) + + self.roberta = XLMRobertaModel(config, add_pooling_layer=False) + + if config.tie_word_embeddings: + input_embeddings = self.roberta.embeddings.word_embeddings.weight + else: + input_embeddings = None + self.lm_head = XLMRobertaLMHead(config, input_embeddings=input_embeddings) + + # Initialize weights and apply final processing + self._post_init(self.__init__) + + def get_output_embeddings(self): + if self.config.tie_word_embeddings: + return None + else: + return self.lm_head.decoder + + def set_output_embeddings(self, new_embeddings): + if self.config.tie_word_embeddings: + logger.warning( + "`set_output_embeddings` method is called when `config.tie_word_embeddings=True`. This is not expected. We will do nothing!" + ) + else: + self.lm_head.decoder = new_embeddings + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + encoder_hidden_states: Optional[paddle.Tensor] = None, + encoder_attention_mask: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[paddle.Tensor], MaskedLMOutput]: + r""" + labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ..., + config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the + loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` + kwargs (`Dict[str, any]`, optional, defaults to *{}*): + Used to hide legacy arguments that have been deprecated. + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.roberta( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = outputs[0] + prediction_scores = self.lm_head(sequence_output) + + masked_lm_loss = None + if labels is not None: + # move labels to correct device to enable model parallelism + loss_fct = CrossEntropyLoss() + masked_lm_loss = loss_fct( + prediction_scores.reshape([-1, self.config.vocab_size]), + labels.reshape( + [ + -1, + ] + ), + ) + + if not return_dict: + output = (prediction_scores,) + outputs[2:] + return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output + + return MaskedLMOutput( + loss=masked_lm_loss, + logits=prediction_scores, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +class XLMRobertaLMHead(nn.Layer): + """Roberta Head for masked language modeling.""" + + def __init__(self, config, input_embeddings=None): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.layer_norm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + + if input_embeddings is None: + # The output weights are the same as the input embeddings, but there is + # an output-only bias for each token. 
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias_attr=False) + # self.bias = nn.Parameter(paddle.zeros((config.vocab_size,))) + data = paddle.zeros((config.vocab_size,)) + self.bias = paddle.create_parameter( + data.shape, dtype=data.dtype, default_initializer=nn.initializer.Assign(data) + ) + # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings` + self.decoder.bias = self.bias + else: + # self.bias = nn.Parameter(paddle.zeros((config.vocab_size,))) + data = paddle.zeros((config.vocab_size,)) + self.bias = paddle.create_parameter( + data.shape, dtype=data.dtype, default_initializer=nn.initializer.Assign(data) + ) + decoder_weight = input_embeddings.weight if hasattr(input_embeddings, "weight") else input_embeddings + self.decoder = lambda x: paddle.matmul(x, decoder_weight, transpose_y=True) + self.bias + + def forward(self, features, **kwargs): + x = self.dense(features) + x = nn.functional.gelu(x) + x = self.layer_norm(x) + + # project back to size of vocabulary with bias + x = self.decoder(x) + + return x + + +class XLMRobertaForSequenceClassification(XLMRobertaPretrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + self.config = config + + self.roberta = XLMRobertaModel(config, add_pooling_layer=False) + self.classifier = XLMRobertaClassificationHead(config) + + # Initialize weights and apply final processing + self._post_init(self.__init__) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[paddle.Tensor], SequenceClassifierOutput]: + r""" + labels (`paddle.Tensor` of shape `(batch_size,)`, *optional*): + Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., + config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If + `config.num_labels > 1` a classification loss is computed (Cross-Entropy). 
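As a quick reference (not part of the patch), the problem-type dispatch implemented just below keys off `num_labels` and the label dtype; a minimal sketch of the three accepted label layouts, with hypothetical shapes:

```python
import paddle
import paddle.nn as nn

batch, num_labels = 4, 3
logits = paddle.randn([batch, num_labels])

# 1) regression: num_labels == 1 and float targets -> MSELoss
reg_logits = paddle.randn([batch, 1])
reg_labels = paddle.rand([batch, 1])
mse = nn.MSELoss()(reg_logits.squeeze(), reg_labels.squeeze())

# 2) single-label classification: integer class ids of shape [batch] -> CrossEntropyLoss
cls_labels = paddle.randint(0, num_labels, shape=[batch])
ce = nn.CrossEntropyLoss()(logits.reshape([-1, num_labels]), cls_labels.reshape([-1]))

# 3) multi-label classification: float 0/1 targets of shape [batch, num_labels] -> BCEWithLogitsLoss
multi_labels = paddle.cast(paddle.randint(0, 2, shape=[batch, num_labels]), "float32")
bce = nn.BCEWithLogitsLoss()(logits, multi_labels)
```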
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.roberta( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = outputs[0] + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + # move labels to correct device to enable model parallelism + if self.config.problem_type is None: + if self.num_labels == 1: + self.config.problem_type = "regression" + elif self.num_labels > 1 and (labels.dtype == paddle.int64 or labels.dtype == paddle.int32): + self.config.problem_type = "single_label_classification" + else: + self.config.problem_type = "multi_label_classification" + + if self.config.problem_type == "regression": + loss_fct = MSELoss() + if self.num_labels == 1: + loss = loss_fct(logits.squeeze(), labels.squeeze()) + else: + loss = loss_fct(logits, labels) + elif self.config.problem_type == "single_label_classification": + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.reshape([-1, self.num_labels]), labels.reshape((-1,))) + elif self.config.problem_type == "multi_label_classification": + loss_fct = BCEWithLogitsLoss() + loss = loss_fct(logits, labels) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +class XLMRobertaForMultipleChoice(XLMRobertaPretrainedModel): + def __init__(self, config): + super().__init__(config) + + self.roberta = XLMRobertaModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 1) + + # Initialize weights and apply final processing + self._post_init(self.__init__) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[paddle.Tensor], MultipleChoiceModelOutput]: + r""" + labels (`paddle.Tensor` of shape `(batch_size,)`, *optional*): + Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., + num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. 
(See + `input_ids` above) + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] + + flat_input_ids = input_ids.reshape([-1, input_ids.shape[-1]]) if input_ids is not None else None + flat_position_ids = position_ids.reshape([-1, position_ids.shape[-1]]) if position_ids is not None else None + flat_token_type_ids = ( + token_type_ids.reshape([-1, token_type_ids.shape[-1]]) if token_type_ids is not None else None + ) + flat_attention_mask = ( + attention_mask.reshape([-1, attention_mask.shape[-1]]) if attention_mask is not None else None + ) + flat_inputs_embeds = ( + inputs_embeds.reshape([-1, inputs_embeds.shape[-2], inputs_embeds.shape[-1]]) + if inputs_embeds is not None + else None + ) + + outputs = self.roberta( + flat_input_ids, + position_ids=flat_position_ids, + token_type_ids=flat_token_type_ids, + attention_mask=flat_attention_mask, + inputs_embeds=flat_inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + reshaped_logits = logits.reshape([-1, num_choices]) + + loss = None + if labels is not None: + # move labels to correct device to enable model parallelism + loss_fct = CrossEntropyLoss() + loss = loss_fct(reshaped_logits, labels) + + if not return_dict: + output = (reshaped_logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return MultipleChoiceModelOutput( + loss=loss, + logits=reshaped_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +class XLMRobertaForTokenClassification(XLMRobertaPretrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.roberta = XLMRobertaModel(config, add_pooling_layer=False) + classifier_dropout = ( + config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob + ) + self.dropout = nn.Dropout(classifier_dropout) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + # Initialize weights and apply final processing + self._post_init(self.__init__) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + labels: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[paddle.Tensor], TokenClassifierOutput]: + r""" + labels (`paddle.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`. 
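One detail worth spelling out from the multiple-choice head above: the choices are folded into the batch dimension before the encoder call, and the per-choice scores are unfolded again before the loss. A small standalone sketch (hypothetical shapes, not part of the patch):

```python
import paddle
import paddle.nn as nn

batch, num_choices, seq_len = 2, 4, 16
input_ids = paddle.randint(0, 1000, shape=[batch, num_choices, seq_len])

# Fold choices into the batch dimension so the encoder sees ordinary 2D input_ids.
flat_input_ids = input_ids.reshape([-1, input_ids.shape[-1]])  # [batch * num_choices, seq_len]

# The classifier emits one score per (example, choice); unfold before the loss.
choice_scores = paddle.randn([batch * num_choices, 1])         # stand-in for classifier(pooled_output)
reshaped_logits = choice_scores.reshape([-1, num_choices])     # [batch, num_choices]
labels = paddle.randint(0, num_choices, shape=[batch])
loss = nn.CrossEntropyLoss()(reshaped_logits, labels)
```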
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.roberta( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + # move labels to correct device to enable model parallelism + loss_fct = CrossEntropyLoss() + loss = loss_fct( + logits.reshape([-1, self.num_labels]), + labels.reshape( + [ + -1, + ] + ), + ) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return TokenClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +class XLMRobertaClassificationHead(nn.Layer): + """Head for sentence-level classification tasks.""" + + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + classifier_dropout = ( + config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob + ) + self.dropout = nn.Dropout(classifier_dropout) + self.out_proj = nn.Linear(config.hidden_size, config.num_labels) + + pooler_act = getattr(config, "pooler_act", "tanh") + self.activation = ACT2FN[pooler_act] + + def forward(self, features, **kwargs): + x = features[:, 0, :] # take token (equiv. to [CLS]) + x = self.dropout(x) + x = self.dense(x) + x = self.activation(x) + x = self.dropout(x) + x = self.out_proj(x) + return x + + +class XLMRobertaForQuestionAnswering(XLMRobertaPretrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.roberta = XLMRobertaModel(config, add_pooling_layer=False) + self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels) + + # Initialize weights and apply final processing + self._post_init(self.__init__) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + inputs_embeds: Optional[paddle.Tensor] = None, + start_positions: Optional[paddle.Tensor] = None, + end_positions: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[paddle.Tensor], QuestionAnsweringModelOutput]: + r""" + start_positions (`paddle.Tensor` of shape `(batch_size,)`, *optional*): + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence + are not taken into account for computing the loss. + end_positions (`paddle.Tensor` of shape `(batch_size,)`, *optional*): + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence + are not taken into account for computing the loss. 
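For the question-answering head whose loss follows below, a minimal sketch (not part of the patch) of how the two logit streams are split and how out-of-range span positions are neutralised via `ignore_index`; shapes are hypothetical:

```python
import paddle
import paddle.nn as nn

batch, seq_len, hidden = 2, 32, 8
sequence_output = paddle.randn([batch, seq_len, hidden])
qa_outputs = nn.Linear(hidden, 2)

logits = qa_outputs(sequence_output)                 # [batch, seq_len, 2]
start_logits, end_logits = logits.chunk(2, axis=-1)  # two [batch, seq_len, 1] tensors
start_logits = start_logits.squeeze(-1)
end_logits = end_logits.squeeze(-1)

# Positions outside the sequence are clamped to seq_len, which the loss then ignores.
start_positions = paddle.to_tensor([3, 40]).clip(0, seq_len)
end_positions = paddle.to_tensor([7, 50]).clip(0, seq_len)
loss_fct = nn.CrossEntropyLoss(ignore_index=seq_len)
total_loss = (loss_fct(start_logits, start_positions) + loss_fct(end_logits, end_positions)) / 2
```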
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.roberta( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + logits = self.qa_outputs(sequence_output) + start_logits, end_logits = logits.chunk(2, axis=-1) + start_logits = start_logits.squeeze(-1) + end_logits = end_logits.squeeze(-1) + + total_loss = None + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, split add a dimension + if len(start_positions.shape) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.shape) > 1: + end_positions = end_positions.squeeze(-1) + # sometimes the start/end positions are outside our model inputs, we ignore these terms + ignored_index = start_logits.shape[1] + start_positions = start_positions.clip(0, ignored_index) + end_positions = end_positions.clip(0, ignored_index) + + loss_fct = CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + + if not return_dict: + output = (start_logits, end_logits) + outputs[2:] + return ((total_loss,) + output) if total_loss is not None else output + + return QuestionAnsweringModelOutput( + loss=total_loss, + start_logits=start_logits, + end_logits=end_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0): + """ + Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols + are ignored. This is modified from fairseq's `utils.make_positions`. + + Args: + x: paddle.Tensor x: + + Returns: paddle.Tensor + """ + mask = (input_ids != padding_idx).cast("int64") + incremental_indices = (paddle.cumsum(mask, axis=1) + past_key_values_length) * mask + return incremental_indices.cast("int64") + padding_idx diff --git a/paddlenlp/transformers/xlm_roberta/tokenizer.py b/paddlenlp/transformers/xlm_roberta/tokenizer.py new file mode 100644 index 000000000000..fcd13236878d --- /dev/null +++ b/paddlenlp/transformers/xlm_roberta/tokenizer.py @@ -0,0 +1,305 @@ +# coding=utf-8 +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License +""" Tokenization classes for XLM-RoBERTa model.""" + + +import os +from shutil import copyfile +from typing import Any, Dict, List, Optional, Tuple + +import sentencepiece as spm + +from ...utils import logger +from ..tokenizer_utils import AddedToken, PretrainedTokenizer + +SPIECE_UNDERLINE = "▁" + +__all__ = ["XLMRobertaTokenizer"] + + +class XLMRobertaTokenizer(PretrainedTokenizer): + """ + Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on + [SentencePiece](https://github.com/google/sentencepiece). + + This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to + this superclass for more information regarding those methods. + + Args: + vocab_file (`str`): + Path to the vocabulary file. + bos_token (`str`, *optional*, defaults to `""`): + The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. + + + + When building a sequence using special tokens, this is not the token that is used for the beginning of + sequence. The token used is the `cls_token`. + + + + eos_token (`str`, *optional*, defaults to `""`): + The end of sequence token. + + + + When building a sequence using special tokens, this is not the token that is used for the end of sequence. + The token used is the `sep_token`. + + + + sep_token (`str`, *optional*, defaults to `""`): + The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for + sequence classification or for a text and a question for question answering. It is also used as the last + token of a sequence built with special tokens. + cls_token (`str`, *optional*, defaults to `""`): + The classifier token which is used when doing sequence classification (classification of the whole sequence + instead of per-token classification). It is the first token of the sequence when built with special tokens. + unk_token (`str`, *optional*, defaults to `""`): + The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this + token instead. + pad_token (`str`, *optional*, defaults to `""`): + The token used for padding, for example when batching sequences of different lengths. + mask_token (`str`, *optional*, defaults to `""`): + The token used for masking values. This is the token used when training this model with masked language + modeling. This is the token which the model will try to predict. + sp_model_kwargs (`dict`, *optional*): + Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for + SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, + to set: + + - `enable_sampling`: Enable subword regularization. + - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout. + + - `nbest_size = {0,1}`: No sampling is performed. + - `nbest_size > 1`: samples from the nbest_size results. + - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) + using forward-filtering-and-backward-sampling algorithm. + + - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for + BPE-dropout. + + Attributes: + sp_model (`SentencePieceProcessor`): + The *SentencePiece* processor that is used for every conversion (string, tokens and IDs). 
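A hedged usage sketch (not part of the patch): `BAAI/bge-m3` is the checkpoint registered for this tokenizer in the resource map that follows, downloading its SentencePiece model is assumed to succeed, and forwarding `sp_model_kwargs` through `from_pretrained` is assumed to reach `__init__` as documented above.

```python
from paddlenlp.transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("Hello world", "second sentence")
print(encoded["input_ids"])  # cls id, pieces of A, double sep, pieces of B, sep

# sp_model_kwargs is handed to SentencePiece, e.g. to turn on subword regularization
# during training-time tokenization (parameter names follow the docstring above).
sampling_tokenizer = XLMRobertaTokenizer.from_pretrained(
    "BAAI/bge-m3",
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)
```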
+ """ + + resource_files_names = {"vocab_file": "sentencepiece.bpe.model"} + pretrained_resource_files_map = { + "vocab_file": { + "BAAI/bge-m3": "https://bj.bcebos.com/paddlenlp/models/community/BAAI/bge-m3/sentencepiece.bpe.model", + } + } + pretrained_init_configuration = {} + max_model_input_sizes = { + "BAAI/bge-m3": 8192, + } + model_input_names = ["input_ids", "attention_mask"] + + def __init__( + self, + vocab_file, + bos_token="", + eos_token="", + sep_token="", + cls_token="", + unk_token="", + pad_token="", + mask_token="", + sp_model_kwargs: Optional[Dict[str, Any]] = None, + **kwargs, + ) -> None: + # Mask token behave like a normal word, i.e. include the space before it + mask_token = AddedToken(mask_token, lstrip=True, special=True) if isinstance(mask_token, str) else mask_token + + self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(str(vocab_file)) + self.vocab_file = vocab_file + + # Original fairseq vocab and spm vocab must be "aligned": + # Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 + # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ---- + # fairseq | '' | '' | '' | '' | ',' | '.' | '▁' | 's' | '▁de' | '-' + # spm | '' | '' | '' | ',' | '.' | '▁' | 's' | '▁de' | '-' | '▁a' + + # Mimic fairseq token-to-id alignment for the first 4 token + self.fairseq_tokens_to_ids = {"": 0, "": 1, "": 2, "": 3} + + # The first "real" token "," has position 4 in the original fairseq vocab and position 3 in the spm vocab + self.fairseq_offset = 1 + + self.fairseq_tokens_to_ids[""] = len(self.sp_model) + self.fairseq_offset + self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()} + + super().__init__( + bos_token=bos_token, + eos_token=eos_token, + unk_token=unk_token, + sep_token=sep_token, + cls_token=cls_token, + pad_token=pad_token, + mask_token=mask_token, + sp_model_kwargs=self.sp_model_kwargs, + **kwargs, + ) + + def __getstate__(self): + state = self.__dict__.copy() + state["sp_model"] = None + state["sp_model_proto"] = self.sp_model.serialized_model_proto() + return state + + def __setstate__(self, d): + self.__dict__ = d + + # for backward compatibility + if not hasattr(self, "sp_model_kwargs"): + self.sp_model_kwargs = {} + + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.LoadFromSerializedProto(self.sp_model_proto) + + def build_inputs_with_special_tokens( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and + adding special tokens. An XLM-RoBERTa sequence has the following format: + + - single sequence: ` X ` + - pair of sequences: ` A B ` + + Args: + token_ids_0 (`List[int]`): + List of IDs to which the special tokens will be added. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + + Returns: + `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens. 
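Spelled out with the standard XLM-RoBERTa special tokens, the layouts described above are `<s> X </s>` for a single sequence and `<s> A </s></s> B </s>` for a pair. A tiny sketch (not part of the patch) with placeholder ids, using 0 for `<s>`/cls and 2 for `</s>`/sep as in the fairseq alignment shown earlier:

```python
ids_a = [100, 101, 102]  # hypothetical piece ids for sequence A
ids_b = [200, 201]       # hypothetical piece ids for sequence B
cls_id, sep_id = 0, 2

single = [cls_id] + ids_a + [sep_id]                           # <s> A </s>
pair = [cls_id] + ids_a + [sep_id, sep_id] + ids_b + [sep_id]  # <s> A </s></s> B </s>

# The matching special-tokens mask marks exactly the added positions:
single_mask = [1] + [0] * len(ids_a) + [1]
pair_mask = [1] + [0] * len(ids_a) + [1, 1] + [0] * len(ids_b) + [1]
```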
+ """ + + if token_ids_1 is None: + return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + cls = [self.cls_token_id] + sep = [self.sep_token_id] + return cls + token_ids_0 + sep + sep + token_ids_1 + sep + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer `prepare_for_model` method. + + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not the token list is already formatted with special tokens for the model. + + Returns: + `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + + if already_has_special_tokens: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True + ) + + if token_ids_1 is None: + return [1] + ([0] * len(token_ids_0)) + [1] + return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1] + + def create_token_type_ids_from_sequences( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Create a mask from the two sequences passed to be used in a sequence-pair classification task. XLM-RoBERTa does + not make use of token type ids, therefore a list of zeros is returned. + + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + + Returns: + `List[int]`: List of zeros. 
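The id-conversion helpers that follow implement the fairseq-style alignment sketched in `__init__`: four reserved slots at the front, every SentencePiece id shifted by `fairseq_offset = 1`, and the mask token appended at the very end of the vocabulary. A standalone sketch (not part of the patch; the reserved token strings are the usual XLM-R ones, and the piece id is hypothetical):

```python
fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
fairseq_offset = 1  # spm id 3 (",") becomes vocab id 4, and so on

def convert_token_to_id(token, piece_to_id, unk_token_id=3):
    if token in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[token]
    spm_id = piece_to_id(token)
    # SentencePiece returns 0 for unknown pieces, which maps back to <unk>.
    return spm_id + fairseq_offset if spm_id else unk_token_id

toy_pieces = {"▁Hello": 35377}  # hypothetical SentencePiece id
print(convert_token_to_id("▁Hello", lambda t: toy_pieces.get(t, 0)))  # 35378
print(convert_token_to_id("<pad>", lambda t: 0))                      # 1
```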
+ + """ + + sep = [self.sep_token_id] + cls = [self.cls_token_id] + + if token_ids_1 is None: + return len(cls + token_ids_0 + sep) * [0] + return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0] + + @property + def vocab_size(self): + return len(self.sp_model) + self.fairseq_offset + 1 # Add the token + + def get_vocab(self): + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} + vocab.update(self.added_tokens_encoder) + return vocab + + def _tokenize(self, text: str) -> List[str]: + # TODO check if the t5/llama PR also applies here + return self.sp_model.encode(text, out_type=str) + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + if token in self.fairseq_tokens_to_ids: + return self.fairseq_tokens_to_ids[token] + spm_id = self.sp_model.PieceToId(token) + + # Need to return unknown token if the SP model returned 0 + return spm_id + self.fairseq_offset if spm_id else self.unk_token_id + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + if index in self.fairseq_ids_to_tokens: + return self.fairseq_ids_to_tokens[index] + return self.sp_model.IdToPiece(index - self.fairseq_offset) + + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (strings for sub-words) in a single string.""" + out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip() + return out_string + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + if not os.path.isdir(save_directory): + logger.error(f"Vocabulary path ({save_directory}) should be a directory") + return + out_vocab_file = os.path.join( + save_directory, + (filename_prefix + "-" if filename_prefix else "") + self.resource_files_names["vocab_file"], + ) + + if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file): + copyfile(self.vocab_file, out_vocab_file) + elif not os.path.isfile(self.vocab_file): + with open(out_vocab_file, "wb") as fi: + content_spiece_model = self.sp_model.serialized_model_proto() + fi.write(content_spiece_model) + + return (out_vocab_file,) diff --git a/paddlenlp/trl/embedding_trainer.py b/paddlenlp/trl/embedding_trainer.py index c50f19738bed..610db8151fd3 100644 --- a/paddlenlp/trl/embedding_trainer.py +++ b/paddlenlp/trl/embedding_trainer.py @@ -26,6 +26,7 @@ SimpleInfclLoss, ) from paddlenlp.transformers.embedding_utils import dist_gather_tensor_with_gradient +from paddlenlp.utils import empty_device_cache __all__ = ["EmbeddingTrainer"] @@ -63,7 +64,7 @@ def __init__(self, model_args, **kwargs): def clear_memory(self): self.accum_q_features.clear() self.accum_p_features.clear() - paddle.device.cuda.empty_cache() + empty_device_cache() def clear_state(self): self.accum_data.clear() diff --git a/paddlenlp/trl/llm_utils.py b/paddlenlp/trl/llm_utils.py index d5fa8dc76354..c19496909295 100644 --- a/paddlenlp/trl/llm_utils.py +++ b/paddlenlp/trl/llm_utils.py @@ -34,9 +34,12 @@ from paddlenlp.transformers import ( AutoTokenizer, ChatGLMv2Tokenizer, + DeepseekV2ForCausalLMPipe, + DeepseekV3ForCausalLMPipe, LlamaForCausalLMPipe, PretrainedConfig, Qwen2ForCausalLMPipe, + Qwen2MoeForCausalLMPipe, ) from paddlenlp.transformers.tokenizer_utils import PretrainedTokenizer from paddlenlp.utils.log import logger @@ -210,7 +213,7 @@ def get_lora_target_modules(model): ".*w2.*", ".*w3.*", ] - elif model.base_model_prefix == "qwen2_moe": + elif model.base_model_prefix == 
"qwen2_moe" or isinstance(model, Qwen2MoeForCausalLMPipe): target_modules = [ ".*q_proj.*", ".*k_proj.*", @@ -221,6 +224,21 @@ def get_lora_target_modules(model): ".*up_proj.*", ".*down_proj.*", ] + elif model.base_model_prefix in ["deepseek_v2", "deepseek_v3"] or isinstance( + model, (DeepseekV2ForCausalLMPipe, DeepseekV3ForCausalLMPipe) + ): + target_modules = [ + ".*q_proj.*", + ".*q_a_proj.*", + ".*q_b_proj.*", + ".*kv_a_proj_with_mqa.*", + ".*kv_b_proj.*", + ".*kv_b_proj.*", + ".*o_proj.*", + ".*mlp.gate_proj.*", + ".*mlp.up_proj.*", + ".*mlp.down_proj.*", + ] elif model.base_model_prefix == "yuan": target_modules = [ ".*q_proj.*", @@ -597,7 +615,9 @@ def get_model_max_position_embeddings(config: PretrainedConfig) -> Optional[int] def read_res(model_name_or_path: str, tensor_queue: mp.Queue, result_queue: mp.Queue, done_event: mp.Event): - tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + from paddlenlp.utils.env import USE_FAST_TOKENIZER + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="left", use_fast=USE_FAST_TOKENIZER) paddle.device.set_device("cpu") paddle.disable_static() @@ -628,7 +648,9 @@ def read_res(model_name_or_path: str, tensor_queue: mp.Queue, result_queue: mp.Q def speculate_read_res(model_name_or_path: str, tensor_queue: mp.Queue, result_queue: mp.Queue, done_event: mp.Event): - tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + from paddlenlp.utils.env import USE_FAST_TOKENIZER + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=USE_FAST_TOKENIZER) paddle.device.set_device("cpu") paddle.disable_static() outputs = [] diff --git a/paddlenlp/trl/sft_auto_trainer.py b/paddlenlp/trl/sft_auto_trainer.py index e56c86bf7f9d..82ed798bc2a5 100644 --- a/paddlenlp/trl/sft_auto_trainer.py +++ b/paddlenlp/trl/sft_auto_trainer.py @@ -27,8 +27,7 @@ parallelize_optimizer, ) -from paddlenlp.trl import SFTTrainer - +from .sft_trainer import SFTTrainer from ..data import DataCollatorForSeq2Seq from ..trainer.argparser import strtobool from ..trainer.trainer import ( diff --git a/paddlenlp/utils/__init__.py b/paddlenlp/utils/__init__.py index 3b5950b0d701..b4fb779c5abb 100644 --- a/paddlenlp/utils/__init__.py +++ b/paddlenlp/utils/__init__.py @@ -21,6 +21,7 @@ from .import_utils import * from .infohub import infohub from .initializer import to +from .memory_utils import empty_device_cache from .optimizer import * from .serialization import load_torch diff --git a/paddlenlp/utils/downloader.py b/paddlenlp/utils/downloader.py index a382a4dd265b..66cf6f7ab23e 100644 --- a/paddlenlp/utils/downloader.py +++ b/paddlenlp/utils/downloader.py @@ -24,6 +24,7 @@ from collections import OrderedDict from typing import Optional, Union +import paddle.distributed as dist import requests from filelock import FileLock from huggingface_hub import get_hf_file_metadata, hf_hub_url @@ -33,7 +34,13 @@ from .env import DOWNLOAD_SERVER, FAILED_STATUS, SUCCESS_STATUS from .fault_tolerance import PDC_DOWNLOAD_ERROR from .log import logger -from .pdc_sdk import PDCErrorCode, PDCErrorMessageMap, pdc_tool +from .pdc_sdk import ( + FLASH_DEVICE, + PDCErrorCode, + PDCErrorMessageMap, + pdc_flash_device_available, + pdc_tool, +) __all__ = ["get_weights_path_from_url"] @@ -487,7 +494,7 @@ def download_from_pdc(remote_path, local_path, timeout): """ try: - base_dir, _ = os.path.split(os.path.normpath(remote_path)) + base_dir, _ = os.path.split(os.path.normpath(local_path)) if not os.path.exists(base_dir) and base_dir != "": 
os.makedirs(base_dir, exist_ok=True) except Exception as e: @@ -505,3 +512,81 @@ def download_from_pdc(remote_path, local_path, timeout): raise RuntimeError( f"{PDC_DOWNLOAD_ERROR}; Error occurred when trying to download object from PDC, remote_path: {remote_path}, local_path: {local_path}, timeout: {timeout}; error details: {PDCErrorMessageMap[result]}" ) + + +def get_static_model_on_pdc(remote_path, local_path, timeout, enable_flash_device=False): + """ + Get static model from PDC. Use flash device if possible. + This function has to be called after distributed env is initialized in distributed mode. + Args: + remote_path (`str`): + remote path url for download + local_path (`str`): + local path to place downloaded object + timeout (`int`): + max wait time for download + enable_flash_device (`bool`): + Whether to use flash device + Returns: + str: path to load static model + """ + try: + base_dir, target_dir = os.path.split(os.path.normpath(local_path)) + if not os.path.exists(base_dir) and base_dir != "": + os.makedirs(base_dir, exist_ok=True) + except Exception as e: + raise RuntimeError(f"{PDC_DOWNLOAD_ERROR}; Failed to parse checkpoint path, details: {e}") + + assert target_dir != ".", f"{PDC_DOWNLOAD_ERROR}, illegal local_path: {local_path}." + + flash_path = os.path.join(FLASH_DEVICE, target_dir) + persistent_path = local_path + + device_id = int(os.getenv("FLAGS_selected_gpus", "0")) + if device_id != 0: + logger.info("Waiting local process 0...") + dist.barrier() + return flash_path if (enable_flash_device and os.path.exists(flash_path)) else persistent_path + + # step 1: load from flash device if possible + need_download_from_remote = True + need_backup_to_flash = False + if enable_flash_device and pdc_flash_device_available(): + logger.info(f"flash device is available, checking status on {flash_path}...") + # skip download SC as default when flash device is available + need_download_from_remote = False + if os.path.exists(flash_path) and pdc_tool.pdc_flash_do_check(flash_path) == PDCErrorCode.Success: + logger.info("Static model checked successfully on flash device, ready to load...") + else: + logger.warning( + "flash device is available but no valid static model found on flash device, need to download from remote." 
+ ) + need_download_from_remote = True + need_backup_to_flash = True + else: + logger.info("Flash device is not enabled or available, will download static model from remote.") + + # step 2: download from remote if neccesary + if need_download_from_remote: + logger.info("Beging download static model from remote...") + download_from_pdc(remote_path, persistent_path, timeout) + logger.info(f"downloaded static model from remote, path:{persistent_path}") + + # step 3: backup to flash device if flash device is available + if enable_flash_device and need_backup_to_flash: + result = pdc_tool.pdc_backup_to_flash_device(persistent_path, flash_path) + if result == PDCErrorCode.Success: + logger.info(f"Backup static model to flash device {flash_path} successfully.") + else: + logger.error(f"Backup static model to flash device failed, error details: {PDCErrorMessageMap[result]}.") + + # step 4: return flash path if available, otherwise return persistent path + if dist.get_world_size() > 1: + logger.info("Local node process done, waiting other nodes...") + dist.barrier() + if enable_flash_device and os.path.exists(flash_path): + logger.info(f"static model is ready on flash device, path: {flash_path}") + return flash_path + else: + logger.info(f"static model is only ready on persistent storage, path: {persistent_path}") + return persistent_path diff --git a/paddlenlp/utils/env.py b/paddlenlp/utils/env.py index c139327b9ebd..ac7396a48828 100644 --- a/paddlenlp/utils/env.py +++ b/paddlenlp/utils/env.py @@ -20,6 +20,13 @@ """ import os +try: + from paddle.base.framework import use_pir_api + + pir_enabled = use_pir_api() +except ImportError: + pir_enabled = False + def _get_user_home(): return os.path.expanduser("~") @@ -132,3 +139,12 @@ def _get_bool_env(env_key: str, default_value: str) -> bool: MAX_BSZ = 512 SPECULATE_MAX_BSZ = 256 MAX_DRAFT_TOKENS = 6 + +if pir_enabled: + PADDLE_INFERENCE_MODEL_SUFFIX = ".json" + PADDLE_INFERENCE_WEIGHTS_SUFFIX = ".pdiparams" +else: + PADDLE_INFERENCE_MODEL_SUFFIX = ".pdmodel" + PADDLE_INFERENCE_WEIGHTS_SUFFIX = ".pdiparams" + +USE_FAST_TOKENIZER: bool = _get_bool_env("USE_FAST_TOKENIZER", "false") diff --git a/paddlenlp/utils/memory_utils.py b/paddlenlp/utils/memory_utils.py new file mode 100644 index 000000000000..05e9a9fe3fef --- /dev/null +++ b/paddlenlp/utils/memory_utils.py @@ -0,0 +1,39 @@ +# coding:utf-8 +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle + +from .log import logger +from .tools import get_env_device + +__all__ = [ + "empty_device_cache", +] + + +def empty_device_cache(): + device = get_env_device() + if device == "gpu": + paddle.device.cuda.empty_cache() + elif device == "xpu": + paddle.device.xpu.empty_cache() + else: + if not getattr(empty_device_cache, "has_warned", False): + logger.warning( + "The current device ({}) does not support empty cache, calling empty_device_cache() will have no effect.".format( + device + ) + ) + setattr(empty_device_cache, "has_warned", True) diff --git a/paddlenlp/utils/pdc_sdk.py b/paddlenlp/utils/pdc_sdk.py index 028850e4d388..c306eedd92c7 100644 --- a/paddlenlp/utils/pdc_sdk.py +++ b/paddlenlp/utils/pdc_sdk.py @@ -15,9 +15,11 @@ import json import os import queue +import shutil import subprocess import threading import time +from distutils.dir_util import copy_tree from enum import Enum from typing import List @@ -28,6 +30,13 @@ TRAIN_CONFIG = "/root/paddlejob/workspace/env_run/longjob/train.conf" TAR_BIN = "tar" +FLASH_DEVICE = os.getenv("PDC_FLASH_DEVICE", "/shared/dev/shm/flash") + + +def pdc_flash_device_available(): + # TODO(@gexiao): need better check + return os.path.exists(FLASH_DEVICE) + class PDCErrorCode(Enum): """Error Code For PDCTools usage""" @@ -48,6 +57,7 @@ class PDCErrorCode(Enum): InvalidArgument = 1503 CommandTimeout = 1504 CheckSumCommandFail = 1505 + CopyTreeFailed = 1506 UnknownError = 1999 @@ -493,14 +503,60 @@ def _download_file(self, remote_path: str, local_path: str) -> PDCErrorCode: raise Exception(f"exec cmd {download_cmd_args} with error: {e}") return error_code - def pdc_fc_generate_checksum(self, path: str) -> PDCErrorCode: + def _pdc_backup_failed_directory(self, path): + base_dir, target_path = os.path.split(os.path.normpath(path)) + failed_path = os.path.join(base_dir, f"{target_path}_failed") + if os.path.exists(path): + if os.path.exists(failed_path): + shutil.rmtree(failed_path) + # Backup failed files for debug + os.rename(path, failed_path) + + def pdc_backup_to_flash_device(self, persistent_path: str, flash_device_path: str) -> PDCErrorCode: + """backup data to flash device + + Args: + persistent_path str: persistent path + flash_device_path str: flash device path + """ + if not os.path.exists(persistent_path): + logger.error(f"{persistent_path} not exist") + return PDCErrorCode.LocalPathNotExist + + logger.info("starting backup to flash device...") + + # step 1: generate checksum for recovery + result = self.pdc_generate_dir_checksum(persistent_path) + if result != PDCErrorCode.Success: + logger.error(f"[Error] [pdc_sdk] generating checksum for {persistent_path} failed") + return result + + # step 2: copy persistent data to flash device + try: + copy_tree(persistent_path, flash_device_path) + logger.info(f"backup {persistent_path} to {flash_device_path} successed.") + except Exception as e: + logger.error(f"[Error] [pdc_sdk] copy tree {persistent_path} to {flash_device_path} failed, error: {e}") + self._pdc_backup_failed_directory(flash_device_path) + return PDCErrorCode.CopyTreeFailed + + # step 3: do checksum for storage on flash device + result = self.pdc_flash_do_check(flash_device_path) + if result == PDCErrorCode.Success: + return result + + logger.error(f"[Error] [pdc_sdk] checksum failed on {flash_device_path} after copy, backup for debug") + self._pdc_backup_failed_directory(flash_device_path) + return result + + def pdc_generate_dir_checksum(self, path: str) -> PDCErrorCode: """ Args :param localPath: :return: """ if 
not os.path.exists(path): - logger.error(f"pdc_fc_generate_checksum gi{path} not exist") + logger.error(f"pdc_generate_dir_checksum gi{path} not exist") return PDCErrorCode.CommandFail generate_checksum_args = [self._pdc_agent_bin, "-mode", "command", "-type", "generateSum", "-path", f"{path}"] error_code = PDCErrorCode.Success @@ -514,14 +570,14 @@ def pdc_fc_generate_checksum(self, path: str) -> PDCErrorCode: return PDCErrorCode.CheckSumCommandFail return error_code - def pdc_fc_do_check(self, path: str) -> PDCErrorCode: + def pdc_flash_do_check(self, path: str) -> PDCErrorCode: """ Args :param localPath: :return: """ if not os.path.exists(path): - logger.error(f"pdc_fc_do_check {path} not exist") + logger.error(f"pdc_flash_do_check {path} not exist") return PDCErrorCode.CommandFail generate_checksum_args = [self._pdc_agent_bin, "-mode", "command", "-type", "checkSum", "-path", f"{path}"] error_code = PDCErrorCode.Success @@ -530,8 +586,12 @@ def pdc_fc_do_check(self, path: str) -> PDCErrorCode: res, error_code = self._exec_cmd(generate_checksum_args) if error_code == PDCErrorCode.Success: logger.info(f"check_sum {path} successfully") + else: + logger.error(f"[Error] [pdc_sdk] check_sum {path} failed, error code: {error_code}") + self._pdc_backup_failed_directory(path) except Exception as e: - logger.error(f"exec cmd {generate_checksum_args} with error: {e}") + logger.error(f"[Error] [pdc_sdk] exec cmd {generate_checksum_args} with error: {e}") + self._pdc_backup_failed_directory(path) return PDCErrorCode.CheckSumCommandFail return error_code @@ -560,8 +620,10 @@ def _clean_tmp_files(self, tmp_files: List[str]): PDCErrorCode.AFSToolsNotExist: "afs tools not exist", PDCErrorCode.TrainConfigNotExist: "train config not exist", PDCErrorCode.LocalPathNotExist: "local path not exist", - PDCErrorCode.CommandFail: "download command fail", + PDCErrorCode.CommandFail: "pdc agent command fail", PDCErrorCode.CalculateHashFail: "calculate hash fail", PDCErrorCode.InvalidArgument: "invalid argument", - PDCErrorCode.CommandTimeout: "command timeout", + PDCErrorCode.CommandTimeout: "pdc agent command timeout", + PDCErrorCode.CheckSumCommandFail: "checksum command fail", + PDCErrorCode.CopyTreeFailed: "copy directory failed", } diff --git a/paddlenlp/utils/serialization.py b/paddlenlp/utils/serialization.py index 9b467ec14389..d928a37de375 100644 --- a/paddlenlp/utils/serialization.py +++ b/paddlenlp/utils/serialization.py @@ -160,6 +160,24 @@ def find_class(self, mod_name, name): return super().find_class(mod_name, name) +class SafeUnpickler(pickle.Unpickler): + """ + A safe unpickler that only allows loading of built-in basic data types. + """ + + def find_class(self, module, name): + """ + Overrides the find_class method to only allow loading of built-in basic data types. + + :param module: The module name. + :param name: The class name. + :return: The class if allowed, otherwise raises UnpicklingError. 
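A hedged usage sketch (not part of the patch) of the restricted unpickler: pickles made of plain builtin data load normally, while anything that needs to import another class is rejected with `UnpicklingError`.

```python
import io
import pickle
from collections import OrderedDict

from paddlenlp.utils.serialization import SafeUnpickler

plain = io.BytesIO(pickle.dumps({"lr": 1e-4, "steps": [1, 2, 3]}))
print(SafeUnpickler(plain).load())  # {'lr': 0.0001, 'steps': [1, 2, 3]}

risky = io.BytesIO(pickle.dumps(OrderedDict(a=1)))  # needs collections.OrderedDict on load
try:
    SafeUnpickler(risky).load()
except pickle.UnpicklingError as err:
    print(err)  # Unsafe object loading is prohibited: collections.OrderedDict
```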
+ """ + if module == "builtins" and name in {"int", "float", "str", "tuple", "list", "dict", "set"}: + return super().find_class(module, name) + raise pickle.UnpicklingError(f"Unsafe object loading is prohibited: {module}.{name}") + + def _rebuild_tensor_stage(storage, storage_offset, size, stride, requires_grad, backward_hooks): # if a tensor has shape [M, N] and stride is [1, N], it's column-wise / fortran-style # if a tensor has shape [M, N] and stride is [M, 1], it's row-wise / C-style diff --git a/scripts/codestyle/check_dead_links.py b/scripts/codestyle/check_dead_links.py index 1bf7ea85f8f0..343d69384429 100644 --- a/scripts/codestyle/check_dead_links.py +++ b/scripts/codestyle/check_dead_links.py @@ -35,6 +35,8 @@ def find_dead_links(directory): dead_links = [] for root, dirs, files in os.walk(directory): + if "third_party" in root: + continue for file in files: if file.endswith((".md", ".rst")): file_path = os.path.join(root, file) diff --git a/scripts/distribute/ci_case_auto.sh b/scripts/distribute/ci_case_auto.sh index 452d1f7ef5ac..e860cd2f3627 100755 --- a/scripts/distribute/ci_case_auto.sh +++ b/scripts/distribute/ci_case_auto.sh @@ -612,7 +612,7 @@ function llama_dy2st_auto_bs4_bf16_DP1-MP1-PP4-SD2() { --sharding "stage2" \ --pipeline_parallel_config "enable_send_recv_overlap" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate" \ - --sharding_parallel_config "enable_stage2_overlap" \ + --sharding_parallel_config "enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --to_static 1 \ --amp_custom_black_list "reduce_sum" "c_softmax_with_cross_entropy" \ @@ -712,7 +712,7 @@ function llama_dy2st_auto_bs4_bf16_DP1-MP1-PP4-SD2-VPP3_split_bw() { --sharding "stage2" \ --pipeline_parallel_config "enable_send_recv_overlap enable_split_backward" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate" \ - --sharding_parallel_config "enable_stage2_overlap" \ + --sharding_parallel_config "enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --to_static 1 \ --amp_custom_black_list "reduce_sum" "c_softmax_with_cross_entropy" \ @@ -1806,7 +1806,7 @@ function llama_baichuan_pir_auto_fuse_ffn_attention_qkv_DP2_MP2_PP2(){ --sequence_parallel false \ --sharding "stage1" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate " \ - --sharding_parallel_config "enable_stage1_overlap" \ + --sharding_parallel_config "enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --pipeline_parallel_config "enable_send_recv_overlap" \ --auto_parallel_resume_form_hybrid_parallel true \ @@ -1878,7 +1878,7 @@ function llama_baichuan_pir_auto_fuse_ffn_attention_qkv_DP2_MP2_PP2_intermediate --sequence_parallel false \ --sharding "stage1" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate " \ - --sharding_parallel_config "enable_stage1_overlap" \ + --sharding_parallel_config "enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --pipeline_parallel_config "enable_send_recv_overlap" \ --auto_parallel_resume_form_hybrid_parallel true \ @@ -2430,7 +2430,7 @@ function llm_gpt_pir_auto_bs8_DP2_TP2_PP2(){ --fp16_opt_level "O2" \ --num_hidden_layers 2 \ --intermediate_size 1024 \ - --sharding_parallel_config "enable_stage1_tensor_fusion enable_stage1_overlap" \ + --sharding_parallel_config "enable_tensor_fusion enable_overlap" \ --tensor_parallel_config 
"enable_mp_async_allreduce" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate" \ --pipeline_parallel_config "enable_send_recv_overlap enable_split_backward" \ @@ -2497,7 +2497,7 @@ function llm_gpt_pir_auto_bs8_DP2_TP2_PP2_intermediate(){ --fp16_opt_level "O2" \ --num_hidden_layers 2 \ --intermediate_size 1024 \ - --sharding_parallel_config "enable_stage1_tensor_fusion enable_stage1_overlap" \ + --sharding_parallel_config "enable_tensor_fusion enable_overlap" \ --tensor_parallel_config "enable_mp_async_allreduce" \ --data_parallel_config "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate" \ --pipeline_parallel_config "enable_send_recv_overlap enable_split_backward" \ diff --git a/scripts/regression/ci_case.sh b/scripts/regression/ci_case.sh index d9c233c31a1f..cca666c82ec9 100644 --- a/scripts/regression/ci_case.sh +++ b/scripts/regression/ci_case.sh @@ -22,6 +22,17 @@ export CXX_COMPILER_PATH=$(which g++) export CC=$(which gcc) export CXX=$(which g++) +export PADDLE_INFERENCE_MODEL_SUFFIX=$(python -c " +import paddle +try: + from paddle.base.framework import use_pir_api + pir_enabled = use_pir_api() +except ImportError: + pir_enabled = False +model_suffix = '.json' if pir_enabled else '.pdmodel' +print(model_suffix) +") + if [ ! -d "model_logs" ]; then mkdir model_logs fi @@ -363,7 +374,7 @@ lexical_analysis(){ print_info $? lexical_analysis_predict # deploy time (python deploy/predict.py \ - --model_file=infer_model/static_graph_params.pdmodel \ + --model_file=infer_model/static_graph_params${PADDLE_INFERENCE_MODEL_SUFFIX} \ --params_file=infer_model/static_graph_params.pdiparams \ --data_dir lexical_analysis_dataset_tiny >${log_path}/lexical_analysis_deploy) >>${log_path}/lexical_analysis_deploy 2>&1 print_info $? lexical_analysis_deploy @@ -467,7 +478,7 @@ ernie-csc() { python export_model.py --params_path ./checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params >${log_path}/ernie-csc_export >>${log_path}/ernie-csc_export 2>&1 print_info $? ernie-csc_export #python deploy - python predict.py --model_file infer_model/static_graph_params.pdmodel --params_file infer_model/static_graph_params.pdiparams >${log_path}/ernie-csc_deploy >>${log_path}/ernie-csc_deploy 2>&1 + python predict.py --model_file infer_model/static_graph_params${PADDLE_INFERENCE_MODEL_SUFFIX} --params_file infer_model/static_graph_params.pdiparams >${log_path}/ernie-csc_deploy >>${log_path}/ernie-csc_deploy 2>&1 print_info $? ernie-csc_deploy } diff --git a/scripts/regression/test_taskflow.py b/scripts/regression/test_taskflow.py index e8e6c69e4461..686bedca5efe 100644 --- a/scripts/regression/test_taskflow.py +++ b/scripts/regression/test_taskflow.py @@ -13,6 +13,7 @@ # limitations under the License. 
"""Test taskflow.""" import os +import unittest from paddlenlp import Taskflow @@ -68,6 +69,7 @@ def test_corrector(): corrector("遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。") +@unittest.skip("dependency_parsing is not support for Paddle >= 2.6.1") def test_dependency_parsing(): """ test_dependency_parsing diff --git a/slm/examples/RLHF/ppo_trainer.py b/slm/examples/RLHF/ppo_trainer.py index c2c72d6c5cd1..bdec462411e0 100644 --- a/slm/examples/RLHF/ppo_trainer.py +++ b/slm/examples/RLHF/ppo_trainer.py @@ -66,6 +66,7 @@ speed_metrics, ) from paddlenlp.transformers import PretrainedModel, PretrainedTokenizer +from paddlenlp.utils import empty_device_cache class StepTrainer(Trainer): @@ -1032,7 +1033,7 @@ def gen_epoch_data(): ptx_batches = [None for _ in range(len(rl_batches))] self.timers and self.timers("ptx-batch").stop() - paddle.device.cuda.empty_cache() + empty_device_cache() self.set_train() for _ in range(self.args.update_iters): @@ -1152,7 +1153,7 @@ def train( # ##### model and optimizer related setting ##### policy_model, value_model = self.init_train_model_opt(max_steps, resume_from_checkpoint) - paddle.device.cuda.empty_cache() + empty_device_cache() # ##### traing statistic logging ##### # Number of trainable parameters only account for policy_model @@ -1208,7 +1209,7 @@ def train( # with self.enable(self.value_trainer.optimizer): with self.enable(): # put value optimizer guard in rl_step rl_info = self.rl_step(rl_batch) - paddle.device.cuda.empty_cache() + empty_device_cache() self.timers and self.timers("rl_step").stop() if self.use_ptx: @@ -1224,7 +1225,7 @@ def train( ptx_info = self.ptx_step(ptx_batch) rl_info.update(ptx_info) self.timers and self.timers("ptx_step").stop() - paddle.device.cuda.empty_cache() + empty_device_cache() self.state.global_step += 1 self.state.epoch = epoch + (step + 1) / steps_in_epoch diff --git a/slm/examples/machine_reading_comprehension/SQuAD/run_squad.py b/slm/examples/machine_reading_comprehension/SQuAD/run_squad.py index 9e2419ad0255..51216464b366 100644 --- a/slm/examples/machine_reading_comprehension/SQuAD/run_squad.py +++ b/slm/examples/machine_reading_comprehension/SQuAD/run_squad.py @@ -250,7 +250,7 @@ def run(args): partial(prepare_train_features, tokenizer=tokenizer, args=args), batched=True, remove_columns=column_names, - num_proc=4, + num_proc=1, ) train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) train_batchify_fn = DataCollatorWithPadding(tokenizer) @@ -332,7 +332,7 @@ def run(args): partial(prepare_validation_features, tokenizer=tokenizer, args=args), batched=True, remove_columns=column_names, - num_proc=4, + num_proc=1, ) dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) dev_ds_for_model = dev_ds.remove_columns(["example_id", "offset_mapping"]) diff --git a/slm/model_zoo/chinesebert/utils.py b/slm/model_zoo/chinesebert/utils.py index a58fc7edb214..712d1a65ccbc 100755 --- a/slm/model_zoo/chinesebert/utils.py +++ b/slm/model_zoo/chinesebert/utils.py @@ -29,6 +29,7 @@ LinearDecayWithWarmup, PolyDecayWithWarmup, ) +from paddlenlp.utils.serialization import SafeUnpickler scheduler_type2cls = { "linear": LinearDecayWithWarmup, @@ -121,7 +122,7 @@ def save_pickle(data, file_path): def load_pickle(input_file): with open(str(input_file), "rb") as f: - data = pickle.load(f) + data = SafeUnpickler(f).load() return data diff --git a/slm/model_zoo/ernie-3.0/README.md b/slm/model_zoo/ernie-3.0/README.md index f7334127c56d..c8682d7e3b93 100644 --- 
a/slm/model_zoo/ernie-3.0/README.md +++ b/slm/model_zoo/ernie-3.0/README.md @@ -1329,6 +1329,7 @@ batch_size=32 和 1,预测精度为 FP16 时,GPU 下的效果-时延图: - paddlepaddle >= 2.3 - paddlenlp >= 2.4 - paddleslim >= 2.4 +- evaluate ### 数据准备 此次微调数据主要是以 CLUE benchmark 数据集为主, CLUE benchmark 包括了文本分类、实体抽取、问答三大类数据集,而 CLUE benchmark 数据目前已经集成在 PaddleNLP 的 datasets 里面,可以通过下面的方式来使用数据集 diff --git a/slm/model_zoo/ernie-3.0/run_qa.py b/slm/model_zoo/ernie-3.0/run_qa.py index 6c3b03d70c49..e1d6a73847a5 100644 --- a/slm/model_zoo/ernie-3.0/run_qa.py +++ b/slm/model_zoo/ernie-3.0/run_qa.py @@ -105,7 +105,7 @@ def main(): train_dataset = train_dataset.map( partial(prepare_train_features, tokenizer=tokenizer, args=data_args), batched=True, - num_proc=4, + num_proc=1, batch_size=4, remove_columns=column_names, load_from_cache_file=not data_args.overwrite_cache, @@ -118,7 +118,7 @@ def main(): eval_dataset = eval_examples.map( partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), batched=True, - num_proc=4, + num_proc=1, batch_size=4, remove_columns=column_names, load_from_cache_file=not data_args.overwrite_cache, @@ -132,7 +132,7 @@ def main(): predict_dataset = predict_examples.map( partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), batched=True, - num_proc=4, + num_proc=1, batch_size=4, remove_columns=column_names, load_from_cache_file=not data_args.overwrite_cache, diff --git a/slm/model_zoo/ernie-3.0/run_token_cls.py b/slm/model_zoo/ernie-3.0/run_token_cls.py index f805ad307c6f..a9022966526c 100644 --- a/slm/model_zoo/ernie-3.0/run_token_cls.py +++ b/slm/model_zoo/ernie-3.0/run_token_cls.py @@ -19,7 +19,7 @@ import numpy as np import paddle import paddle.nn as nn -from datasets import load_metric +from evaluate import load as load_metric from utils import DataArguments, ModelArguments, load_config, token_convert_example import paddlenlp diff --git a/slm/model_zoo/t5/utils.py b/slm/model_zoo/t5/utils.py index 303f70af2a71..558feca47ec2 100644 --- a/slm/model_zoo/t5/utils.py +++ b/slm/model_zoo/t5/utils.py @@ -28,6 +28,7 @@ LinearDecayWithWarmup, PolyDecayWithWarmup, ) +from paddlenlp.utils.serialization import SafeUnpickler def accuracy(targets, predictions): @@ -158,5 +159,5 @@ def save_pickle(data, file_path): def load_pickle(input_file): with open(str(input_file), "rb") as f: - data = pickle.load(f) + data = SafeUnpickler(f).load() return data diff --git a/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py b/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py index cd60ae8ec765..0b1e9b6017a9 100644 --- a/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py +++ b/slm/pipelines/examples/contrastive_training/evaluation/eval_mteb.py @@ -14,16 +14,12 @@ import argparse import logging -import sys import mteb -import paddle -from models.modeling import BiEncoderModel -from models.modeling_nv import NVEncodeModel from mteb import MTEB from paddlenlp.peft import LoRAConfig, LoRAModel -from paddlenlp.transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer +from paddlenlp.transformers import AutoTokenizer, BiEncoderModel, NVEncodeModel class MTEB_EvalModel: diff --git a/slm/pipelines/examples/contrastive_training/train.py b/slm/pipelines/examples/contrastive_training/train.py index d9d27e5fa01b..c18263dff95d 100644 --- a/slm/pipelines/examples/contrastive_training/train.py +++ b/slm/pipelines/examples/contrastive_training/train.py @@ -17,12 +17,10 @@ from arguments import DataArguments, ModelArguments from arguments import 
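Note on the pickle changes above: both `load_pickle` helpers (chinesebert and t5) now go through `paddlenlp.utils.serialization.SafeUnpickler` instead of a bare `pickle.load`, which would execute arbitrary callables embedded in a crafted pickle file. The library class itself is not reproduced in this patch; the sketch below shows the standard restricted-unpickler pattern it is presumably built on, and the allow-list entries are illustrative, not taken from PaddleNLP.

```python
# Illustrative restricted unpickler, not the actual SafeUnpickler from
# paddlenlp.utils.serialization. Only allow-listed globals may be resolved, which
# blocks payloads that try to smuggle in callables such as os.system.
import pickle

_ALLOWED = {
    ("builtins", "dict"),
    ("builtins", "list"),
    ("collections", "OrderedDict"),
    ("numpy", "ndarray"),   # illustrative entries for plain numpy payloads
    ("numpy", "dtype"),
}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in _ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def safe_load(path):
    # Drop-in replacement for pickle.load(open(path, "rb")) in the utils above.
    with open(path, "rb") as f:
        return RestrictedUnpickler(f).load()
```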
RetrieverTrainingArguments as TrainingArguments from data import EmbedCollator, TrainDatasetForEmbedding -from models.modeling import BiEncoderModel -from models.modeling_nv import NVEncodeModel from paddlenlp.peft import LoRAConfig, LoRAModel from paddlenlp.trainer import PdArgumentParser, Trainer, get_last_checkpoint, set_seed -from paddlenlp.transformers import AutoTokenizer +from paddlenlp.transformers import AutoTokenizer, BiEncoderModel, NVEncodeModel from paddlenlp.utils.log import logger diff --git a/tests/llm/test_gradio.py b/tests/llm/test_gradio.py index 731c5f9bf6d3..2ab830583b25 100644 --- a/tests/llm/test_gradio.py +++ b/tests/llm/test_gradio.py @@ -1,3 +1,4 @@ +#!/usr/bin/env python # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -46,10 +47,9 @@ def setUp(self): self.model_path = "__internal_testing__/micro-random-llama" command = ( "cd ./llm && PYTHONPATH=../:$PYTHONPATH" - + ' {python} predict/flask_server.py --model_name_or_path {model_path} --port {port} --flask_port {flask_port} --src_length 1024 --dtype "float16"'.format( - flask_port=self.flask_port, port=self.port, model_path=self.model_path, python=sys.executable - ) - ) + + " {python} predict/flask_server.py --model_name_or_path {model_path} " + + '--port {port} --flask_port {flask_port} --src_length 1024 --dtype "float16"' + ).format(flask_port=self.flask_port, port=self.port, model_path=self.model_path, python=sys.executable) current_env = copy.copy(os.environ.copy()) current_env.pop("http_proxy", None) current_env.pop("https_proxy", None) @@ -58,7 +58,6 @@ def setUp(self): self.ui_process = subprocess.Popen(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=current_env) self.tokenizer = LlamaTokenizer.from_pretrained(self.model_path) - return super().setUp() def tearDown(self): @@ -79,13 +78,11 @@ def wait_until_server_is_ready(self): while True: if is_port_in_use(self.flask_port) and is_port_in_use(self.port): break - print("waiting for server ...") time.sleep(1) def get_gradio_ui_result(self, *args, **kwargs): _, _, file = self.client.predict(*args, **kwargs) - with open(file, "r", encoding="utf-8") as f: content = json.load(f) return content[-1]["utterance"] @@ -95,64 +92,57 @@ def test_argument(self): self.wait_until_server_is_ready() def get_response(data): - res = requests.post(f"http://localhost:{self.flask_port}/api/chat", json=data, stream=True) + res = requests.post(f"http://localhost:{self.flask_port}/v1/chat/completions", json=data, stream=True) result_ = "" for line in res.iter_lines(): - print(line) - result = json.loads(line) - bot_response = result["result"]["response"] - - if bot_response["utterance"].endswith("[END]"): - bot_response["utterance"] = bot_response["utterance"][:-5] - - result_ += bot_response["utterance"] - + if not line: + continue + decoded_line = line.decode("utf-8").strip() + # 如果返回行以 "data:" 开头,则去除该前缀 + if decoded_line.startswith("data:"): + data_str = decoded_line[len("data:") :].strip() + else: + data_str = decoded_line + if data_str == "[DONE]": + break + chunk = json.loads(data_str) + # 根据 OpenAI 的流式返回,每个 chunk 在 choices[0]["delta"] 中包含回复增量 + delta = chunk["choices"][0]["delta"].get("content", "") + result_ += delta return result_ + # 测试用例1:greedy search 模式(top_p 为1.0) data = { - "context": "你好", - "top_k": 1, - "top_p": 1.0, + "messages": [{"role": "user", "content": "你好"}], "temperature": 1.0, - "repetition_penalty": 1.0, - "max_length": 20, - "min_length": 1, + 
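Note on the rewritten `get_response` helper above: the Gradio/Flask test now targets an OpenAI-compatible `/v1/chat/completions` endpoint and accumulates streamed `choices[0]["delta"]["content"]` chunks until the `[DONE]` sentinel. Outside the test, the same parsing can be packaged as a small standalone client; the base URL, port, and default parameters below are placeholders rather than values taken from this patch.

```python
# Minimal streaming client for an OpenAI-compatible chat-completions endpoint,
# mirroring the parsing done in the test above. URL and defaults are placeholders.
import json

import requests


def stream_chat(prompt, base_url="http://localhost:8010", max_tokens=64, top_p=0.7):
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "max_tokens": max_tokens,
        "top_p": top_p,
        "stream": True,
    }
    text = ""
    with requests.post(f"{base_url}/v1/chat/completions", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            data = line.decode("utf-8").strip()
            if data.startswith("data:"):        # strip the SSE prefix if present
                data = data[len("data:"):].strip()
            if data == "[DONE]":                # end-of-stream sentinel
                break
            chunk = json.loads(data)
            text += chunk["choices"][0]["delta"].get("content", "")
    return text
```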
"max_tokens": 20, + "top_p": 1.0, + "stream": True, } - # Case 1: greedy search - # result_0 = get_response(data) result_1 = get_response(data) - # TODO(wj-Mcat): enable logit-comparision later - # assert result_0 == result_1 - + # 测试用例2:采样模式(top_p 为 0.7) data = { - "context": "你好", - "top_k": 0, - "top_p": 0.7, + "messages": [{"role": "user", "content": "你好"}], "temperature": 1.0, - "repetition_penalty": 1.0, - "max_length": 20, - "min_length": 1, + "max_tokens": 20, + "top_p": 0.7, + "stream": True, } - - # Case 2: sampling result_2 = get_response(data) - # assert result_1 != result_2 - # 测试长度应该保持一致 + # 对生成文本的长度进行简单检测 assert 10 <= len(self.tokenizer.tokenize(result_1)) <= 50 assert 10 <= len(self.tokenizer.tokenize(result_2)) <= 50 + # 测试用例3:更长的 max_tokens 参数 data = { - "context": "你好", - "top_k": 1, - "top_p": 0.7, + "messages": [{"role": "user", "content": "你好"}], "temperature": 1.0, - "repetition_penalty": 1.0, - "max_length": 100, - "min_length": 1, + "max_tokens": 100, + "top_p": 0.7, + "stream": True, } - # Case 3: max_length result_3 = get_response(data) assert result_3 != result_2 assert 70 <= len(self.tokenizer.tokenize(result_3)) <= 150 diff --git a/tests/mergekit/test_merge_config.py b/tests/mergekit/test_merge_config.py index 7a0dafea8f57..5bdc1abc2b36 100644 --- a/tests/mergekit/test_merge_config.py +++ b/tests/mergekit/test_merge_config.py @@ -30,10 +30,6 @@ def test_save_load(self): MergeConfig.from_pretrained("./rand") def test_raise_exception(self): - with self.assertRaises(ValueError): - MergeConfig( - tensor_type="pd", - ) with self.assertRaises(ValueError): MergeConfig(merge_method="linear1") with self.assertRaises(ValueError): diff --git a/tests/mergekit/test_merge_method.py b/tests/mergekit/test_merge_method.py index 22c90c110d7b..10ae98fb3f43 100644 --- a/tests/mergekit/test_merge_method.py +++ b/tests/mergekit/test_merge_method.py @@ -15,6 +15,7 @@ import unittest import numpy as np +import paddle from paddlenlp.mergekit import MergeConfig, MergeMethod @@ -72,6 +73,10 @@ def setUpClass(cls): ) cls.tensor_list = [t1, t2] + @classmethod + def to_paddle_tensor(cls, numpy_tensors): + return [paddle.to_tensor(tensor, dtype="float32") for tensor in numpy_tensors] + def test_linear(self): merge_config = MergeConfig( merge_type="linear", @@ -186,3 +191,133 @@ def test_ties(self): dtype="float32", ) self.assertTrue(np.array_equal(merged_tensor, expected_result)) + + def test_linear_paddle(self): + paddle_tensor_list = self.to_paddle_tensor(self.tensor_list) + merge_config = MergeConfig( + merge_type="linear", + weight_list=[2, 8], + normalize=True, + tensor_type="pd", + ) + merge_method = MergeMethod(merge_config=merge_config) + merged_tensor = merge_method.merge(paddle_tensor_list) + self.assertEqual(merged_tensor.shape, [4, 5]) + expected_result = paddle.to_tensor( + [ + [ + 0.2324269860982895, + -0.23574663698673248, + 0.490084171295166, + -0.27267152070999146, + 0.5471049547195435, + ], + [ + 0.3380300998687744, + -0.8378675580024719, + 0.6158430576324463, + 0.05510251224040985, + -0.13901998102664948, + ], + [ + 0.9436129927635193, + -0.17925502359867096, + 0.17603228986263275, + 0.581262469291687, + 0.17480896413326263, + ], + [ + 0.5418604016304016, + -0.2867910861968994, + 0.022852152585983276, + 0.6011121273040771, + 0.2119656205177307, + ], + ], + dtype="float32", + ) + self.assertTrue( + paddle.allclose(merged_tensor, expected_result, atol=1e-6), + "Paddle linear merge result does not match expected result.", + ) + + def test_slerp_paddle(self): + 
paddle_tensor_list = self.to_paddle_tensor(self.tensor_list) + merge_config = MergeConfig( + merge_type="slerp", + slerp_alpha=0.5, + tensor_type="pd", + ) + merge_method = MergeMethod(merge_config=merge_config) + merged_tensor = merge_method.merge(paddle_tensor_list) + self.assertEqual(merged_tensor.shape, [4, 5]) + expected_result = paddle.to_tensor( + [ + [ + -0.241766095161438, + 0.20225590467453003, + -0.08889424800872803, + 0.19946154952049255, + 0.5972206592559814, + ], + [0.704862117767334, -0.9960722923278809, 1.0701193809509277, 0.08988308906555176, 0.17587755620479584], + [ + 1.3427623510360718, + -0.28751814365386963, + 0.6157845854759216, + 0.6003049612045288, + -0.30050763487815857, + ], + [0.8112550973892212, 0.2528044283390045, -0.1691504418849945, 0.8349930644035339, 0.5639800429344177], + ], + dtype="float32", + ) + self.assertTrue( + paddle.allclose(merged_tensor, expected_result, atol=1e-6), + "Paddle slerp merge result does not match expected result.", + ) + with self.assertRaises(ValueError): + merge_method.merge(paddle_tensor_list + paddle_tensor_list) + + def test_ties_paddle(self): + paddle_tensor_list = self.to_paddle_tensor(self.tensor_list) + merge_config = MergeConfig( + merge_type="ties", + weight_list=[2, 8], + normalize=True, + tensor_type="pd", + ) + merge_method = MergeMethod(merge_config=merge_config) + merged_tensor = merge_method.merge(paddle_tensor_list) + self.assertEqual(merged_tensor.shape, [4, 5]) + expected_result = paddle.to_tensor( + [ + [0.49925723671913147, -0.4865064024925232, 0.8579433560371399, -0.546754777431488, 0.5471049547195435], + [ + 0.3380300998687744, + -0.8378675580024719, + 0.6158429980278015, + 0.05510251596570015, + -0.3130885660648346, + ], + [ + 0.9436129331588745, + -0.17925502359867096, + 0.17603227496147156, + 0.5812624096870422, + 0.43041032552719116, + ], + [ + 0.5418604016304016, + -0.5949721932411194, + 0.11636247485876083, + 0.6011120676994324, + 0.21196560561656952, + ], + ], + dtype="float32", + ) + self.assertTrue( + paddle.allclose(merged_tensor, expected_result, atol=1e-6), + "Paddle ties merge result does not match expected result.", + ) diff --git a/tests/mergekit/test_merge_model.py b/tests/mergekit/test_merge_model.py index c87bd522c4f6..9deb25a20292 100644 --- a/tests/mergekit/test_merge_model.py +++ b/tests/mergekit/test_merge_model.py @@ -24,7 +24,7 @@ class TestMergeModel(unittest.TestCase): @parameterized.expand([("slerp",), ("della",), ("dare_linear",), ("ties",)]) - def test_merge_model(self, merge_method): + def test_merge_model_np(self, merge_method): with TemporaryDirectory() as tempdir: model = AutoModel.from_pretrained("__internal_testing__/tiny-random-bert", dtype="bfloat16") pd_path = os.path.join(tempdir, "pd_model") @@ -66,3 +66,51 @@ def test_merge_model(self, merge_method): ) mergekit = MergeModel(merge_config) mergekit.merge_model() + + @parameterized.expand([("slerp",), ("della",), ("dare_linear",), ("ties",)]) + def test_merge_model_pd(self, merge_method): + with TemporaryDirectory() as tempdir: + model = AutoModel.from_pretrained("__internal_testing__/tiny-random-bert", dtype="bfloat16") + pd_path = os.path.join(tempdir, "pd_model") + model.save_pretrained(pd_path) + safe_path = os.path.join(tempdir, "safe_model") + model.save_pretrained(safe_path, safe_serialization="safetensors") + + # test mix + merge_config = MergeConfig( + merge_method=merge_method, model_path_list=[safe_path, pd_path], output_path=tempdir, tensor_type="pd" + ) + mergekit = MergeModel(merge_config) + 
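Background for `test_slerp_paddle` and the other paddle-tensor merge cases above: slerp interpolates along the great circle between two flattened weight tensors instead of along the straight line that plain linear merging uses. The sketch below is the textbook spherical-linear-interpolation rule, not PaddleNLP's MergeMethod implementation, and it falls back to ordinary lerp when the two tensors are nearly parallel.

```python
# Textbook slerp between two paddle tensors; for background only, this is not the
# project's implementation.
import paddle


def slerp(t, v0, v1, eps=1e-8):
    v0_flat, v1_flat = v0.flatten(), v1.flatten()
    # Angle between the two directions.
    v0_n = v0_flat / (paddle.norm(v0_flat) + eps)
    v1_n = v1_flat / (paddle.norm(v1_flat) + eps)
    cos_omega = paddle.clip(paddle.dot(v0_n, v1_n), -1.0, 1.0)
    omega = paddle.acos(cos_omega)
    if float(paddle.abs(omega)) < eps:
        # Nearly parallel tensors: the spherical weights are ill-conditioned, use lerp.
        return (1.0 - t) * v0 + t * v1
    sin_omega = paddle.sin(omega)
    s0 = paddle.sin((1.0 - t) * omega) / sin_omega
    s1 = paddle.sin(t * omega) / sin_omega
    return (s0 * v0_flat + s1 * v1_flat).reshape(v0.shape)
```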
mergekit.merge_model() + + # test mix with base model + merge_config = MergeConfig( + merge_method=merge_method, + model_path_list=[safe_path, pd_path], + output_path=tempdir, + base_model_path=safe_path, + tensor_type="pd", + ) + mergekit = MergeModel(merge_config) + mergekit.merge_model() + + # test safetensor only + merge_config = MergeConfig( + merge_method=merge_method, + model_path_list=[safe_path, safe_path], + output_path=tempdir, + tensor_type="pd", + ) + mergekit = MergeModel(merge_config) + mergekit.merge_model() + + # test safetensor only with base model + merge_config = MergeConfig( + merge_method=merge_method, + model_path_list=[safe_path, safe_path], + output_path=tempdir, + tensor_type="pd", + base_model_path=safe_path, + ) + mergekit = MergeModel(merge_config) + mergekit.merge_model() diff --git a/tests/mergekit/test_sparsify_method.py b/tests/mergekit/test_sparsify_method.py index 28054ac0958b..331f2740ede0 100644 --- a/tests/mergekit/test_sparsify_method.py +++ b/tests/mergekit/test_sparsify_method.py @@ -15,6 +15,7 @@ import unittest import numpy as np +import paddle from paddlenlp.mergekit import MergeConfig, SparsifyMethod @@ -81,3 +82,57 @@ def test_trim(self): dtype="float32", ) self.assertTrue(np.array_equal(sparsify_tensor, expected_result)) + + @classmethod + def to_paddle_tensor(cls, numpy_tensor): + """Convert a numpy array to a paddle tensor.""" + return paddle.to_tensor(numpy_tensor, dtype="float32") + + def test_none_paddle(self): + paddle_tensor = self.to_paddle_tensor(self.tensor) + merge_config = MergeConfig(sparsify_type=None, tensor_type="pd") + sparsify_method = SparsifyMethod(merge_config=merge_config) + sparsify_tensor = sparsify_method.sparsify(paddle_tensor) + self.assertEqual(sparsify_tensor.shape, paddle_tensor.shape) + self.assertTrue( + paddle.allclose(sparsify_tensor, paddle_tensor, atol=1e-6), + "Paddle tensor sparsify (none) failed to match input tensor.", + ) + + def test_dare_paddle(self): + paddle.seed(42) # Fix random seed for reproducibility + paddle_tensor = self.to_paddle_tensor(self.tensor) + merge_config = MergeConfig(sparsify_type="dare", rescale=True, reserve_p=0.7, tensor_type="pd") + sparsify_method = SparsifyMethod(merge_config=merge_config) + sparsify_tensor = sparsify_method.sparsify(paddle_tensor) + self.assertEqual(sparsify_tensor.shape, paddle_tensor.shape) + + def test_magprune_paddle(self): + paddle.seed(42) # Fix random seed for reproducibility + paddle_tensor = self.to_paddle_tensor(self.tensor) + merge_config = MergeConfig(sparsify_type="magprune", rescale=True, reserve_p=0.7, tensor_type="pd") + sparsify_method = SparsifyMethod(merge_config=merge_config) + sparsify_tensor = sparsify_method.sparsify(paddle_tensor) + self.assertEqual(sparsify_tensor.shape, paddle_tensor.shape) + + def test_trim_paddle(self): + paddle.seed(42) # Fix random seed for reproducibility + paddle_tensor = self.to_paddle_tensor(self.tensor) + merge_config = MergeConfig(sparsify_type="trim", rescale=True, reserve_p=0.7, tensor_type="pd") + sparsify_method = SparsifyMethod(merge_config=merge_config) + sparsify_tensor = sparsify_method.sparsify(paddle_tensor) + self.assertEqual(sparsify_tensor.shape, paddle_tensor.shape) + + expected_result = paddle.to_tensor( + [ + [-0.9439255595207214, 0.867495596408844, -1.1095106601715088, 0.9312260150909424, 0.0], + [0.8381496071815491, 0.0, 1.0790561437606812, 0.0, 0.6300279498100281], + [1.0320085287094116, 0.0, 0.956987738609314, 0.0, -0.958286702632904], + [0.6767225861549377, 1.0694657564163208, 0.0, 
0.6147512197494507, 0.7808632254600525], + ], + dtype="float32", + ) + self.assertTrue( + paddle.allclose(sparsify_tensor, expected_result, atol=1e-6), + "Paddle tensor sparsify (trim) result does not match expected result.", + ) diff --git a/tests/taskflow/test_text_classification.py b/tests/taskflow/test_text_classification.py index eb2469d6b099..43ac3c0c361c 100644 --- a/tests/taskflow/test_text_classification.py +++ b/tests/taskflow/test_text_classification.py @@ -24,7 +24,6 @@ PromptModelForSequenceClassification, SoftVerbalizer, ) -from paddlenlp.taskflow import Taskflow from paddlenlp.taskflow.text_classification import TextClassificationTask from paddlenlp.transformers import ( AutoModelForMaskedLM, @@ -145,60 +144,60 @@ def test_classification_task(self, batch_size, problem_type, model): if model == "multi_label": self.assertGreater(dygraph_pred["score"], dygraph_taskflow.multilabel_threshold) - @unittest.skip("numerical error") - @parameterized.expand( - [ - (1, "multi_class", "finetune"), - (1, "multi_class", "prompt"), - (1, "multi_label", "finetune"), - (1, "multi_label", "prompt"), - ] - ) - def test_taskflow_task(self, batch_size, problem_type, mode): - input_text = ["百度", "深度学习框架", "飞桨", "PaddleNLP"] - id2label = { - 0: "negative", - 1: "positive", - } - if mode == "finetune": - dygraph_model_path = self.finetune_dygraph_model_path - static_model_path = self.finetune_static_model_path - else: - dygraph_model_path = self.prompt_dygraph_model_path - static_model_path = self.prompt_static_model_path - - dygraph_taskflow = Taskflow( - mode=mode, - task="text_classification", - task_path=dygraph_model_path, - id2label=id2label, - batch_size=batch_size, - device_id=0, - problem_type=problem_type, - ) - - dygraph_results = dygraph_taskflow(input_text) - - self.assertEqual(len(dygraph_results), len(input_text)) - - static_taskflow = Taskflow( - mode=mode, - task="text_classification", - is_static_model=True, - task_path=static_model_path, - id2label=id2label, - batch_size=batch_size, - device_id=0, - problem_type=problem_type, - ) - - static_results = static_taskflow(input_text) - self.assertEqual(len(static_results), len(input_text)) - - for dygraph_result, static_result in zip(dygraph_results, static_results): - for dygraph_pred, static_pred in zip(dygraph_result["predictions"], static_result["predictions"]): - self.assertEqual(dygraph_pred["label"], static_pred["label"]) - self.assertAlmostEqual(dygraph_pred["score"], static_pred["score"], delta=1e-6) - # if multi_label, all predictions should be greater than the threshold - if mode == "multi_label": - self.assertGreater(dygraph_pred["score"], dygraph_taskflow.task_instance.multilabel_threshold) + # @unittest.skip("numerical error") + # @parameterized.expand( + # [ + # (1, "multi_class", "finetune"), + # (1, "multi_class", "prompt"), + # (1, "multi_label", "finetune"), + # (1, "multi_label", "prompt"), + # ] + # ) + # def test_taskflow_task(self, batch_size, problem_type, mode): + # input_text = ["百度", "深度学习框架", "飞桨", "PaddleNLP"] + # id2label = { + # 0: "negative", + # 1: "positive", + # } + # if mode == "finetune": + # dygraph_model_path = self.finetune_dygraph_model_path + # static_model_path = self.finetune_static_model_path + # else: + # dygraph_model_path = self.prompt_dygraph_model_path + # static_model_path = self.prompt_static_model_path + + # dygraph_taskflow = Taskflow( + # mode=mode, + # task="text_classification", + # task_path=dygraph_model_path, + # id2label=id2label, + # batch_size=batch_size, + # device_id=0, + 
# problem_type=problem_type, + # ) + + # dygraph_results = dygraph_taskflow(input_text) + + # self.assertEqual(len(dygraph_results), len(input_text)) + + # static_taskflow = Taskflow( + # mode=mode, + # task="text_classification", + # is_static_model=True, + # task_path=static_model_path, + # id2label=id2label, + # batch_size=batch_size, + # device_id=0, + # problem_type=problem_type, + # ) + + # static_results = static_taskflow(input_text) + # self.assertEqual(len(static_results), len(input_text)) + + # for dygraph_result, static_result in zip(dygraph_results, static_results): + # for dygraph_pred, static_pred in zip(dygraph_result["predictions"], static_result["predictions"]): + # self.assertEqual(dygraph_pred["label"], static_pred["label"]) + # self.assertAlmostEqual(dygraph_pred["score"], static_pred["score"], delta=1e-6) + # # if multi_label, all predictions should be greater than the threshold + # if mode == "multi_label": + # self.assertGreater(dygraph_pred["score"], dygraph_taskflow.task_instance.multilabel_threshold) diff --git a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP1_MP2_PP4_VPP5_Sharding4_Stage1.sh b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP4_MP2_PP4.sh similarity index 94% rename from tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP1_MP2_PP4_VPP5_Sharding4_Stage1.sh rename to tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP4_MP2_PP4.sh index 60a09cda3c51..4de325035360 100644 --- a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP1_MP2_PP4_VPP5_Sharding4_Stage1.sh +++ b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/N4C32/gpt-3-13b_pretrain_bs128_bf16_DP4_MP2_PP4.sh @@ -13,7 +13,7 @@ # limitations under the License. 
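Note on the renamed N4C32 case above: the gpt-3-13b script moves from a `DP1_MP2_PP4_VPP5_Sharding4_Stage1` layout to plain `DP4_MP2_PP4`, and the matching JSON config drops the virtual-pipeline settings accordingly. A quick sanity check of the implied layout on 32 GPUs (a sketch, not project code; stage-1 sharding is applied within the resulting data-parallel group):

```python
# With 32 GPUs, 2-way tensor parallel and 4-way pipeline parallel leave a 4-way
# data-parallel group, matching the new run_mode name DP4_MP2_PP4.
def data_parallel_degree(world_size, tp, pp):
    assert world_size % (tp * pp) == 0, "parallel degrees must divide the world size"
    return world_size // (tp * pp)


print(data_parallel_degree(32, tp=2, pp=4))  # -> 4
```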
param="model_item=gpt-3-13b_pretrain " -param+="run_mode=DP1_MP2_PP4_VPP5_Sharding4_Stage1 " +param+="run_mode=DP4_MP2_PP4 " param+="device_num=N4C32 " param+="global_batch_size=128 " param+="nnodes=4 " diff --git a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json index 67504a1f8518..8a354b585224 100644 --- a/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json +++ b/tests/test_tipc/dygraph/hybrid_parallelism/gpt3/auto_config_gpt3_13b/pretrain-gpt3_13b-config.json @@ -8,15 +8,18 @@ "gradient_accumulation_steps": 32, "tensor_parallel_degree": 2, "pipeline_parallel_degree": 4, - "virtual_pp_degree": 5, - "sequence_parallel": 0, + "virtual_pp_degree": 1, + "sequence_parallel": 1, "sharding": "stage1", "pipeline_parallel_config": "enable_sharding_comm_overlap enable_release_grads ", "tensor_parallel_config": "enable_mp_async_allreduce enable_sp_async_reduce_scatter enable_mp_skip_c_identity enable_mp_fused_linear_param_grad_add", "per_device_train_batch_size": 1, "use_flash_attention": true, + "use_fast_layer_norm": true, "use_fused_rms_norm": true, "fuse_attention_qkv": true, + "use_fused_linear": true, + "use_fused_dropout_add": true, "use_fused_rope": true, "fuse_attention_ffn": true, "enable_linear_fused_grad_add": true, @@ -25,12 +28,12 @@ "scale_loss": 1024, "learning_rate": 1e-05, "min_learning_rate": 5e-06, - "max_steps": 200, + "max_steps": 500, "save_steps": 5000, "weight_decay": 0.01, "warmup_ratio": 0.01, "max_grad_norm": 1.0, - "logging_steps": 2, + "logging_steps": 5, "dataloader_num_workers": 1, "eval_steps": 1000, "disable_tqdm": true, diff --git a/tests/test_tipc/dygraph/hybrid_parallelism/llama2/benchmark_common/run_benchmark.sh b/tests/test_tipc/dygraph/hybrid_parallelism/llama2/benchmark_common/run_benchmark.sh index 9afb2a0902c8..c45ea93451c5 100644 --- a/tests/test_tipc/dygraph/hybrid_parallelism/llama2/benchmark_common/run_benchmark.sh +++ b/tests/test_tipc/dygraph/hybrid_parallelism/llama2/benchmark_common/run_benchmark.sh @@ -57,6 +57,61 @@ function _set_params(){ OUTPUT_PATH=${run_log_path}/output } +# 循环监控文件写入状态和进程状态 +monitor_log_file() { + local log_file="$1" # 获取日志文件路径 + local training_pid="$2" # 获取训练进程的 PID + local no_update_duration=0 # 初始化无更新时长计数 + local last_size=0 + local kill_flag_file="/tmp/monitor_killed_$training_pid" + + echo "$(date '+%Y-%m-%d %H:%M:%S') 开始监控进程 $training_pid 和日志文件 $log_file..." + + while true; do + sleep 5 # 每隔 5 秒检查一次日志文件 + + # 判断日志文件是否存在 + if [ ! -f "$log_file" ]; then + echo "日志文件 $log_file 不存在,检查进程状态..." + # 如果日志文件不存在,直接判断进程是否结束 + if ! ps -p $training_pid > /dev/null; then + echo "$(date '+%Y-%m-%d %H:%M:%S') 进程 $training_pid 已经结束。" + break + fi + continue # 如果文件不存在,跳过后续逻辑,继续循环 + fi + + # 获取当前日志文件的大小 + new_size=$(stat -c %s "$log_file") + + if [ "$last_size" -eq "$new_size" ]; then + # 文件大小未变化,增加无更新时长计数 + no_update_duration=$((no_update_duration + 5)) + echo "$(date '+%Y-%m-%d %H:%M:%S') 文件未写入..." + if [ "$no_update_duration" -ge 180 ]; then + echo "$(date '+%Y-%m-%d %H:%M:%S') 文件在过去的 3 分钟内没有继续写入,准备杀掉进程 $training_pid." + # 创建标志文件 + touch "$kill_flag_file" + ls -l "$kill_flag_file" + kill -9 $training_pid # 杀掉进程 + echo "$(date '+%Y-%m-%d %H:%M:%S') 进程 $training_pid 已经被杀掉。" + break + fi + else + # 文件大小有变化,重置无更新时长计数 + echo "$(date '+%Y-%m-%d %H:%M:%S') 文件仍在写入..." + no_update_duration=0 + last_size=$new_size + fi + + # 如果训练进程已经结束,退出监控 + if ! 
ps -p $training_pid > /dev/null; then + echo "$(date '+%Y-%m-%d %H:%M:%S') 进程 $training_pid 已经结束。" + break + fi + done +} + function _train(){ batch_size=${per_device_train_batch_size} # 如果模型跑多卡单进程时,请在_train函数中计算出多卡需要的bs @@ -134,16 +189,43 @@ function _train(){ rm -rf ./auto_config_${MODEL_TYPE}/*csv rm -rf ./auto_config_${MODEL_TYPE}/best_* rm -rf mylog && rm -rf checkpoints - + echo "train_cmd: ${train_cmd} log_file: ${log_file}" - timeout 15m ${train_cmd} > ${log_file} 2>&1 + timeout 40m ${train_cmd} > ${log_file} 2>&1 & + training_pid=$! # 获取后台进程的 PID + + # 监控进程和日志的更新状态 + monitor_log_file "$log_file" "$training_pid" & + monitor_log_file_pid=$! # 获取日志监控进程的 PID - if [ $? -ne 0 ];then + # 等待训练进程完成 + wait $training_pid + exit_code=$? + + # 获取训练进程的退出码 + echo "训练进程 $training_pid 的退出码是 $exit_code" + + # 清理后台日志监控进程 + kill $monitor_log_file_pid + + + if [ ${exit_code} -ne 0 ];then echo -e "${model_name}, FAIL" + # 如果程序是主动报错退出,不是monitor_log_file函数kill掉的情况下,需要等待其它机器被kill + # 标志文件位置 + kill_flag_file="/tmp/monitor_killed_$training_pid" + if [ -f "$kill_flag_file" ]; then + echo "$(date '+%Y-%m-%d %H:%M:%S') 训练进程 $training_pid 是被 monitor_log_file 函数杀掉的。" + rm -f "$kill_flag_file" # 清理标志文件 + else + echo "$(date '+%Y-%m-%d %H:%M:%S') 训练进程 $training_pid 是主动报错退出的。" + sleep 120 + fi else echo -e "${model_name}, SUCCESS" fi + #kill -9 `ps -ef|grep 'python'|awk '{print $2}'` if [ ${device_num} != "N1C1" -a -d ./auto_config_${MODEL_TYPE}/best_cfg ]; then case_path=$PWD && cd - && mkdir -p mylog # PaddleNLP/tests/mylog diff --git a/tests/test_tipc/llm/llama2/benchmark_common/benchmark_json/llama2-70b/dpo.json b/tests/test_tipc/llm/llama2/benchmark_common/benchmark_json/llama2-70b/dpo.json index c95540903f4e..979009513c98 100644 --- a/tests/test_tipc/llm/llama2/benchmark_common/benchmark_json/llama2-70b/dpo.json +++ b/tests/test_tipc/llm/llama2/benchmark_common/benchmark_json/llama2-70b/dpo.json @@ -15,7 +15,7 @@ "max_seq_len": 4096, "max_prompt_len": 2048, "pipeline_parallel_config": "disable_partial_send_recv enable_clear_every_step_cache", - "sequence_parallel": 1, + "sequence_parallel": 0, "bf16": true, "fp16_opt_level": "O2", "do_train": true, diff --git a/tests/test_tipc/llm/llama2/benchmark_common/prepare.sh b/tests/test_tipc/llm/llama2/benchmark_common/prepare.sh index a1b917731589..ccfd8b76b67e 100644 --- a/tests/test_tipc/llm/llama2/benchmark_common/prepare.sh +++ b/tests/test_tipc/llm/llama2/benchmark_common/prepare.sh @@ -24,9 +24,9 @@ python setup.py install cd - # install paddlenlp_ops -cd ../csrc/ -python setup_cuda.py install -cd - +# cd ../csrc/ +# python setup_cuda.py install +# cd - cd ../llm cp -r ../tests/test_tipc/llm/llama2/benchmark_common/benchmark_json ./ diff --git a/tests/test_tipc/llm/llama2/benchmark_common/run_benchmark.sh b/tests/test_tipc/llm/llama2/benchmark_common/run_benchmark.sh index 9e0c259520b2..32a03ee11a55 100644 --- a/tests/test_tipc/llm/llama2/benchmark_common/run_benchmark.sh +++ b/tests/test_tipc/llm/llama2/benchmark_common/run_benchmark.sh @@ -36,7 +36,7 @@ function _set_params(){ skip_steps=0 # (必选)解析日志,跳过模型前几个性能不稳定的step keyword="Effective_Tokens_per_second_per_gpu:" # (必选)解析日志,筛选出性能数据所在行的关键字 is_large_model=True # (可选)普通模型默认为False,如果添加大模型且只取一条ips设置为True - convergence_key="loss:" # (可选)解析日志,筛选出收敛数据所在行的关键字 如:convergence_key="loss:" + convergence_key="Total_Tokens_per_second_per_gpu:" # (可选)解析日志,筛选出收敛数据所在行的关键字 如:convergence_key="loss:" fp_item="bf16" # 以下为通用执行命令,无特殊可不用修改 @@ -105,18 +105,25 @@ function _train(){ ;; esac cd ../llm/ + export no_proxy=bcebos.com echo 
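Note on the `monitor_log_file` watchdog added above: it polls the training log every few seconds, kills the training process if the file stops growing for the configured window, and drops a `/tmp/monitor_killed_<pid>` marker so the caller can tell a hang-kill apart from an ordinary failure. For reference, the same pattern expressed in Python (paths and timeouts are placeholders):

```python
# Python sketch of the shell watchdog above: kill a training process whose log
# file has stopped growing. Timeouts and the flag-file location are placeholders.
import os
import signal
import time


def monitor_log_file(log_file, training_pid, stall_limit=180, poll=5):
    last_size, stalled = 0, 0
    flag_file = f"/tmp/monitor_killed_{training_pid}"
    while True:
        time.sleep(poll)
        try:
            os.kill(training_pid, 0)            # raises if the process has exited
        except ProcessLookupError:
            return                              # training finished on its own
        if not os.path.exists(log_file):
            continue                            # log not created yet, keep waiting
        size = os.path.getsize(log_file)
        if size == last_size:
            stalled += poll
            if stalled >= stall_limit:          # no new output for too long: assume a hang
                open(flag_file, "w").close()    # marker tells the caller who did the kill
                os.kill(training_pid, signal.SIGKILL)
                return
        else:
            stalled, last_size = 0, size        # file grew, reset the stall counter
```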
"train_cmd: ${train_cmd} log_file: ${log_file}" python -c "import paddlenlp" if [[ ${model_name_or_path} =~ "CE" ]];then # CE精度-不限制执行时间 ${train_cmd} > ${log_file} 2>&1 else - timeout 30m ${train_cmd} > ${log_file} 2>&1 + timeout 60m ${train_cmd} > ${log_file} 2>&1 # echo ${train_cmd} Effective_Tokens_per_second=`cat ${log_file} | grep -E 'Effective_Tokens_per_second|Effective tokens per second:' \ |awk -F': ' '{print $2}' |awk -F' ' '{print $1}'` num_gpu=$(echo "$device_num" | sed 's/^.*C//') - ips=$(awk -v a="$Effective_Tokens_per_second" -v b="$num_gpu" 'BEGIN {printf "%.2f\n", a / b}') - echo "Effective_Tokens_per_second_per_gpu: ${ips}" >> ${log_file} + Effective_Tokens_per_second_per_gpu=$(awk -v a="$Effective_Tokens_per_second" -v b="$num_gpu" 'BEGIN {printf "%.2f\n", a / b}') + echo "Effective_Tokens_per_second_per_gpu: ${Effective_Tokens_per_second_per_gpu}" >> ${log_file} + Train_samples_per_second=`cat ${log_file} | grep 'train_samples_per_second' \ + |awk -F'train_samples_per_second: ' '{print $2}' |awk -F', ' '{print $1}'` + length=4096 + Total_Tokens_per_second=$(awk -v a="$Train_samples_per_second" -v b="$length" 'BEGIN {printf "%.2f\n", a * b}') + Total_Tokens_per_second_per_gpu=$(awk -v a="$Total_Tokens_per_second" -v b="$num_gpu" 'BEGIN {printf "%.2f\n", a / b}') + echo "Total_Tokens_per_second_per_gpu: ${Total_Tokens_per_second_per_gpu}" >> ${log_file} fi if [ $? -ne 0 ];then echo -e "${model_name}, FAIL" diff --git a/tests/test_tipc/llm/qwen2_5/benchmark_common/benchmark_json/qwen-qwen2_5-72b/dpo.json b/tests/test_tipc/llm/qwen2_5/benchmark_common/benchmark_json/qwen-qwen2_5-72b/dpo.json index 0972a78141c0..e78e43649fff 100644 --- a/tests/test_tipc/llm/qwen2_5/benchmark_common/benchmark_json/qwen-qwen2_5-72b/dpo.json +++ b/tests/test_tipc/llm/qwen2_5/benchmark_common/benchmark_json/qwen-qwen2_5-72b/dpo.json @@ -15,7 +15,7 @@ "max_seq_len": 4096, "max_prompt_len": 2048, "pipeline_parallel_config": "disable_partial_send_recv enable_clear_every_step_cache", - "sequence_parallel": 1, + "sequence_parallel": 0, "bf16": true, "fp16_opt_level": "O2", "do_train": true, diff --git a/tests/test_tipc/llm/qwen2_5/benchmark_common/prepare.sh b/tests/test_tipc/llm/qwen2_5/benchmark_common/prepare.sh index 416ded186efb..92d9f0a5061f 100644 --- a/tests/test_tipc/llm/qwen2_5/benchmark_common/prepare.sh +++ b/tests/test_tipc/llm/qwen2_5/benchmark_common/prepare.sh @@ -24,9 +24,9 @@ python setup.py install cd - # install paddlenlp_ops -cd ../csrc/ -python setup_cuda.py install -cd - +# cd ../csrc/ +# python setup_cuda.py install +# cd - cd ../llm cp -r ../tests/test_tipc/llm/qwen2_5/benchmark_common/benchmark_json ./ diff --git a/tests/test_tipc/llm/qwen2_5/benchmark_common/run_benchmark.sh b/tests/test_tipc/llm/qwen2_5/benchmark_common/run_benchmark.sh index 9e0c259520b2..32a03ee11a55 100644 --- a/tests/test_tipc/llm/qwen2_5/benchmark_common/run_benchmark.sh +++ b/tests/test_tipc/llm/qwen2_5/benchmark_common/run_benchmark.sh @@ -36,7 +36,7 @@ function _set_params(){ skip_steps=0 # (必选)解析日志,跳过模型前几个性能不稳定的step keyword="Effective_Tokens_per_second_per_gpu:" # (必选)解析日志,筛选出性能数据所在行的关键字 is_large_model=True # (可选)普通模型默认为False,如果添加大模型且只取一条ips设置为True - convergence_key="loss:" # (可选)解析日志,筛选出收敛数据所在行的关键字 如:convergence_key="loss:" + convergence_key="Total_Tokens_per_second_per_gpu:" # (可选)解析日志,筛选出收敛数据所在行的关键字 如:convergence_key="loss:" fp_item="bf16" # 以下为通用执行命令,无特殊可不用修改 @@ -105,18 +105,25 @@ function _train(){ ;; esac cd ../llm/ + export no_proxy=bcebos.com echo "train_cmd: ${train_cmd} log_file: 
${log_file}" python -c "import paddlenlp" if [[ ${model_name_or_path} =~ "CE" ]];then # CE精度-不限制执行时间 ${train_cmd} > ${log_file} 2>&1 else - timeout 30m ${train_cmd} > ${log_file} 2>&1 + timeout 60m ${train_cmd} > ${log_file} 2>&1 # echo ${train_cmd} Effective_Tokens_per_second=`cat ${log_file} | grep -E 'Effective_Tokens_per_second|Effective tokens per second:' \ |awk -F': ' '{print $2}' |awk -F' ' '{print $1}'` num_gpu=$(echo "$device_num" | sed 's/^.*C//') - ips=$(awk -v a="$Effective_Tokens_per_second" -v b="$num_gpu" 'BEGIN {printf "%.2f\n", a / b}') - echo "Effective_Tokens_per_second_per_gpu: ${ips}" >> ${log_file} + Effective_Tokens_per_second_per_gpu=$(awk -v a="$Effective_Tokens_per_second" -v b="$num_gpu" 'BEGIN {printf "%.2f\n", a / b}') + echo "Effective_Tokens_per_second_per_gpu: ${Effective_Tokens_per_second_per_gpu}" >> ${log_file} + Train_samples_per_second=`cat ${log_file} | grep 'train_samples_per_second' \ + |awk -F'train_samples_per_second: ' '{print $2}' |awk -F', ' '{print $1}'` + length=4096 + Total_Tokens_per_second=$(awk -v a="$Train_samples_per_second" -v b="$length" 'BEGIN {printf "%.2f\n", a * b}') + Total_Tokens_per_second_per_gpu=$(awk -v a="$Total_Tokens_per_second" -v b="$num_gpu" 'BEGIN {printf "%.2f\n", a / b}') + echo "Total_Tokens_per_second_per_gpu: ${Total_Tokens_per_second_per_gpu}" >> ${log_file} fi if [ $? -ne 0 ];then echo -e "${model_name}, FAIL" diff --git a/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP2_1F1B_Sharding4_Stage1.sh b/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP1_Sharding8_Stage1.sh similarity index 91% rename from tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP2_1F1B_Sharding4_Stage1.sh rename to tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP1_Sharding8_Stage1.sh index ee1ebfa5c8de..80b151dc2658 100644 --- a/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs128_bf16_DP1_MP4_PP2_1F1B_Sharding4_Stage1.sh +++ b/tests/test_tipc/static/auto_parallel/baichuan2/N4C32/baichuan-inc-baichuan-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP1_Sharding8_Stage1.sh @@ -13,9 +13,9 @@ # limitations under the License. 
param="model_item=baichuan-inc-baichuan-2-13b_pretrain_dy2st " -param+="run_mode=DP1_MP4_PP2_1F1B_Sharding4_Stage1 " +param+="run_mode=DP1_MP4_PP1_Sharding8_Stage1 " param+="device_num=N4C32 " -param+="global_batch_size=128 " +param+="global_batch_size=32 " param+="nnodes=4 " param+="model_type=baichuan2_13b " diff --git a/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json b/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json index c1f36c5a0ff8..1c95ef419434 100644 --- a/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json +++ b/tests/test_tipc/static/auto_parallel/baichuan2/pretrain_config_baichuan2_13b/pretrain-baichuan2_13b.json @@ -5,17 +5,16 @@ "output_dir": "./checkpoints/baichuan2_13b_ckpts", "split": "949,50,1", "to_static": true, - "pipeline_parallel_degree": 2, + "pipeline_parallel_degree": 1, "tensor_parallel_degree": 4, - "virtual_pp_degree": 2, - "pipeline_schedule_mode": "1F1B", + "virtual_pp_degree": 1, "weight_decay": 0.01, "warmup_ratio": 0.01, - "max_grad_norm": 0.0, + "max_grad_norm": 1.0, "learning_rate": 0.00003, "min_learning_rate": 0.000003, - "max_steps": 100, - "logging_steps": 1, + "max_steps": 200, + "logging_steps": 5, "eval_steps": 10000, "save_steps": 1000, "continue_training": 0, @@ -25,11 +24,11 @@ "disable_tqdm": true, "save_total_limit": 2, "device": "gpu", - "dataloader_num_workers": 4, + "dataloader_num_workers": 1, "distributed_dataloader": 0, "enable_auto_parallel": 1, - "per_device_train_batch_size": 1, - "gradient_accumulation_steps": 32, + "per_device_train_batch_size": 2, + "gradient_accumulation_steps": 2, "per_device_eval_batch_size": 1, "recompute": false, "recompute_use_reentrant": true, @@ -46,9 +45,10 @@ "use_fused_rope": true, "use_fused_rms_norm": true, "max_seq_length": 4096, - "sequence_parallel": false, + "sequence_parallel": 1, "sharding": "stage1", - "sharding_parallel_config": "enable_stage1_tensor_fusion enable_stage1_overlap", + "sharding_parallel_degree": 8, + "sharding_parallel_config": "enable_tensor_fusion enable_overlap", "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", "pipeline_parallel_config": "enable_send_recv_overlap enable_split_backward" diff --git a/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh b/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP4_MP2_PP4.sh similarity index 94% rename from tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh rename to tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP4_MP2_PP4.sh index 103632adc2ef..0728d03885a4 100644 --- a/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh +++ b/tests/test_tipc/static/auto_parallel/gpt3/N4C32/gpt-3-13b_pretrain_dy2st_bs128_bf16_DP4_MP2_PP4.sh @@ -13,7 +13,7 @@ # limitations under the License. 
param="model_item=gpt-3-13b_pretrain_dy2st " -param+="run_mode=DP1_MP2_PP4_1F1B_Sharding4_Stage1 " +param+="run_mode=DP4_MP2_PP4 " param+="device_num=N4C32 " param+="global_batch_size=128 " param+="nnodes=4 " diff --git a/tests/test_tipc/static/auto_parallel/gpt3/benchmark_common/run_benchmark.sh b/tests/test_tipc/static/auto_parallel/gpt3/benchmark_common/run_benchmark.sh index 9eb16e442663..70f9db2e9faf 100644 --- a/tests/test_tipc/static/auto_parallel/gpt3/benchmark_common/run_benchmark.sh +++ b/tests/test_tipc/static/auto_parallel/gpt3/benchmark_common/run_benchmark.sh @@ -88,7 +88,7 @@ monitor_log_file() { # 文件大小未变化,增加无更新时长计数 no_update_duration=$((no_update_duration + 5)) echo "$(date '+%Y-%m-%d %H:%M:%S') 文件未写入..." - if [ "$no_update_duration" -ge 180 ]; then + if [ "$no_update_duration" -ge 900 ]; then echo "$(date '+%Y-%m-%d %H:%M:%S') 文件在过去的 3 分钟内没有继续写入,准备杀掉进程 $training_pid." # 创建标志文件 touch "$kill_flag_file" diff --git a/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json b/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json index 6f6572c5e0b4..c80f7403fe59 100644 --- a/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json +++ b/tests/test_tipc/static/auto_parallel/gpt3/pretrain_config_gpt3_13b/pretrain-gpt3_13b.json @@ -7,18 +7,20 @@ "output_dir": "./checkpoints/gpt_pretrain_ckpts", "split": "949,50,1", "max_seq_length": 4096, + "tensor_parallel_degree": 2, + "pipeline_parallel_degree": 4, "per_device_train_batch_size": 1, "per_device_eval_batch_size": 1, "scale_loss": 1024, "learning_rate": 0.00001, "min_learning_rate": 0.000001, - "max_steps": 100, + "max_steps": 500, "save_steps": 50000, "weight_decay": 0.01, "warmup_ratio": 0.01, - "logging_steps": 1, + "logging_steps": 5, "continue_training": 0, - "dataloader_num_workers": 4, + "dataloader_num_workers": 1, "eval_steps": 100000, "report_to": "visualdl", "disable_tqdm": true, @@ -26,14 +28,9 @@ "do_eval": true, "device": "gpu", "model_type": "gpt", - "sharding": "stage1", - "tensor_parallel_degree": 2, - "pipeline_parallel_degree": 4, - "virtual_pp_degree": 2, - "pipeline_schedule_mode": "1F1B", - "virtual_pipeline_seg_method": "GPTDecoderLayerAuto", - "sequence_parallel": 0, + "sequence_parallel": 1, "use_flash_attention": 1, + "use_fast_layer_norm": 1, "fused_linear": 1, "fuse_attention_ffn": 1, "fuse_attention_qkv": 1, @@ -45,13 +42,12 @@ "recompute_granularity": "full", "pp_recompute_interval": 1, "gradient_accumulation_steps": 32, - "max_grad_norm": 0.1, + "max_grad_norm": 1.0, "bf16": 1, "fp16_opt_level": "O2", "amp_master_grad": true, "attention_probs_dropout_prob": 0.1, "hidden_dropout_prob": 0.1, - "sharding_parallel_config": "enable_stage1_tensor_fusion enable_stage1_overlap", "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", "pipeline_parallel_config": "enable_send_recv_overlap enable_split_backward" diff --git a/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP1_PP4_VPP5_Sharding8_Stage2.sh b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP1_PP4_VPP5_Sharding8_Stage2.sh new file mode 100644 index 000000000000..5844f90fd419 --- /dev/null +++ 
b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st_bs32_bf16_DP1_MP1_PP4_VPP5_Sharding8_Stage2.sh @@ -0,0 +1,27 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +param="model_item=intermediate_api_meta-llama-Llama-2-13b_pretrain_dy2st " +param+="run_mode=DP1_MP1_PP4_VPP5_Sharding8_Stage2 " +param+="device_num=N4C32 " +param+="global_batch_size=32 " +param+="nnodes=4 " +param+="model_type=llama2_13b " +param+='intermediate_api=intermediate_api_ ' + + +cd ./tests +bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/prepare.sh + +bash -c "${param} bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh" diff --git a/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh new file mode 100644 index 000000000000..4ae528040fbb --- /dev/null +++ b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh @@ -0,0 +1,28 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +param="model_item=intermediate_api_meta-llama-Llama-2-70b_pretrain_dy2st " +param+="run_mode=DP1_MP8_PP4_VPP5 " +param+="device_num=N4C32 " +param+="global_batch_size=32 " +param+="nnodes=4 " +param+="model_type=llama2_70b " +param+='intermediate_api=intermediate_api_ ' + + +cd ./tests +bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/prepare.sh + +bash -c "${param} bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh" + diff --git a/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st_bs32_bf16_Sharding32_Stage2.sh b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st_bs32_bf16_Sharding32_Stage2.sh new file mode 100644 index 000000000000..1cdc5ebb4992 --- /dev/null +++ b/tests/test_tipc/static/auto_parallel/llama2/N4C32/intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st_bs32_bf16_Sharding32_Stage2.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +param="model_item=intermediate_api_meta-llama-Llama-2-7b_pretrain_dy2st " +param+="run_mode=Sharding32_Stage2 " +param+="device_num=N4C32 " +param+="global_batch_size=32 " +param+="nnodes=4 " +param+="model_type=llama2_7b " +param+='intermediate_api=intermediate_api_ ' + +cd ./tests +bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/prepare.sh + +bash -c "${param} bash ./test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh" diff --git a/tests/test_tipc/static/auto_parallel/llama2/N4C32/meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP8_VPP5.sh b/tests/test_tipc/static/auto_parallel/llama2/N4C32/meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh similarity index 96% rename from tests/test_tipc/static/auto_parallel/llama2/N4C32/meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP8_VPP5.sh rename to tests/test_tipc/static/auto_parallel/llama2/N4C32/meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh index 6bdb2beb6cf9..e7378ae51bf0 100644 --- a/tests/test_tipc/static/auto_parallel/llama2/N4C32/meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP4_PP8_VPP5.sh +++ b/tests/test_tipc/static/auto_parallel/llama2/N4C32/meta-llama-Llama-2-70b_pretrain_dy2st_bs32_bf16_DP1_MP8_PP4_VPP5.sh @@ -13,7 +13,7 @@ # limitations under the License. 
param="model_item=meta-llama-Llama-2-70b_pretrain_dy2st " -param+="run_mode=DP1_MP4_PP8_VPP5 " +param+="run_mode=DP1_MP8_PP4_VPP5 " param+="device_num=N4C32 " param+="global_batch_size=32 " param+="nnodes=4 " diff --git a/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh b/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh index 88b326057402..0a69e3cf54d9 100644 --- a/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh +++ b/tests/test_tipc/static/auto_parallel/llama2/benchmark_common/run_benchmark.sh @@ -24,6 +24,10 @@ function _set_params(){ fp_item="bf16" MODEL_TYPE=${model_type:-"llama2_7b"} + # for intermediate api + intermediate_api=${intermediate_api:-""} + + ip_lists=($(echo $TRAINER_INSTANCES | tr ',' ' ')) master_ip=${ip_lists[0]} nnodes=${nnodes:-1} @@ -174,17 +178,17 @@ function _train(){ train_cmd="python -u -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 \ --nnodes 1 --nproc_per_node 8 \ --log_dir mylog run_pretrain_auto.py \ - ./pretrain_config_${MODEL_TYPE}/pretrain-${MODEL_TYPE}.json" + ./pretrain_config_${MODEL_TYPE}/${intermediate_api}pretrain-${MODEL_TYPE}.json" ;; N4C32) echo "Run with: device_num=${device_num} run_mode=${run_mode}" train_cmd="python -u -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 \ --log_dir mylog run_pretrain_auto.py \ - ./pretrain_config_${MODEL_TYPE}/pretrain-${MODEL_TYPE}.json" + ./pretrain_config_${MODEL_TYPE}/${intermediate_api}pretrain-${MODEL_TYPE}.json" ;; *) echo "Run with: device_num=${device_num}, run_mode=${run_mode}" train_cmd="python -u -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 \ --log_dir mylog run_pretrain_auto.py \ - ./pretrain_config_${MODEL_TYPE}/pretrain-${MODEL_TYPE}.json" + ./pretrain_config_${MODEL_TYPE}/${intermediate_api}pretrain-${MODEL_TYPE}.json" ;; esac cd ../llm/auto_parallel/llama diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api-llama2_13b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api_pretrain-llama2_13b.json similarity index 95% rename from tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api-llama2_13b.json rename to tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api_pretrain-llama2_13b.json index 81b964a5e322..1582a6d30404 100644 --- a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api-llama2_13b.json +++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/intermediate_api_pretrain-llama2_13b.json @@ -10,7 +10,7 @@ "pipeline_parallel_degree": 4, "sharding": "stage1", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", - "sharding_parallel_config": "enable_stage1_overlap", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", "tensor_parallel_config": "enable_mp_async_allreduce", "pipeline_parallel_config": "enable_send_recv_overlap enable_split_backward", "pipeline_schedule_mode": "VPP", @@ -28,7 +28,7 @@ "min_learning_rate": 3e-06, "warmup_steps": 30, "logging_steps": 10, - "max_steps": 100, + "max_steps": 500, "save_steps": 5000, "eval_steps": 1000, "weight_decay": 0.01, diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/pretrain-llama2_13b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/pretrain-llama2_13b.json index 5df3186af3b5..aa86a1875597 100644 --- 
a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/pretrain-llama2_13b.json +++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_13b/pretrain-llama2_13b.json @@ -10,7 +10,7 @@ "pipeline_parallel_degree": 4, "sharding": "stage1", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", - "sharding_parallel_config": "enable_stage1_overlap", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", "tensor_parallel_config": "enable_mp_async_allreduce", "pipeline_parallel_config": "enable_send_recv_overlap enable_split_backward", "pipeline_schedule_mode": "VPP", diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/intermediate_api_pretrain-llama2_70b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/intermediate_api_pretrain-llama2_70b.json new file mode 100644 index 000000000000..f927c165788b --- /dev/null +++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/intermediate_api_pretrain-llama2_70b.json @@ -0,0 +1,64 @@ +{ + "model_name_or_path": "meta-llama/Llama-2-70b", + "tokenizer_name_or_path": "meta-llama/Llama-2-70b", + "input_dir": "./data", + "output_dir": "./checkpoints/llama2_pretrain_ckpts", + "weight_decay": 0.01, + "warmup_ratio": 0.01, + "max_grad_norm": 1.0, + "learning_rate": 3e-05, + "min_learning_rate": 3e-06, + "warmup_steps": 30, + "logging_steps": 10, + "max_steps": 500, + "save_steps": 5000, + "eval_steps": 1000, + "continue_training": 0, + "do_train": true, + "do_eval": false, + "do_predict": false, + "disable_tqdm": true, + "skip_profile_timer": true, + "save_total_limit": 2, + "device": "gpu", + "dataloader_num_workers": 1, + "distributed_dataloader": 0, + "enable_auto_parallel": true, + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 32, + "per_device_eval_batch_size": 32, + "recompute": false, + "recompute_use_reentrant": true, + "recompute_granularity": "full", + "pp_recompute_interval": 0, + "bf16": true, + "fp16_opt_level": "O2", + "amp_master_grad": true, + "amp_custom_black_list": ["reduce_sum", "c_softmax_with_cross_entropy"], + "amp_custom_white_list": ["lookup_table", "lookup_table_v2"], + "fuse_attention_ffn": true, + "fuse_attention_qkv": true, + "use_fused_rope": true, + "fused_linear_param_grad_add": true, + "fuse_sequence_parallel_allreduce": false, + "use_flash_attention": true, + "use_fused_rms_norm": true, + "sep_parallel_degree": 1, + "sequence_parallel": true, + "pipeline_parallel_degree": 4, + "sharding_parallel_degree": 1, + "sharding": "stage1", + "tensor_parallel_degree": 8, + "virtual_pp_degree": 5, + "pipeline_schedule_mode": "VPP", + "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", + "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy", + "max_seq_length": 4096, + "to_static": true, + "eliminate_transpose": 1, + "fuse_allreduce_split_to_reducescatter": 1, + "sequence_parallel_config": "enable_allreduce_avg_in_gradinent_scale", + "model_type": "llama_network", + "use_intermediate_api": true +} diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/pretrain-llama2_70b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/pretrain-llama2_70b.json index 3c8faf175b3b..f16fdacea4b6 100644 --- 
a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/pretrain-llama2_70b.json +++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_70b/pretrain-llama2_70b.json @@ -45,14 +45,14 @@ "use_fused_rms_norm": true, "sep_parallel_degree": 1, "sequence_parallel": true, - "pipeline_parallel_degree": 8, + "pipeline_parallel_degree": 4, "sharding_parallel_degree": 1, "sharding": "stage1", - "tensor_parallel_degree": 4, + "tensor_parallel_degree": 8, "virtual_pp_degree": 5, "pipeline_schedule_mode": "VPP", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", - "sharding_parallel_config": "enable_stage1_overlap", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", "tensor_parallel_config": "enable_mp_async_allreduce replace_with_parallel_cross_entropy", "max_seq_length": 4096, "to_static": true, diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api-llama2_7b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api_pretrain-llama2_7b.json similarity index 96% rename from tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api-llama2_7b.json rename to tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api_pretrain-llama2_7b.json index a32e121c3039..147b70b392f9 100644 --- a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api-llama2_7b.json +++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/intermediate_api_pretrain-llama2_7b.json @@ -10,7 +10,7 @@ "pipeline_parallel_degree": 1, "sharding": "stage1", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", - "sharding_parallel_config": "enable_stage1_overlap", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", "tensor_parallel_config": "enable_mp_async_allreduce", "pipeline_parallel_config": "", "virtual_pp_degree": 1, diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/pretrain-llama2_7b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/pretrain-llama2_7b.json index d29b077f9ac4..8568e8a8451c 100644 --- a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/pretrain-llama2_7b.json +++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama2_7b/pretrain-llama2_7b.json @@ -10,7 +10,7 @@ "pipeline_parallel_degree": 1, "sharding": "stage1", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", - "sharding_parallel_config": "enable_stage1_overlap", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", "tensor_parallel_config": "enable_mp_async_allreduce", "pipeline_parallel_config": "", "virtual_pp_degree": 1, diff --git a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama3.1_8b/pretrain_llama3.1_8b.json b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama3.1_8b/pretrain_llama3.1_8b.json index 5a34d2b803d8..2e6565a96992 100644 --- a/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama3.1_8b/pretrain_llama3.1_8b.json +++ b/tests/test_tipc/static/auto_parallel/llama2/pretrain_config_llama3.1_8b/pretrain_llama3.1_8b.json @@ -10,7 +10,7 @@ "pipeline_parallel_degree": 1, "sharding": "stage1", "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", - "sharding_parallel_config": 
"enable_stage1_overlap", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", "tensor_parallel_config": "enable_mp_async_allreduce", "pipeline_parallel_config": "", "virtual_pp_degree": 1, diff --git a/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh b/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs32_bf16_DP1_MP2_Sharding16_Stage1.sh similarity index 91% rename from tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh rename to tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs32_bf16_DP1_MP2_Sharding16_Stage1.sh index 50d990884957..b2e775139346 100644 --- a/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs128_bf16_DP1_MP2_PP4_1F1B_Sharding4_Stage1.sh +++ b/tests/test_tipc/static/auto_parallel/qwen/N4C32/qwen-14b_pretrain_dy2st_bs32_bf16_DP1_MP2_Sharding16_Stage1.sh @@ -13,9 +13,9 @@ # limitations under the License. param="model_item=qwen-14b_pretrain_dy2st " -param+="run_mode=DP1_MP2_PP4_1F1B_Sharding4_Stage1 " +param+="run_mode=DP1_MP2_Sharding16_Stage1 " param+="device_num=N4C32 " -param+="global_batch_size=128 " +param+="global_batch_size=32 " param+="nnodes=4 " param+="model_type=qwen_14b " diff --git a/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json b/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json index 19ae4ffaf0c6..1950b15933ec 100644 --- a/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json +++ b/tests/test_tipc/static/auto_parallel/qwen/pretrain_config_qwen_14b/pretrain-qwen_14b.json @@ -4,28 +4,29 @@ "input_dir": "./data", "output_dir": "./checkpoints/qwen_pretrain_ckpts", "per_device_train_batch_size": 1, - "gradient_accumulation_steps": 32, - "per_device_eval_batch_size": 16, + "gradient_accumulation_steps": 2, + "per_device_eval_batch_size": 1, "sharding": "stage1", "tensor_parallel_degree": 2, - "pipeline_parallel_degree": 4, - "virtual_pp_degree": 5, - "pipeline_schedule_mode": "1F1B", + "pipeline_parallel_degree": 1, + "sharding_parallel_degree": 16, + "sequence_parallel": 1, + "virtual_pp_degree": 1, "virtual_pipeline_seg_method": "QWenBlockAuto", "use_flash_attention": true, - "use_fused_rms_norm": false, + "use_fused_rms_norm": true, "use_fused_rope": true, "fused_linear": 1, "fuse_attention_ffn": 1, "fuse_attention_qkv": 1, "fused_linear_param_grad_add": 1, "max_seq_length": 4096, - "learning_rate": 0.00003, - "min_learning_rate": 0.000003, + "learning_rate": 1e-05, + "min_learning_rate": 5e-06, "scale_loss": 1024, "warmup_steps": 30, - "logging_steps": 1, - "max_steps": 100, + "logging_steps": 5, + "max_steps": 200, "save_steps": 1000, "eval_steps": 10000, "weight_decay": 0.01, @@ -33,8 +34,8 @@ "fp16_opt_level": "O2", "amp_master_grad": true, "warmup_ratio": 0.01, - "max_grad_norm": 0.0, - "dataloader_num_workers": 4, + "max_grad_norm": 1.0, + "dataloader_num_workers": 1, "continue_training": 0, "do_train": true, "do_eval": false, @@ -47,9 +48,8 @@ "save_total_limit": 2, "enable_auto_parallel": 1, "to_static": 1, - "auto_parallel_resume_form_hybrid_parallel": true, "data_parallel_config": "enable_allreduce_avg_in_gradinent_scale gradient_sync_after_accumulate", - "sharding_parallel_config": "enable_stage1_overlap", + "sharding_parallel_config": "enable_overlap enable_tensor_fusion", "tensor_parallel_config": 
"enable_mp_async_allreduce replace_with_parallel_cross_entropy", "pipeline_parallel_config": "enable_send_recv_overlap enable_split_backward" } \ No newline at end of file diff --git a/tests/testing_utils.py b/tests/testing_utils.py index dbcfda9fe38b..59007307133b 100644 --- a/tests/testing_utils.py +++ b/tests/testing_utils.py @@ -511,7 +511,7 @@ def require_paddle_up_to_2_gpus(test_case): def require_gpu(min_gpus: int = 1): def actual_decorator(func): gpu_count = paddle.device.cuda.device_count() - + print("gpu count: ", gpu_count) if gpu_count < min_gpus: return unittest.skip(f"test requires {min_gpus} GPUs")(func) diff --git a/tests/trainer/test_moe_unified_checkpoint.py b/tests/trainer/test_moe_unified_checkpoint.py new file mode 100644 index 000000000000..618e2b2f3daf --- /dev/null +++ b/tests/trainer/test_moe_unified_checkpoint.py @@ -0,0 +1,176 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import numpy as np +import pytest + +from paddlenlp.utils.downloader import get_path_from_url_with_filelock +from tests.parallel_launch import TestMultipleGpus +from tests.testing_utils import require_paddle_at_least_8_gpu, skip_for_none_ce_case +from tests.trainer.test_unified_checkpoint import remove_ckpt, remove_logs +from tests.trainer.trainer_utils import get_pretrain_arguments + +environment_variables = { + "NCCL_ALGO": "Tree", + "NVIDIA_TF32_OVERRIDE": "0", + "NCCL_IB_TIMEOUT": "22", + "NCCL_DEBUG": "INFO", + "FLAGS_embedding_deterministic": "1", + "FLAGS_cudnn_deterministic": "1", + "Flags_mp_aysnc_allreduce": "1", + "Flags_skip_mp_c_identity": "1", + "FLAGS_shard_norm_align_dp": "0", + "FLAGS_shard_use_reduce": "1", + "test_ci_no_save_model": "1", +} + +moe_arguments = { + "model_name_or_path": "__internal_testing__/unified-ckpt-qwen2moe", + "dataset_name_or_path": "./unified_checkpoint/peft_input/data/", + "output_dir": "./unified_checkpoint/checkpoints/qwen2moe_sft_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps": 16, + "learning_rate": 3e-04, + "max_steps": 10, + "save_steps": 6, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "no", + "save_strategy": "steps", + "src_length": 1024, + "max_length": 2048, + "bf16": "true", + "fp16_opt_level": "O2", + "do_train": "true", + "do_eval": "false", + "disable_tqdm": "true", + "eval_with_do_generation": "false", + "recompute": "true", + "recompute_granularity": "full", + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "sharding": "", + "lora": "false", + "zero_padding": "false", + "use_flash_attention": "false", + "unified_checkpoint": 1, + "continue_training": 0, + "sequence_parallel": 0, +} + + +def check_acc(log_dir="log"): + file_path = os.path.join(log_dir, "workerlog.n0.c0") + cmd = "grep -a 'global_step: 10' " + file_path + " | awk -F ',' '{print $2}' | awk '{print $6}'" + import subprocess + + res = 
subprocess.check_output(cmd, shell=True, text=True) + res = [float(x) for x in res.split()] + + return res + + +seed = 2024 + +rng = np.random.default_rng(seed=seed) + + +@pytest.mark.xdist_group(name="UC") +class TestUnifiedCheckpointBase(TestMultipleGpus): + @classmethod + @property + def __test__(cls): + return cls != TestUnifiedCheckpointBase + + def setUp(self): + """ + 1. update runfirst and rerun to run defined different config + 2. update need_allclose to True if you want to check the result + 3. update rtol to the relative value you want to check + """ + + self.configs = get_pretrain_arguments(moe_arguments) + os.environ.update(environment_variables) + + file_ = "https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz" + input_dir = "unified_checkpoint/peft_input/" + os.makedirs(input_dir, exist_ok=True) + file_path = os.path.join(input_dir, "AdvertiseGen.tar.gz") + if not os.path.exists(file_path): + get_path_from_url_with_filelock(file_, root_dir=input_dir) + + self.need_allclose = True + self.rtol = 1e-7 + + self.run_file = "llm/run_finetune.py" + + def runfirst(self, train_args): + self.run_n1c8(self.run_file, **train_args) + + def rerun(self, train_args): + self.run_n1c8(self.run_file, **train_args) + + @require_paddle_at_least_8_gpu + def testTP4DP2(self): + remove_logs() + remove_ckpt(moe_arguments["output_dir"]) + + train_args = self.configs["TP4DP2"] + self.runfirst(train_args) + self.rerun(train_args) + + if self.need_allclose: + res = check_acc() + assert len(res) == 2 + np.testing.assert_allclose(res[0], res[1], self.rtol) + + @skip_for_none_ce_case + @require_paddle_at_least_8_gpu + def testTP2Sharding4(self): + remove_logs() + remove_ckpt(moe_arguments["output_dir"]) + + train_args = self.configs["TP2Sharding4"] + self.runfirst(train_args) + self.rerun(train_args) + + if self.need_allclose: + res = check_acc() + assert len(res) == 2 + np.testing.assert_allclose(res[0], res[1], self.rtol) + + +@pytest.mark.xdist_group(name="UC") +class TestUnifiedCheckpointFull(TestUnifiedCheckpointBase): + @skip_for_none_ce_case + @require_paddle_at_least_8_gpu + def testTP2Sharding4V2(self): + remove_logs() + remove_ckpt(moe_arguments["output_dir"]) + + train_args = self.configs["TP2Sharding4"] + train_args.update({"sharding_parallel_config": "split_param"}) + train_args.update({"amp_master_grad": True}) + self.runfirst(train_args) + self.rerun(train_args) + + if self.need_allclose: + res = check_acc() + assert len(res) == 2 + np.testing.assert_allclose(res[0], res[1], self.rtol) diff --git a/tests/trainer/trainer_utils.py b/tests/trainer/trainer_utils.py index ae9a40e61d59..cda374ce1c6a 100644 --- a/tests/trainer/trainer_utils.py +++ b/tests/trainer/trainer_utils.py @@ -141,6 +141,14 @@ def get_pretrain_arguments(pretrain_arguments): train_args["gradient_accumulation_steps"] = train_args["gradient_accumulation_steps"] // 8 configs["DP8"] = train_args + train_args = copy.deepcopy(pretrain_arguments) + train_args["tensor_parallel_degree"] = 2 + train_args["pipeline_parallel_degree"] = 1 + train_args["sharding_parallel_degree"] = 2 + train_args["sharding"] = "stage1" + train_args["gradient_accumulation_steps"] = train_args["gradient_accumulation_steps"] // 4 + configs["TP2DP2Sharding2"] = train_args + return configs diff --git a/tests/transformers/auto/test_tokenizer.py b/tests/transformers/auto/test_tokenizer.py index 1e47267f91a3..e36aebd072b6 100644 --- a/tests/transformers/auto/test_tokenizer.py +++ b/tests/transformers/auto/test_tokenizer.py @@ -124,8 +124,9 @@ def 
test_new_tokenizer_fast_registration(self): new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir, use_fast=True) self.assertIsInstance(new_tokenizer, CustomTokenizerFast) - new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir, use_fast=False) - self.assertIsInstance(new_tokenizer, CustomTokenizer) + # TODO: fix this test. Now keep loaded tokenizer type + # new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir, use_fast=False) + # self.assertIsInstance(new_tokenizer, CustomTokenizer) finally: if "custom" in CONFIG_MAPPING._extra_content: del CONFIG_MAPPING._extra_content["custom"] diff --git a/tests/transformers/llm_embed/__init__.py b/tests/transformers/llm_embed/__init__.py new file mode 100644 index 000000000000..a9cc79cc9d7f --- /dev/null +++ b/tests/transformers/llm_embed/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/transformers/llm_embed/test_modeling.py b/tests/transformers/llm_embed/test_modeling.py new file mode 100644 index 000000000000..80ca5542ee37 --- /dev/null +++ b/tests/transformers/llm_embed/test_modeling.py @@ -0,0 +1,47 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import gc +import unittest + +import paddle + +from paddlenlp.transformers import AutoTokenizer, BiEncoderModel + +from ...testing_utils import require_gpu + + +class BiEncoderModelIntegrationTest(unittest.TestCase): + @require_gpu(1) + def test_model_tiny_logits(self): + input_texts = [ + "This is a test", + "This is another test", + ] + + model_name_or_path = "BAAI/bge-large-en-v1.5" + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + model = BiEncoderModel(model_name_or_path=model_name_or_path, dtype="float16", tokenizer=tokenizer) + with paddle.no_grad(): + out = model.encode_sentences(sentences=input_texts) + + print(out) + """ + [[ 0.00674057 0.03396606 0.00722122 ... 0.01176453 0.00311279 -0.02825928] + [ 0.00708771 0.03982544 -0.00155735 ... 0.00658417 0.01318359 -0.03259277]] + """ + + del model + paddle.device.cuda.empty_cache() + gc.collect() diff --git a/tests/transformers/nv_embed/__init__.py b/tests/transformers/nv_embed/__init__.py new file mode 100644 index 000000000000..a9cc79cc9d7f --- /dev/null +++ b/tests/transformers/nv_embed/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/transformers/nv_embed/test_modeling.py b/tests/transformers/nv_embed/test_modeling.py new file mode 100644 index 000000000000..0718389f156d --- /dev/null +++ b/tests/transformers/nv_embed/test_modeling.py @@ -0,0 +1,69 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import gc +import unittest + +import paddle + +from paddlenlp.transformers import NVEncodeModel, PretrainedConfig + +from ...testing_utils import require_gpu + + +class NVEncodeModelIntegrationTest(unittest.TestCase): + @require_gpu(1) + def test_model_tiny_logits(self): + input_texts = [ + "This is a test", + "This is another test", + ] + + config = PretrainedConfig( + attention_dropout=0.0, + bos_token_id=1, + dtype="float16", + eos_token_id=2, + hidden_act="silu", + hidden_size=4096, + initializer_range=0.02, + intermediate_size=14336, + max_position_embeddings=32768, + num_attention_heads=32, + num_hidden_layers=32, + num_key_value_heads=8, + rms_norm_eps=1e-05, + rope_theta=10000.0, + sliding_window=4096, + tie_word_embeddings=False, + vocab_size=32000, + ) + model = NVEncodeModel( + config=config, + tokenizer_path="BAAI/bge-large-en-v1.5", + query_instruction="", + document_instruction="", + ) + with paddle.no_grad(): + out = model.encode_sentences(input_texts, instruction_len=0) + + print(out) + """ + [[-0.00473404 0.00711441 0.01237488 ... -0.00228691 -0.01416779 -0.00429535] + [-0.00343323 0.00911713 0.00894928 ... 
-0.00637054 -0.0165863 -0.00852966]] + """ + + del model + paddle.device.cuda.empty_cache() + gc.collect() diff --git a/tests/transformers/test_modeling_common.py b/tests/transformers/test_modeling_common.py index 51e8745fcb33..a32cc3bfcf26 100644 --- a/tests/transformers/test_modeling_common.py +++ b/tests/transformers/test_modeling_common.py @@ -38,7 +38,13 @@ from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer from paddlenlp.transformers.configuration_utils import PretrainedConfig from paddlenlp.transformers.model_utils import PretrainedModel -from paddlenlp.utils.env import CONFIG_NAME, LEGACY_CONFIG_NAME, MODEL_HOME +from paddlenlp.utils.env import ( + CONFIG_NAME, + LEGACY_CONFIG_NAME, + MODEL_HOME, + PADDLE_INFERENCE_MODEL_SUFFIX, + PADDLE_INFERENCE_WEIGHTS_SUFFIX, +) from ..testing_utils import slow @@ -968,11 +974,8 @@ def test_to_static_use_top_k(self): use_top_p=False, ), ) - if paddle.framework.use_pir_api(): - model_path = os.path.join(tempdir, "model.json") - else: - model_path = os.path.join(tempdir, "model.pdmodel") - params_path = os.path.join(tempdir, "model.pdiparams") + model_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_MODEL_SUFFIX}") + params_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}") config = paddle.inference.Config(model_path, params_path) config.disable_gpu() @@ -1040,11 +1043,8 @@ def test_to_static_use_top_p(self): ), ) - if paddle.framework.use_pir_api(): - model_path = os.path.join(tempdir, "model.json") - else: - model_path = os.path.join(tempdir, "model.pdmodel") - params_path = os.path.join(tempdir, "model.pdiparams") + model_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_MODEL_SUFFIX}") + params_path = os.path.join(tempdir, f"model{PADDLE_INFERENCE_WEIGHTS_SUFFIX}") config = paddle.inference.Config(model_path, params_path) config.disable_gpu() diff --git a/tests/transformers/xlm_roberta/__init__.py b/tests/transformers/xlm_roberta/__init__.py new file mode 100644 index 000000000000..a9cc79cc9d7f --- /dev/null +++ b/tests/transformers/xlm_roberta/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/transformers/xlm_roberta/test_modeling.py b/tests/transformers/xlm_roberta/test_modeling.py new file mode 100644 index 000000000000..e5bded622082 --- /dev/null +++ b/tests/transformers/xlm_roberta/test_modeling.py @@ -0,0 +1,453 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +import tempfile +import unittest + +import numpy as np +import paddle +from parameterized import parameterized_class + +from paddlenlp.transformers import ( + XLMRobertaConfig, + XLMRobertaForCausalLM, + XLMRobertaForMaskedLM, + XLMRobertaForMultipleChoice, + XLMRobertaForQuestionAnswering, + XLMRobertaForSequenceClassification, + XLMRobertaForTokenClassification, + XLMRobertaModel, + XLMRobertaPretrainedModel, +) + +from ...testing_utils import require_package, slow +from ..test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask + + +class XLMRobertaModelTester: + def __init__(self, parent: XLMRobertaModelTest): + self.parent: XLMRobertaModelTest = parent + self.batch_size = 13 + self.seq_length = 7 + self.is_training = True + self.use_input_mask = True + self.use_token_type_ids = True + self.use_labels = True + self.vocab_size = 99 + self.hidden_size = 32 + self.num_hidden_layers = 5 + self.num_attention_heads = 4 + self.intermediate_size = 37 + self.hidden_act = "gelu" + self.hidden_dropout_prob = 0.1 + self.attention_probs_dropout_prob = 0.1 + self.max_position_embeddings = 512 + self.type_vocab_size = 16 + self.type_sequence_label_size = 2 + self.initializer_range = 0.02 + self.layer_norm_eps = 1e-12 + self.pad_token_id = 1 + self.bos_token_id = 0 + self.eos_token_id = 2 + self.position_embedding_type = "absolute" + self.use_cache = True + self.classifier_dropout = None + self.num_labels = 2 + self.num_choices = 4 + self.dropout = 0.56 + self.scope = None + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = random_attention_mask([self.batch_size, self.seq_length]) + + token_type_ids = None + if self.use_token_type_ids: + token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) + + sequence_labels = None + token_labels = None + choice_labels = None + if self.parent.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + choice_labels = ids_tensor([self.batch_size], self.num_choices) + + config = self.get_config() + return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels + + def get_config(self): + return XLMRobertaConfig( + vocab_size=self.vocab_size, + hidden_size=self.hidden_size, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + hidden_act=self.hidden_act, + hidden_dropout_prob=self.hidden_dropout_prob, + attention_probs_dropout_prob=self.attention_probs_dropout_prob, + max_position_embeddings=self.max_position_embeddings, + type_vocab_size=self.type_vocab_size, + initializer_range=self.initializer_range, + layer_norm_eps=self.layer_norm_eps, + pad_token_id=self.pad_token_id, + bos_token_id=self.bos_token_id, + eos_token_id=self.eos_token_id, + position_embedding_type=self.position_embedding_type, + use_cache=self.use_cache, + classifier_dropout=self.classifier_dropout, + num_labels=self.num_labels, + ) + + def prepare_config_and_inputs_for_decoder(self): + ( + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ) = self.prepare_config_and_inputs() + + return ( + 
config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ) + + def create_and_check_model( + self, + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ): + + model = XLMRobertaModel(config) + model.eval() + + result = model( + input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, return_dict=self.parent.return_dict + ) + result = model(input_ids, token_type_ids=token_type_ids, return_dict=self.parent.return_dict) + result = model(input_ids, return_dict=self.parent.return_dict) + + self.parent.assertEqual(result[0].shape, [self.batch_size, self.seq_length, self.hidden_size]) + self.parent.assertEqual(result[1].shape, [self.batch_size, self.hidden_size]) + + def create_and_check_for_causal_lm( + self, + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ): + model = XLMRobertaForCausalLM(config) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + labels=token_labels, + return_dict=self.parent.return_dict, + ) + if token_labels is not None: + result = result[1:] + elif paddle.is_tensor(result): + result = [result] + + self.parent.assertEqual(result[0].shape, [self.batch_size, self.seq_length, self.vocab_size]) + + def create_and_check_for_masked_lm( + self, + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ): + model = XLMRobertaForMaskedLM(config) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + labels=token_labels, + return_dict=self.parent.return_dict, + ) + + if token_labels is not None: + result = result[1:] + elif paddle.is_tensor(result): + result = [result] + + self.parent.assertEqual(result[0].shape, [self.batch_size, self.seq_length, self.vocab_size]) + + def create_and_check_for_token_classification( + self, + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ): + + model = XLMRobertaForTokenClassification(config) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + return_dict=self.parent.return_dict, + labels=token_labels, + ) + + if token_labels is not None: + result = result[1:] + elif paddle.is_tensor(result): + result = [result] + + self.parent.assertEqual(result[0].shape, [self.batch_size, self.seq_length, self.num_labels]) + + def create_and_check_for_sequence_classification( + self, + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ): + model = XLMRobertaForSequenceClassification(config) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + labels=sequence_labels, + return_dict=self.parent.return_dict, + ) + + if token_labels is not None: + result = result[1:] + elif paddle.is_tensor(result): + result = [result] + + self.parent.assertEqual(result[0].shape, [self.batch_size, self.num_labels]) + + def create_and_check_for_multiple_choice( + self, + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ): + + model = XLMRobertaForMultipleChoice(config) + model.eval() + multiple_choice_inputs_ids = input_ids.unsqueeze(1).expand([-1, self.num_choices, -1]) + multiple_choice_token_type_ids = token_type_ids.unsqueeze(1).expand([-1, self.num_choices, -1]) 
+ multiple_choice_input_mask = input_mask.unsqueeze(1).expand([-1, self.num_choices, -1]) + result = model( + multiple_choice_inputs_ids, + attention_mask=multiple_choice_input_mask, + token_type_ids=multiple_choice_token_type_ids, + return_dict=self.parent.return_dict, + labels=choice_labels, + ) + + if token_labels is not None: + result = result[1:] + elif paddle.is_tensor(result): + result = [result] + + self.parent.assertEqual(result[0].shape, [self.batch_size, self.num_choices]) + + def create_and_check_for_question_answering( + self, + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ): + + model = XLMRobertaForQuestionAnswering(config) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + return_dict=self.parent.return_dict, + start_positions=sequence_labels, + end_positions=sequence_labels, + ) + + if sequence_labels is not None: + start_logits, end_logits = result[1], result[2] + else: + start_logits, end_logits = result[0], result[1] + + self.parent.assertEqual(start_logits.shape, [self.batch_size, self.seq_length]) + self.parent.assertEqual(end_logits.shape, [self.batch_size, self.seq_length]) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + ( + config, + input_ids, + token_type_ids, + input_mask, + sequence_labels, + token_labels, + choice_labels, + ) = config_and_inputs + inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask} + return config, inputs_dict + + +@parameterized_class( + ("return_dict", "use_labels"), + [ + [False, False], + [False, True], + [True, False], + [True, True], + ], +) +class XLMRobertaModelTest(ModelTesterMixin, unittest.TestCase): + base_model_class = XLMRobertaModel + use_test_inputs_embeds: bool = False + return_dict: bool = False + use_labels: bool = False + test_tie_weights = True + + all_model_classes = ( + XLMRobertaForQuestionAnswering, + XLMRobertaForTokenClassification, + XLMRobertaForMultipleChoice, + XLMRobertaForSequenceClassification, + XLMRobertaForMaskedLM, + XLMRobertaForCausalLM, + ) + all_generative_model_classes = (XLMRobertaForCausalLM,) + + def setUp(self): + self.model_tester = XLMRobertaModelTester(self) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_for_causal_lm(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs_for_decoder() + self.model_tester.create_and_check_for_causal_lm(*config_and_inputs) + + def test_for_masked_lm(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_for_masked_lm(*config_and_inputs) + + def test_for_sequence_classification(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_for_sequence_classification(*config_and_inputs) + + def test_for_token_classification(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_for_token_classification(*config_and_inputs) + + def test_for_multiple_choice(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_for_multiple_choice(*config_and_inputs) + + def test_for_question_answering(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + 
self.model_tester.create_and_check_for_question_answering(*config_and_inputs) + + @slow + def test_model_from_pretrained(self): + for model_name in list(XLMRobertaPretrainedModel.pretrained_init_configuration.keys())[:1]: + model = XLMRobertaModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + +class XLMRobertaCompatibilityTest(unittest.TestCase): + test_model_id = "hf-internal-testing/tiny-random-onnx-xlm-roberta" + + @classmethod + @require_package("transformers", "torch") + def setUpClass(cls) -> None: + from transformers import XLMRobertaModel + + cls.torch_model_path = tempfile.TemporaryDirectory().name + model = XLMRobertaModel.from_pretrained(cls.test_model_id) + model.save_pretrained(cls.torch_model_path) + + @require_package("transformers", "torch") + def test_xlmroberta_model_converter(self): + with tempfile.TemporaryDirectory() as tempdir: + + # 1. create commmon input + input_ids = np.random.randint(100, 200, [1, 20]) + + # 2. forward the paddle model + from paddlenlp.transformers import XLMRobertaModel + + paddle_model = XLMRobertaModel.from_pretrained(self.test_model_id, from_hf_hub=False, cache_dir=tempdir) + paddle_model.eval() + paddle_logit = paddle_model(paddle.to_tensor(input_ids))[0] + + # 3. forward the torch model + import torch + from transformers import XLMRobertaModel + + torch_model = XLMRobertaModel.from_pretrained(self.torch_model_path) + torch_model.eval() + torch_logit = torch_model(torch.tensor(input_ids), return_dict=False)[0] + + self.assertTrue( + np.allclose( + paddle_logit.detach().cpu().reshape([-1])[:9].numpy(), + torch_logit.detach().cpu().reshape([-1])[:9].numpy(), + rtol=1e-4, + ) + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/transformers/xlm_roberta/test_tokenizer.py b/tests/transformers/xlm_roberta/test_tokenizer.py new file mode 100644 index 000000000000..a5dad1977828 --- /dev/null +++ b/tests/transformers/xlm_roberta/test_tokenizer.py @@ -0,0 +1,82 @@ +# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest + +from paddlenlp.transformers import XLMRobertaTokenizer + +from ..test_tokenizer_common import TokenizerTesterMixin + +# VOCAB_FILES_NAMES = XLMRobertaTokenizer.resource_files_names + + +class XLMRobertaTokenizationTest(TokenizerTesterMixin, unittest.TestCase): + test_offsets = False + tokenizer_class = XLMRobertaTokenizer + + # Set up method called before each test + def setUp(self): + super().setUp() + self.vocab_file = "BAAI/bge-m3" + self.special_tokens_map = {"unk_token": ""} + + # Method to get a tokenizer instance with specified keyword arguments + def get_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + return XLMRobertaTokenizer.from_pretrained(self.vocab_file, **kwargs) + + # Test method to check tokenization + def test_tokenization(self): + tokenizer = self.get_tokenizer() + text = "Hello, how are you?" 
+        tokens = tokenizer.tokenize(text)
+        self.assertIsInstance(tokens, list)
+        self.assertGreater(len(tokens), 0)
+
+    # Test method to check conversion of token to ID
+    def test_token_to_id(self):
+        tokenizer = self.get_tokenizer()
+        token = "Hello"
+        token_id = tokenizer.convert_tokens_to_ids(token)
+        self.assertIsInstance(token_id, int)
+
+    # Test method to check conversion of ID to token
+    def test_id_to_token(self):
+        tokenizer = self.get_tokenizer()
+        token_id = tokenizer.convert_tokens_to_ids("How")
+        token = tokenizer.convert_ids_to_tokens(token_id)
+        self.assertEqual(token, "How")
+
+    # Test method to check special tokens
+    def test_special_tokens(self):
+        tokenizer = self.get_tokenizer(
+            vocab_file=self.vocab_file, cls_token="<s>", sep_token="</s>", pad_token="<pad>"
+        )
+        self.assertEqual(tokenizer.cls_token, "<s>")
+        self.assertEqual(tokenizer.sep_token, "</s>")
+        self.assertEqual(tokenizer.pad_token, "<pad>")
+
+    # Test method to check building inputs with special tokens
+    def test_build_inputs_with_special_tokens(self):
+        tokenizer = self.get_tokenizer()
+        token_ids_0 = tokenizer.convert_tokens_to_ids(["Hello", "world"])
+        token_ids_1 = tokenizer.convert_tokens_to_ids(["How", "are", "you"])
+
+        input_ids = tokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1)
+        self.assertEqual(input_ids[0], tokenizer.cls_token_id)
+        self.assertEqual(input_ids[-1], tokenizer.sep_token_id)
+
+
+if __name__ == "__main__":
+    unittest.main()
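For context on the pair-encoding behavior that test_build_inputs_with_special_tokens exercises, a minimal sketch follows. It assumes PaddleNLP's XLMRobertaTokenizer follows the standard XLM-R layout for sentence pairs (<s> A </s> </s> B </s>) and reuses the BAAI/bge-m3 checkpoint referenced in the test above; only APIs that appear in the patch are used.

    # Minimal sketch of the special-token round trip checked by the new test.
    from paddlenlp.transformers import XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained("BAAI/bge-m3")
    ids_a = tokenizer.convert_tokens_to_ids(["Hello", "world"])
    ids_b = tokenizer.convert_tokens_to_ids(["How", "are", "you"])

    # If the HF XLM-R convention is followed, the pair is wrapped as
    # <s> A </s> </s> B </s>, i.e. four special tokens are inserted.
    input_ids = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)

    assert input_ids[0] == tokenizer.cls_token_id   # <s>
    assert input_ids[-1] == tokenizer.sep_token_id  # </s>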