diff --git a/docs/llm/peft.md b/docs/llm/peft.md
index 234756e0f71b..f720138c6d23 100644
--- a/docs/llm/peft.md
+++ b/docs/llm/peft.md
@@ -277,4 +277,4 @@ key function
该函数会遍历整个权重参数列表,对于每个权重参数weight,统计所有进行梯度更新的参数,最后将信息打印出来。
```
-更详细的使用可以参考[finetuning 脚本](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/finetune_generation.py)版本, 以及对应的启动脚本编写方式(写在 [README.md](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/README.md)文件中)。
+更详细的使用可以参考[finetuning 脚本](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/run_finetune.py)版本, 以及对应的启动脚本编写方式(写在 [README.md](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/README.md)文件中)。
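Reviewer note on the `peft.md` hunk above: the context line describes a helper that walks the full list of weights, counts the ones that receive gradient updates, and prints the result. The counting logic can be sketched as follows, assuming each weight exposes a `shape` and a `stop_gradient` flag. `FakeParam`, `summarize_trainable`, and `format_summary` are hypothetical names used only for illustration; they are not PaddleNLP APIs, and the printed format is an assumption.

```python
from dataclasses import dataclass
from math import prod
from typing import Iterable, Tuple


@dataclass
class FakeParam:
    """Hypothetical stand-in for a framework parameter (not a Paddle class)."""

    shape: Tuple[int, ...]
    stop_gradient: bool  # True means the weight is frozen


def summarize_trainable(named_params: Iterable[Tuple[str, FakeParam]]) -> Tuple[int, int]:
    """Return (trainable, total) element counts over all weights."""
    trainable = 0
    total = 0
    for _name, weight in named_params:
        count = prod(weight.shape)  # number of elements in this weight
        total += count
        if not weight.stop_gradient:
            trainable += count
    return trainable, total


def format_summary(trainable: int, total: int) -> str:
    """Build the one-line report; the wording here is an assumption."""
    ratio = 100.0 * trainable / total if total else 0.0
    return f"trainable params: {trainable} || all params: {total} || trainable%: {ratio:.4f}"


if __name__ == "__main__":
    params = [
        ("linear.weight", FakeParam((4, 4), stop_gradient=True)),
        ("linear.lora_A", FakeParam((4, 2), stop_gradient=False)),
        ("linear.lora_B", FakeParam((2, 4), stop_gradient=False)),
    ]
    print(format_summary(*summarize_trainable(params)))
```

With a real model, the same loop would presumably run over the model's named parameters instead of the hand-built list.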
diff --git a/llm/llama/benchmark.py b/legacy/examples/benchmark/llm/llama_single_gpu/benchmark.py
similarity index 100%
rename from llm/llama/benchmark.py
rename to legacy/examples/benchmark/llm/llama_single_gpu/benchmark.py
diff --git a/llm/llama/benchmark_utils.py b/legacy/examples/benchmark/llm/llama_single_gpu/benchmark_utils.py
similarity index 100%
rename from llm/llama/benchmark_utils.py
rename to legacy/examples/benchmark/llm/llama_single_gpu/benchmark_utils.py
diff --git a/llm/.gitignore b/llm/.gitignore
deleted file mode 100644
index d81fdef50031..000000000000
--- a/llm/.gitignore
+++ /dev/null
@@ -1,12 +0,0 @@
-# tmp files
-infer.json
-output.json
-
-# data
-AdvertiseGen.tar.gz
-
-# checkpoints
-checkpoints/
-
-# inference_model
-inference*/
\ No newline at end of file
diff --git a/llm/Alignment/RM/models b/llm/Alignment/RM/models
deleted file mode 120000
index 39963209bbb5..000000000000
--- a/llm/Alignment/RM/models
+++ /dev/null
@@ -1 +0,0 @@
-../PPO/models
\ No newline at end of file
diff --git a/llm/README.md b/llm/README.md
index 36311c9980d1..c3009a1ceab2 100644
--- a/llm/README.md
+++ b/llm/README.md
@@ -19,17 +19,17 @@
## 🛠️ 支持模型列表 🛠️
-| Model | Pretrain | SFT | LoRA | Prefix Tuning | Quantization | Weight convert |
-| --- | --- | --- | --- | --- | --- | --- |
-| [LLaMA/LLaMA2](./llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [Baichuan/Baichuan2](./llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [ChatGLM-6B](./chatglm) | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
-| [ChatGLM2/ChatGLM3](./chatglm2) | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [Qwen](./qwen) | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |j
-| [Bloom](./bloom) | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [GPT-3](./gpt-3) | ✅ | ✅ | 🚧 | 🚧 | 🚧 | ✅ |
-| [OPT](./opt) | 🚧 | ✅ | ✅ | 🚧 | 🚧 | ✅ |
-| [GLM](./glm) | ❌ | ✅ | ✅ | 🚧 | 🚧 | ✅ |
+| Model | Pretrain | SFT | LoRA | Prefix Tuning | DPO | Quantization | Weight convert |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| [LLaMA](./llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [Qwen](./qwen) | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
+| [Mixtral](./mixtral) | ✅ | ✅ | ✅ | ❌ | 🚧 | 🚧 | 🚧 |
+| [Baichuan/Baichuan2](./llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [ChatGLM-6B](./chatglm) | ❌ | ✅ | ✅ | ✅ | 🚧 | ✅ | ❌ |
+| [ChatGLM2/ChatGLM3](./chatglm2) | ❌ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
+| [Bloom](./bloom) | ❌ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
+| [GPT-3](./gpt-3) | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ |
+| [OPT](./opt) | 🚧 | ✅ | ✅ | 🚧 | 🚧 | 🚧 | ✅ |
* ✅: Supported
* 🚧: In Progress
@@ -39,7 +39,7 @@
## 🚀 快速开始 🚀
### 1. 预训练
-PaddleNLP将飞桨4D并行策略加入到Trainer API中, 用户只需修改Trainer配置即可使用不同的分布式策略。目前工具链提供[LLaMA/LLaMA2](./llama)、[GPT-3](./gpt-3)、[Qwen](./qwen)、[Baichuan/Baichuan2](./llama) 等模型预训练功能,更多模型支持持续更新中。
+PaddleNLP将飞桨4D并行策略加入到Trainer API中, 用户只需修改Trainer配置即可使用不同的分布式策略。目前工具链提供[LLaMA/LLaMA2](./llama)、[GPT-3](./gpt-3)、[Qwen](./qwen)、[Baichuan/Baichuan2](./llama)、[Mixtral](./mixtral) 等模型预训练功能,更多模型支持持续更新中。
@@ -54,7 +54,7 @@ PaddleNLP将飞桨4D并行策略加入到Trainer API中, 用户只需修改Tra
我们在此处提供了更详细的[预训练数据制作](),[分布式策略支持情况]( https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html#model-capability),[性能测试报告文档](https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html#model-performance),参见: https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html. 大模型权重列表参见[此处](https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html#model-weight)
-此项目支持了LLaMA、GPT-3、BaiChuan、Qwen 等大模型的预训练。用户切换配置config文件,即可一键运行。
+此项目支持了LLaMA、GPT-3、BaiChuan、Qwen、Mixtral 等大模型的预训练。用户切换配置config文件,即可一键运行。
数据详细制作流程可参考[此处](https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/dataset.html) : https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/dataset.html
@@ -79,30 +79,26 @@ mv llama_openwebtext_100k.idx ./data
```shell
# 编译自定义算子,可选
-cd ../model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -
+cd ../legacy/model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -
-# llama 模型预训练
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./llama/pretrain-llama2_7b-tp2sd4_stage2.json
-
-# Qwen 模型预训练
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./qwen/pretrain_argument_stage2.json
+# 模型预训练参考
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
```
注意:
1. 建议使用paddle develop版本训练,需要安装`pip install tool_helpers visualdl==2.5.3`等相关缺失whl包
2. `use_flash_attention` 需要在A100机器开启,建议使用cuda11.8环境。
-3. `use_fused_rms_norm` 需要安装[此目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt-3/external_ops)下的自定义OP, `python setup.py install`。如果安装后仍然找不到算子,需要额外设置PYTHONPATH
+3. `use_fused_rms_norm` 需要安装自定义算子。如果安装后仍然找不到算子,需要额外设置PYTHONPATH
4. `continue_training` 表示从现有的预训练模型加载训练。7b模型初始loss大概为2.xx, 随机初始化模型loss从11.x左右下降。
-5. 当前脚本为sharding版本,需要4D并行训练(数据、sharding、张量、流水线并行)的用户,请参考 `run_trainer_tp4pp2.sh`脚本。
-6. 多机训练时,若各机器使用的训练数据文件位置相同(例如挂载共享硬盘情况),请指定`--share_folder true`使全局0号卡制作缓存数据。否则默认各台机器的0号卡独立制作缓存数据,
-7. 若数据集文件夹中存在默认缓存文件夹`index-cache/`,则额外指定的`--data_cache`不生效,训练时优先加载默认缓存文件夹中的内容。
+5. 多机训练时,若各机器使用的训练数据文件位置相同(例如挂载共享硬盘情况),请指定`--share_folder true`使全局0号卡制作缓存数据。否则默认各台机器的0号卡独立制作缓存数据。
+6. 若数据集文件夹中存在默认缓存文件夹`index-cache/`,则额外指定的`--data_cache`不生效,训练时优先加载默认缓存文件夹中的内容。
### 2. 精调
PaddleNLP支持多个主流大模型的SFT、LoRA、Prefix Tuning等精调策略,提供统一、高效精调方案:
- **统一训练入口**。飞桨大模型套件精调方案可适配业界主流大模型,用户只需修改配置文件,即能在单卡或多卡(支持4D并行分布式策略)进行多种大模型精调。
-- **高效数据和分布式策略**。Zero Padding零填充优化策略有效减少了pad token的占比,提高模型训练效率高达100%。独创PEFT结合低比特和分布式并行策略,大幅降低大模型精调硬件门槛,支持单卡(A100 80G)百亿模型微调、单机(A100 80G * 8)千亿模型微调。
+- **高效数据和分布式策略**。Zero Padding零填充优化策略结合FlashMask策略有效提升模型训练效率。独创PEFT结合低比特和分布式并行策略,大幅降低大模型精调硬件门槛,支持单卡(A100 80G)百亿模型微调、单机(A100 80G * 8)千亿模型微调。
- **支持多轮对话**。支持统一对话模板,支持多轮对话高效训练,详参[多轮对话文档](./docs/chat_template.md)。
@@ -137,26 +133,26 @@ tar -zxvf AdvertiseGen.tar.gz
**全参精调:SFT**
```bash
-# 四卡llama SFT启动命令参考
-python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_argument.json
+# SFT启动命令参考
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
```
**LoRA**
```bash
-# 单卡llama LoRA启动命令参考
-python finetune_generation.py ./llama/lora_argument.json
+# LoRA启动命令参考
+python run_finetune.py ./config/llama/lora_argument.json
```
**Prefix Tuning**
```bash
-# 单卡llama Prefix Tuning启动命令参考
-python finetune_generation.py ./llama/pt_argument.json
+# Prefix Tuning启动命令参考
+python run_finetune.py ./config/llama/pt_argument.json
```
更多大模型精调分布式使用文档、训练细节和效果请参见[大模型精调教程](./docs/finetune.md)。
### 3. 对齐
-我们支持DPO等偏好对齐策略。
+我们支持DPO等偏好对齐策略。DPO策略采用zero_padding策略,结合FlashMask策略,有效提升模型训练效率。
**数据准备**:
@@ -189,10 +185,10 @@ wget https://bj.bcebos.com/paddlenlp/datasets/examples/ultrafeedback_binarized.t
tar -zxvf ultrafeedback_binarized.tar.gz
```
-**全参精调:SFT**
+**全参DPO**
```bash
-# 四卡llama SFT启动命令参考
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" dpo_train.py ./llama/dpo_argument.json
+# DPO启动命令参考
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
```
### 4. 量化
@@ -215,10 +211,10 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" dpo_train.py ./
```
# PTQ 量化启动命令参考
-python finetune_generation.py ./llama/ptq_argument.json
+python run_finetune.py ./config/llama/ptq_argument.json
# GPTQ 量化启动命令参考
-python finetune_generation.py ./llama/ptq_argument.json
+python run_finetune.py ./config/llama/gptq_argument.json
```
更多技术细节和模型量化使用详见[量化文档](./docs/quantization.md)。
@@ -231,13 +227,13 @@ PaddleNLP除了提供常用模型推理外,还提供了高性能推理,内
```shell
# 动态图模型推理命令参考
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file ./data/dev.json --dtype float16
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file ./data/dev.json --dtype float16
# 静态图模型推理命令参考
# step1 : 静态图导出
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --output_path ./inference --dtype float16
# step2: 静态图推理
-python predictor.py --model_name_or_path ./inference --data_file ./data/dev.json --dtype float16 --mode static
+python ./predict/predictor.py --model_name_or_path ./inference --data_file ./data/dev.json --dtype float16 --mode static
```
- **InferenceModel 高性能推理**:PaddleNLP 还提供了高性能推理模型加快并行推理的速度,同时支持FP16、Prefix Tuning、WINT8、A8W8多种推理方式。
@@ -253,13 +249,13 @@ python predictor.py --model_name_or_path ./inference --data_file ./data/dev.json
```shell
# 高性能动态图模型推理命令参考
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16
# 高性能静态图模型推理命令参考
# step1 : 静态图导出
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
# step2: 静态图推理
-python predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static"
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static"
```
更多常用模型推理和高性能模型使用方法详见[大模型推理文档](./docs/inference.md)。
@@ -277,7 +273,7 @@ python predictor.py --model_name_or_path ./inference --inference_model --dtype "
我们提供了一套基于动态图推理的简单易用UI服务化部署脚本,用户可以快速部署服务化推理。
```
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" flask_server.py \
+python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./predict/flask_server.py \
--model_name_or_path meta-llama/Llama-2-7b-chat \
--port 8010 \
--flask_port 8011 \
@@ -287,7 +283,7 @@ python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" flask_server.py \
- `flask_port`: Flask服务端口号,默认8010。
- 其他参数请参见[推理文档](./docs/inference.md)中推理参数配置。
-此外,如果想通过API脚本的方式跑推理,可参考:`./request_flask_server.py` 文件。
+此外,如果想通过API脚本的方式跑推理,可参考:`./predict/request_flask_server.py` 文件。
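Reviewer note on the `llm/README.md` changes above: existing launch commands written against the old layout need their script and config paths updated. The sketch below rewrites a command token by token, using only the renames visible in this diff (the README hunks and the `rename from`/`rename to` headers). `migrate_command` and the two tables are hypothetical helpers, not part of the repository, and the config rule is an assumption that covers only `./llama/*_argument.json` files.

```python
import shlex

# Old entry-point name -> new path relative to llm/, as read from this diff.
SCRIPT_RENAMES = {
    "finetune_generation.py": "run_finetune.py",
    "dpo_train.py": "./alignment/dpo/run_dpo.py",
    "predictor.py": "./predict/predictor.py",
    "export_model.py": "./predict/export_model.py",
    "flask_server.py": "./predict/flask_server.py",
}

# Assumption: only llama argument files are handled by this sketch.
OLD_CONFIG_PREFIX = "./llama/"
NEW_CONFIG_PREFIX = "./config/llama/"


def migrate_command(command: str) -> str:
    """Rewrite old script and config paths in a shell command string."""
    migrated = []
    for token in shlex.split(command):
        if token in SCRIPT_RENAMES:
            token = SCRIPT_RENAMES[token]
        elif token.startswith(OLD_CONFIG_PREFIX) and token.endswith("_argument.json"):
            token = NEW_CONFIG_PREFIX + token[len(OLD_CONFIG_PREFIX):]
        migrated.append(token)
    # shlex.join re-quotes tokens only where the shell requires it.
    return shlex.join(migrated)


if __name__ == "__main__":
    print(migrate_command("python finetune_generation.py ./llama/lora_argument.json"))
```

Pretraining configs are deliberately left out, because their file names changed as well (for example `pretrain-llama2_7b-tp2sd4_stage2.json` became `pretrain_argument.json`), so a prefix rewrite alone would not be correct there.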
diff --git a/llm/Alignment/README.md b/llm/alignment/README.md
similarity index 87%
rename from llm/Alignment/README.md
rename to llm/alignment/README.md
index fbf978dc208c..7a9a54408d92 100644
--- a/llm/Alignment/README.md
+++ b/llm/alignment/README.md
@@ -8,7 +8,7 @@
```
.
-├── PPO # PPO 训练相关目录
+├── ppo # PPO 训练相关目录
│ ├── comm_utils.py # 通信相关工具py文件
│ ├── data # 数据集相关目录
│ │ ├── alpaca.py # alpaca(raw)数据集py文件
@@ -28,16 +28,16 @@
│ │ ├── ppo_model_utils.py # PPO loss等模型策略py文件
│ │ ├── score_model.py # score model模型定义py文件
│ │ └── score_model_utils.py # score model基类及工具py文件
-│ ├── ppo_main.py # RLHF训练脚本
+│ ├── run_ppo.py # RLHF训练脚本
│ ├── ppo_trainer.py # RLHF训练执行器py脚本
│ ├── tests # 测试相关目录
│ │ ├── run_model.py
│ │ └── test_export.py
│ └── trainer_utils.py # Trainer补丁及工具py脚本
├── README.md
-└── RM # Reward Model 训练相关目录
- ├── models -> ../PPO/models
- ├── reward_main.py # reward model训练脚本
+└── rm # Reward Model 训练相关目录
+ ├── models -> ../ppo/models
+ ├── run_reward.py # reward model训练脚本
└── reward_trainer.py # reward训练执行器py脚本
```
@@ -179,14 +179,14 @@ PPO 完整的训练过程包括以下 3 个阶段,如下图所示(来自[Dee
2. Reward Model Fine-Tuning
-使用 `reward_main.py` 脚本根据 `rm.json` 训练奖励模型
+使用 `run_reward.py` 脚本根据 `rm_argument.json` 训练奖励模型
```
-cd RM
-python -u -m paddle.distributed.launch reward_main.py ../../config/llama/rm.json
+cd rm
+python -u -m paddle.distributed.launch run_reward.py ../../config/llama/rm_argument.json
```
-`rm.json` 中的绝大部分参数释义同[LLM 精调](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83),不再赘述;稍有区别的是 `train_datasets`/`eval_datasets` 分别使用数据集定义注册时的`NAME`属性给出训练和验证集。另外对于奖励模型训练有以下特殊参数配置及释义(使用 PKU-Alignment/PKU-SafeRLHF 中的默认值):
+`rm_argument.json` 中的绝大部分参数释义同[LLM 精调](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83),不再赘述;稍有区别的是 `train_datasets`/`eval_datasets` 分别使用数据集定义注册时的`NAME`属性给出训练和验证集。另外对于奖励模型训练有以下特殊参数配置及释义(使用 PKU-Alignment/PKU-SafeRLHF 中的默认值):
- `normalize_score_during_training`:是否在训练过程中对奖励进行 normalize,默认为 `False`。
- `normalizer_type`:使用 normalizer 时计算 mean、var 的方式,可选`"RunningMeanStd", "ExponentialMovingAverage"`。
@@ -196,15 +196,15 @@ python -u -m paddle.distributed.launch reward_main.py ../../config/llama/rm.json
3. RLHF:
-RLHF 阶段需要 actor model、reference model、critic model、reward model 四个模型;actor-model/reference-model 使用 SFT 模型进行 initialize/frozen;critic-model/reward-model 使用 reward 模型进行 initialize/frozen (另外注意若 SFT 使用 LoRA 请先将 LoRA 权重合并)。这里使用 PKU-Alignment/PKU-SafeRLHF 提供的 SFT 模型([PKU-Alignment/alpaca-7b-reproduced](https://huggingface.co/PKU-Alignment/alpaca-7b-reproduced))和 reward 模型([PKU-Alignment/beaver-7b-v1.0-reward](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward),注意该模型只关注 helpful 未考量 harmless)作为示例,使用 `ppo_main.py` 脚本根据 `ppo.json` 进行 RLHF 训练。
+RLHF 阶段需要 actor model、reference model、critic model、reward model 四个模型;actor-model/reference-model 使用 SFT 模型进行 initialize/frozen;critic-model/reward-model 使用 reward 模型进行 initialize/frozen (另外注意若 SFT 使用 LoRA 请先将 LoRA 权重合并)。这里使用 PKU-Alignment/PKU-SafeRLHF 提供的 SFT 模型([PKU-Alignment/alpaca-7b-reproduced](https://huggingface.co/PKU-Alignment/alpaca-7b-reproduced))和 reward 模型([PKU-Alignment/beaver-7b-v1.0-reward](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward),注意该模型只关注 helpful 未考量 harmless)作为示例,使用 `run_ppo.py` 脚本根据 `ppo_argument.json` 进行 RLHF 训练。
```
# 类型提升 warning 暂时通过 loglevel 屏蔽,待后续修复
-cd PPO
-PYTHONPATH=../../ GLOG_minloglevel=2 python -u -m paddle.distributed.launch ppo_main.py ../../config/llama/ppo.json
+cd ppo
+PYTHONPATH=../../ GLOG_minloglevel=2 python -u -m paddle.distributed.launch run_ppo.py ../../config/llama/ppo_argument.json
```
-`ppo.json` 中的绝大部分参数释义同[LLM 精调](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83),不再赘述,重点给出以下参数配置及释义(使用 PKU-Alignment/PKU-SafeRLHF 中的默认值):
+`ppo_argument.json` 中的绝大部分参数释义同[LLM 精调](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83),不再赘述,重点给出以下参数配置及释义(使用 PKU-Alignment/PKU-SafeRLHF 中的默认值):
- `train_datasets`:使用数据集定义注册时的`NAME`属性给出训练集。
- `eval_datasets`:使用数据集定义注册时的`NAME`属性给出验证集。
@@ -230,7 +230,7 @@ PYTHONPATH=../../ GLOG_minloglevel=2 python -u -m paddle.distributed.launch ppo_
此外为了支持更高性、更大规模的 RLHF 训练提供了以下特殊参数配置,可以按需使用:
- `use_fusemt`:安装 paddlenlp_ops 后将在 rollout 生成时开启生成加速(开启流水线并行时不支持生成加速),通过此设置可以禁用生成加速。
- `eval_mode`:支持为空或者设置为 "single"、"tensor_parallel";通常可以在使用流水线并行训练时设置为"tensor_parallel",以此在 rollout 生成阶段使用非流水线并行模型并进行生成加速。
-- `offload_level`:支持设置为"freeze_model"、"optimizer"、"train_model"或者同时使用(空格分隔),分别指示 reward+reference 两个冻结模型、actor+critic 两个训练模型的优化器状态和模型参数的 offload/reload,用于在不同阶段 model/optimizer 使用结束后及时 offload 并在下次使用时 reload 相应参数权重以节省显存。
+- `offload_level`:支持设置为"freeze_model"、"optimizer"、"train_model"或者同时使用(空格分隔),分别指示 reward+reference 两个冻结模型、actor+critic 两个训练模型的优化器状态和模型参数的 offload/reload,用于在不同阶段 model/optimizer 使用结束后及时 offload 并在下次使用时 reload 相应参数权重以节省显存。
另外注意,在使用流水线并行时(pipeline_parallel_degree大于1)建议将 `dataloader_drop_last` 设置为 true, 以此避免不同batch size带来的问题。
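Reviewer note on the `offload_level` bullet in the alignment README above: the option accepts "freeze_model", "optimizer", "train_model", or several of them separated by spaces. A sketch of parsing and validating such a value is given below. `parse_offload_level` is a hypothetical helper; how `run_ppo.py` actually parses the option is not shown in this diff.

```python
from typing import FrozenSet

# The three values named in the README.
ALLOWED_OFFLOAD_LEVELS = frozenset({"freeze_model", "optimizer", "train_model"})


def parse_offload_level(value: str) -> FrozenSet[str]:
    """Split a space-separated offload_level string and reject unknown names."""
    levels = frozenset(value.split())
    unknown = levels - ALLOWED_OFFLOAD_LEVELS
    if unknown:
        raise ValueError(f"unsupported offload_level value(s): {sorted(unknown)}")
    return levels


if __name__ == "__main__":
    levels = parse_offload_level("freeze_model optimizer")
    # Each flag can then gate the matching offload/reload step.
    print("offload frozen models:", "freeze_model" in levels)
    print("offload optimizer state:", "optimizer" in levels)
    print("offload trained models:", "train_model" in levels)
```

Returning a set makes the order of the names and any repeated entries irrelevant, and an empty string simply disables all offloading.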
diff --git a/llm/dpo_argument.py b/llm/alignment/dpo/dpo_argument.py
similarity index 100%
rename from llm/dpo_argument.py
rename to llm/alignment/dpo/dpo_argument.py
diff --git a/llm/dpo_train.py b/llm/alignment/dpo/run_dpo.py
similarity index 100%
rename from llm/dpo_train.py
rename to llm/alignment/dpo/run_dpo.py
diff --git a/llm/Alignment/PPO/comm_utils.py b/llm/alignment/ppo/comm_utils.py
similarity index 100%
rename from llm/Alignment/PPO/comm_utils.py
rename to llm/alignment/ppo/comm_utils.py
diff --git a/llm/Alignment/PPO/data/__init__.py b/llm/alignment/ppo/data/__init__.py
similarity index 100%
rename from llm/Alignment/PPO/data/__init__.py
rename to llm/alignment/ppo/data/__init__.py
diff --git a/llm/Alignment/PPO/data/alpaca.py b/llm/alignment/ppo/data/alpaca.py
similarity index 100%
rename from llm/Alignment/PPO/data/alpaca.py
rename to llm/alignment/ppo/data/alpaca.py
diff --git a/llm/Alignment/PPO/data/base.py b/llm/alignment/ppo/data/base.py
similarity index 100%
rename from llm/Alignment/PPO/data/base.py
rename to llm/alignment/ppo/data/base.py
diff --git a/llm/Alignment/PPO/data/preference.py b/llm/alignment/ppo/data/preference.py
similarity index 100%
rename from llm/Alignment/PPO/data/preference.py
rename to llm/alignment/ppo/data/preference.py
diff --git a/llm/Alignment/PPO/data/prompt_only.py b/llm/alignment/ppo/data/prompt_only.py
similarity index 100%
rename from llm/Alignment/PPO/data/prompt_only.py
rename to llm/alignment/ppo/data/prompt_only.py
diff --git a/llm/Alignment/PPO/data/safe_rlhf.py b/llm/alignment/ppo/data/safe_rlhf.py
similarity index 100%
rename from llm/Alignment/PPO/data/safe_rlhf.py
rename to llm/alignment/ppo/data/safe_rlhf.py
diff --git a/llm/Alignment/PPO/data/supervised.py b/llm/alignment/ppo/data/supervised.py
similarity index 100%
rename from llm/Alignment/PPO/data/supervised.py
rename to llm/alignment/ppo/data/supervised.py
diff --git a/llm/Alignment/PPO/infer_utils.py b/llm/alignment/ppo/infer_utils.py
similarity index 100%
rename from llm/Alignment/PPO/infer_utils.py
rename to llm/alignment/ppo/infer_utils.py
diff --git a/llm/Alignment/PPO/models/__init__.py b/llm/alignment/ppo/models/__init__.py
similarity index 100%
rename from llm/Alignment/PPO/models/__init__.py
rename to llm/alignment/ppo/models/__init__.py
diff --git a/llm/Alignment/PPO/models/infer_model_utils.py b/llm/alignment/ppo/models/infer_model_utils.py
similarity index 100%
rename from llm/Alignment/PPO/models/infer_model_utils.py
rename to llm/alignment/ppo/models/infer_model_utils.py
diff --git a/llm/Alignment/PPO/models/model_pp.py b/llm/alignment/ppo/models/model_pp.py
similarity index 100%
rename from llm/Alignment/PPO/models/model_pp.py
rename to llm/alignment/ppo/models/model_pp.py
diff --git a/llm/Alignment/PPO/models/pp_model_utils.py b/llm/alignment/ppo/models/pp_model_utils.py
similarity index 100%
rename from llm/Alignment/PPO/models/pp_model_utils.py
rename to llm/alignment/ppo/models/pp_model_utils.py
diff --git a/llm/Alignment/PPO/models/ppo_model.py b/llm/alignment/ppo/models/ppo_model.py
similarity index 100%
rename from llm/Alignment/PPO/models/ppo_model.py
rename to llm/alignment/ppo/models/ppo_model.py
diff --git a/llm/Alignment/PPO/models/ppo_model_utils.py b/llm/alignment/ppo/models/ppo_model_utils.py
similarity index 100%
rename from llm/Alignment/PPO/models/ppo_model_utils.py
rename to llm/alignment/ppo/models/ppo_model_utils.py
diff --git a/llm/Alignment/PPO/models/score_model.py b/llm/alignment/ppo/models/score_model.py
similarity index 100%
rename from llm/Alignment/PPO/models/score_model.py
rename to llm/alignment/ppo/models/score_model.py
diff --git a/llm/Alignment/PPO/models/score_model_utils.py b/llm/alignment/ppo/models/score_model_utils.py
similarity index 100%
rename from llm/Alignment/PPO/models/score_model_utils.py
rename to llm/alignment/ppo/models/score_model_utils.py
diff --git a/llm/Alignment/PPO/ppo_trainer.py b/llm/alignment/ppo/ppo_trainer.py
similarity index 100%
rename from llm/Alignment/PPO/ppo_trainer.py
rename to llm/alignment/ppo/ppo_trainer.py
diff --git a/llm/Alignment/PPO/ppo_main.py b/llm/alignment/ppo/run_ppo.py
similarity index 100%
rename from llm/Alignment/PPO/ppo_main.py
rename to llm/alignment/ppo/run_ppo.py
diff --git a/llm/Alignment/PPO/tests/run_model.py b/llm/alignment/ppo/tests/run_model.py
similarity index 100%
rename from llm/Alignment/PPO/tests/run_model.py
rename to llm/alignment/ppo/tests/run_model.py
diff --git a/llm/Alignment/PPO/tests/test_export.py b/llm/alignment/ppo/tests/test_export.py
similarity index 100%
rename from llm/Alignment/PPO/tests/test_export.py
rename to llm/alignment/ppo/tests/test_export.py
diff --git a/llm/Alignment/PPO/trainer_utils.py b/llm/alignment/ppo/trainer_utils.py
similarity index 100%
rename from llm/Alignment/PPO/trainer_utils.py
rename to llm/alignment/ppo/trainer_utils.py
diff --git a/llm/alignment/rm/models b/llm/alignment/rm/models
new file mode 120000
index 000000000000..46643733d940
--- /dev/null
+++ b/llm/alignment/rm/models
@@ -0,0 +1 @@
+../ppo/models
\ No newline at end of file
diff --git a/llm/Alignment/RM/reward_trainer.py b/llm/alignment/rm/reward_trainer.py
similarity index 100%
rename from llm/Alignment/RM/reward_trainer.py
rename to llm/alignment/rm/reward_trainer.py
diff --git a/llm/Alignment/RM/reward_main.py b/llm/alignment/rm/run_reward.py
similarity index 100%
rename from llm/Alignment/RM/reward_main.py
rename to llm/alignment/rm/run_reward.py
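Reviewer note for the `run_pretrain_auto.py` diff that follows: the reformatted `train_val_test_num_samples` expression in `create_pretrained_dataset` is easier to check once it is pulled out as plain arithmetic. The sketch below copies the three products from the added lines. The function name `target_sample_counts` and its keyword-only signature are hypothetical; the script itself reads these values from `training_args`.

```python
from typing import List


def target_sample_counts(
    *,
    per_device_train_batch_size: int,
    per_device_eval_batch_size: int,
    data_parallel_degree: int,
    max_steps: int,
    gradient_accumulation_steps: int,
    eval_iters: int,
    eval_steps: int,
    test_iters: int,
) -> List[int]:
    """Return the minimum [train, validation, test] sample counts."""
    train = (
        per_device_train_batch_size
        * data_parallel_degree
        * max_steps
        * gradient_accumulation_steps
    )
    # One evaluation pass every eval_steps, plus one extra pass.
    valid = (
        per_device_eval_batch_size
        * data_parallel_degree
        * eval_iters
        * (max_steps // eval_steps + 1)
    )
    test = per_device_eval_batch_size * data_parallel_degree * test_iters
    return [train, valid, test]


if __name__ == "__main__":
    print(
        target_sample_counts(
            per_device_train_batch_size=2,
            per_device_eval_batch_size=4,
            data_parallel_degree=8,
            max_steps=1000,
            gradient_accumulation_steps=4,
            eval_iters=10,
            eval_steps=100,
            test_iters=5,
        )
    )
```

The before and after versions of the expression in the diff reduce to the same three products, which matches the formatting-only intent of that hunk.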
diff --git a/llm/gpt-3/auto_parallel/run_pretrain_auto.py b/llm/auto_parallel/gpt-3/run_pretrain_auto.py
similarity index 70%
rename from llm/gpt-3/auto_parallel/run_pretrain_auto.py
rename to llm/auto_parallel/gpt-3/run_pretrain_auto.py
index 0ee470d37255..5afb828d0e2f 100644
--- a/llm/gpt-3/auto_parallel/run_pretrain_auto.py
+++ b/llm/auto_parallel/gpt-3/run_pretrain_auto.py
@@ -18,7 +18,6 @@
import random
import sys
import types
-from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional
@@ -33,10 +32,10 @@
from paddlenlp.transformers import (
AutoTokenizer,
CosineAnnealingWithWarmupDecay,
- LinearAnnealingWithWarmupDecay,
GPTConfig,
GPTForCausalLMAuto,
GPTPretrainingCriterionAuto,
+ LinearAnnealingWithWarmupDecay,
)
from paddlenlp.utils.log import logger
@@ -50,11 +49,10 @@
print_rank_0,
)
-def add_start_docstrings(*docstr):
+def add_start_docstrings(*docstr):
def docstring_decorator(fn):
- fn.__doc__ = "".join(docstr) + (fn.__doc__
- if fn.__doc__ is not None else "")
+ fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
return fn
return docstring_decorator
@@ -70,22 +68,19 @@ class PreTrainingArguments(TrainingArguments):
decay_steps: float = field(
default=None,
metadata={
- "help":
- "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate."
+ "help": "The steps used to control the learning rate. If the step > decay_steps, will use the min_learning_rate."
},
)
enable_linear_fused_grad_add: bool = field(
default=False,
metadata={
- "help":
- "Enable fused linear grad add strategy, which will reduce elementwise add for grad accumulation in the backward of nn.Linear ."
+ "help": "Enable fused linear grad add strategy, which will reduce elementwise add for grad accumulation in the backward of nn.Linear ."
},
)
fused_linear_param_grad_add: bool = field(
default=False,
metadata={
- "help":
- "Enable fused_linear_param_grad pass, which should replace add_n_op with add_op for gradients accumulation."
+ "help": "Enable fused_linear_param_grad pass, which should replace add_n_op with add_op for gradients accumulation."
},
)
job_schedule_profiler_start: int = field(
@@ -97,27 +92,19 @@ class PreTrainingArguments(TrainingArguments):
metadata={"help": "The step to end job_schedule_profiler."},
)
pipeline_schedule_mode: str = field(
- default="1F1B",
- metadata={
- "help":
- "The pipeline schedule mode, support FThenB, 1F1B, VPP and Eager-1F1B."
- })
- sr: Optional[int] = field(
- default=0, metadata={"help": "The count of chunks without recompute."})
+ default="1F1B", metadata={"help": "The pipeline schedule mode, support FThenB, 1F1B, VPP and Eager-1F1B."}
+ )
+ sr: Optional[int] = field(default=0, metadata={"help": "The count of chunks without recompute."})
refined_ops_patterns: Optional[List[str]] = field(
- default=None, metadata={"help": "The pattern of refined recompute."})
+ default=None, metadata={"help": "The pattern of refined recompute."}
+ )
virtual_pipeline_seg_method: str = field(
- default="LlamaDecoderLayerAuto",
- metadata={
- "help": "The seg method of spliting pp layer for virtual pipeline."
- })
+ default="LlamaDecoderLayerAuto", metadata={"help": "The seg method of splitting pp layer for virtual pipeline."}
+ )
# NOTE(gongenlei): new add autotuner_benchmark
autotuner_benchmark: bool = field(
default=False,
- metadata={
- "help":
- "Weather to run benchmark by autotuner. True for from_scratch and pad_max_length."
- },
+ metadata={"help": "Whether to run benchmark by autotuner. True for from_scratch and pad_max_length."},
)
def __post_init__(self):
@@ -140,8 +127,7 @@ def __post_init__(self):
if self.fused_linear_param_grad_add:
fused_passes = self.strategy.fused_passes
fused_passes.enable = True
- fused_passes.fused_passes_list.append(
- "fused_linear_param_grad_add_pass")
+ fused_passes.fused_passes_list.append("fused_linear_param_grad_add_pass")
logger.info(self.strategy)
@@ -155,39 +141,28 @@ class DataArguments:
"""
input_dir: str = field(
- default=None,
- metadata={
- "help":
- "The name of the dataset to use (via the datasets library)."
- })
- split: str = field(default="949,50,1",
- metadata={"help": "Train/valid/test data split."})
+ default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
+ )
+ split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."})
max_seq_length: int = field(
default=1024,
metadata={
- "help":
- "The maximum total input sequence length after tokenization. Sequences longer "
+ "help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
share_folder: bool = field(
default=False,
- metadata={
- "help":
- "Use share folder for data dir and output dir on multi machine."
- },
+ metadata={"help": "Use share folder for data dir and output dir on multi machine."},
)
- data_impl: str = field(
- default="mmap",
- metadata={"help": "The format of the preprocessed data."})
+ data_impl: str = field(default="mmap", metadata={"help": "The format of the preprocessed data."})
skip_warmup: bool = field(
default=True,
metadata={"help": "Whether to skip the warmup process of mmap files."},
)
- data_cache: str = field(
- default=None, metadata={"help": "The path of the cached dataset."})
+ data_cache: str = field(default=None, metadata={"help": "The path of the cached dataset."})
@dataclass
@@ -197,52 +172,35 @@ class ModelArguments:
"""
model_type: Optional[str] = field(
- default="llama",
- metadata={"help": "Only support for llama pre-training for now."})
+ default="llama", metadata={"help": "Only support for llama pre-training for now."}
+ )
model_name_or_path: str = field(
default="__internal_testing__/tiny-random-llama",
metadata={
- "help":
- "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html"
+ "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html"
},
)
tokenizer_name_or_path: Optional[str] = field(
- default=None,
- metadata={
- "help":
- "Pretrained tokenizer name or path if not the same as model_name"
- })
+ default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+ )
config_name: Optional[str] = field(
- default=None,
- metadata={
- "help":
- "Pretrained config name or path if not the same as model_name"
- })
+ default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+ )
vocab_size: Optional[int] = field(
default=None,
metadata={
- "help":
- ".Vocabulary size of the Llama model. Defines the number of different tokens that can be represented by the `inputs_ids`"
+ "help": "Vocabulary size of the Llama model. Defines the number of different tokens that can be represented by the `inputs_ids`"
},
)
- hidden_size: Optional[int] = field(
- default=None,
- metadata={"help": "Dimension of the hidden representations."})
- intermediate_size: Optional[int] = field(
- default=None,
- metadata={"help": "Dimension of the MLP representations."})
+ hidden_size: Optional[int] = field(default=None, metadata={"help": "Dimension of the hidden representations."})
+ intermediate_size: Optional[int] = field(default=None, metadata={"help": "Dimension of the MLP representations."})
num_hidden_layers: Optional[int] = field(
- default=None,
- metadata={
- "help": "Number of hidden layers in the Transformer encoder."
- })
+ default=None, metadata={"help": "Number of hidden layers in the Transformer encoder."}
+ )
num_attention_heads: Optional[int] = field(
default=None,
- metadata={
- "help":
- "Number of attention heads for each attention layer in the Transformer encoder."
- },
+ metadata={"help": "Number of attention heads for each attention layer in the Transformer encoder."},
)
use_flash_attention: bool = field(
default=False,
@@ -258,9 +216,7 @@ class ModelArguments:
)
fuse_attention_ffn: bool = field(
default=False,
- metadata={
- "help": "whether to fuse first up and gate proj in mlp block"
- },
+ metadata={"help": "whether to fuse first up and gate proj in mlp block"},
)
recompute_granularity: str = field(
default="full",
@@ -273,15 +229,12 @@ class ModelArguments:
continue_training: bool = field(
default=False,
metadata={
- "help":
- "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models."
+ "help": "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models."
},
)
- hidden_dropout_prob: float = field(
- default=0.1, metadata={"help": "The hidden dropout prob."})
- attention_probs_dropout_prob: float = field(
- default=0.1, metadata={"help": "The attention hidden dropout prob."})
+ hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."})
+ attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention hidden dropout prob."})
sequence_parallel: bool = field(
default=False,
@@ -297,16 +250,12 @@ class ModelArguments:
)
no_recompute_layers: Optional[List[int]] = field(
default=None,
- metadata={
- "help":
- "Specify the full transformer layers that should not be recomputed."
- },
+ metadata={"help": "Specify the full transformer layers that should not be recomputed."},
)
pp_recompute_interval: int = field(
default=1,
metadata={
- "help":
- "The interval for the number of layers at which recomputation occurs. A value of 0 indicates no recomputation. Default is 0."
+ "help": "The interval for the number of layers at which recomputation occurs. A value of 0 indicates no recomputation. Default is 1."
},
)
recompute_use_reentrant: bool = field(
@@ -323,30 +272,27 @@ def create_pretrained_dataset(
need_data=True,
):
- check_data_split(data_args.split, training_args.do_train,
- training_args.do_eval, training_args.do_predict)
+ check_data_split(data_args.split, training_args.do_train, training_args.do_eval, training_args.do_predict)
train_val_test_num_samples = [
- training_args.per_device_train_batch_size *
- training_args.data_parallel_degree * training_args.max_steps *
- training_args.gradient_accumulation_steps,
- training_args.per_device_eval_batch_size *
- training_args.data_parallel_degree * training_args.eval_iters *
- (training_args.max_steps // training_args.eval_steps + 1),
- training_args.per_device_eval_batch_size *
- training_args.data_parallel_degree * training_args.test_iters,
+ training_args.per_device_train_batch_size
+ * training_args.data_parallel_degree
+ * training_args.max_steps
+ * training_args.gradient_accumulation_steps,
+ training_args.per_device_eval_batch_size
+ * training_args.data_parallel_degree
+ * training_args.eval_iters
+ * (training_args.max_steps // training_args.eval_steps + 1),
+ training_args.per_device_eval_batch_size * training_args.data_parallel_degree * training_args.test_iters,
]
print_rank_0(" > datasets target sizes (minimum size):")
if training_args.do_train:
- print_rank_0(" train: {}".format(
- train_val_test_num_samples[0]))
+ print_rank_0(" train: {}".format(train_val_test_num_samples[0]))
if training_args.do_eval:
- print_rank_0(" validation: {}".format(
- train_val_test_num_samples[1]))
+ print_rank_0(" validation: {}".format(train_val_test_num_samples[1]))
if training_args.do_predict:
- print_rank_0(" test: {}".format(
- train_val_test_num_samples[2]))
+ print_rank_0(" test: {}".format(train_val_test_num_samples[2]))
# Build the datasets.
train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets(
@@ -399,9 +345,9 @@ def get_train_data_file(args):
return args.input_dir.split()
else:
files = [
- os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
- if (os.path.isfile(os.path.join(args.input_dir, f)) and (
- "_idx.npz" in str(f) or ".idx" in str(f)))
+ os.path.join(args.input_dir, f)
+ for f in os.listdir(args.input_dir)
+ if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f)))
]
files = [x.replace("_idx.npz", "") for x in files]
files = [x.replace(".idx", "") for x in files] # add
@@ -419,7 +365,6 @@ def get_train_data_file(args):
class PretrainingTrainer(AutoTrainer):
-
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
@@ -441,8 +386,7 @@ def print_config(args, key=""):
logger.info("{:^40}".format("{} Configuration Arguments".format(key)))
logger.info("{:30}: {}".format("paddle commit id", paddle.version.commit))
- logger.info("{:30}: {}".format("paddlenlp commit id",
- paddlenlp.version.commit))
+ logger.info("{:30}: {}".format("paddlenlp commit id", paddlenlp.version.commit))
for a in dir(args):
if a[:2] != "__": # don't print double underscore methods
@@ -467,12 +411,10 @@ def init_seed(seed: int = 1234, args=None):
dp_degree=args.data_parallel_degree,
pp_degree=args.pipeline_parallel_degree,
mp_degree=args.tensor_parallel_degree,
- sharding_degree=
- 1, # auto_parallel's sharding is not orthogonal with dp, mp and pp
+ sharding_degree=1, # auto_parallel's sharding is not orthogonal with dp, mp and pp
)
- global_seed, local_seed, random_seed = _get_distributed_seeds(
- args.seed, topo)
+ global_seed, local_seed, random_seed = _get_distributed_seeds(args.seed, topo)
paddle.seed(local_seed)
random.seed(random_seed)
@@ -480,8 +422,8 @@ def init_seed(seed: int = 1234, args=None):
logger.info(
"The global seed is set to {}, local seed is set to {} and "
- "random seed is set to {}.".format(global_seed, local_seed,
- random_seed))
+ "random seed is set to {}.".format(global_seed, local_seed, random_seed)
+ )
else:
random.seed(args.seed)
np.random.seed(args.seed)
@@ -489,14 +431,11 @@ def init_seed(seed: int = 1234, args=None):
def main():
- parser = PdArgumentParser(
- (ModelArguments, DataArguments, PreTrainingArguments))
+ parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
- model_args, data_args, training_args = parser.parse_json_file(
- json_file=os.path.abspath(sys.argv[1]))
+ model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
- model_args, data_args, training_args = parser.parse_args_into_dataclasses(
- )
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if training_args.enable_linear_fused_grad_add:
from fused_layers import mock_layers
@@ -524,15 +463,12 @@ def main():
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, "
- +
- f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}"
+ + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}"
)
# Detecting last checkpoint.
last_checkpoint = None
- if os.path.isdir(
- training_args.output_dir
- ) and training_args.do_train and not training_args.overwrite_output_dir:
+ if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
logger.info(
@@ -540,41 +476,35 @@ def main():
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
- config_class, model_class, criterion_class = MODEL_CLASSES[
- model_args.model_type]
+ config_class, model_class, criterion_class = MODEL_CLASSES[model_args.model_type]
- tokenizer = AutoTokenizer.from_pretrained(
- model_args.tokenizer_name_or_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path)
config = config_class.from_pretrained(model_args.model_name_or_path)
config.seq_length = data_args.max_seq_length
# Some techniques extend the RotaryEmbedding context, so don't change max_position_embeddings
if not model_args.continue_training:
- config.max_position_embeddings = max(config.max_position_embeddings,
- data_args.max_seq_length)
+ config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length)
if not model_args.continue_training:
- config.vocab_size = max(config.vocab_size,
- ((tokenizer.vocab_size - 1) // 128 + 1) * 128)
- logger.info(
- f"Reset vocab size to {config.vocab_size} for batter amp peformance."
- )
+ config.vocab_size = max(config.vocab_size, ((tokenizer.vocab_size - 1) // 128 + 1) * 128)
+ logger.info(f"Reset vocab size to {config.vocab_size} for better amp performance.")
if model_args.no_recompute_layers is not None:
model_args.no_recompute_layers.sort()
config.vocab_size = model_args.vocab_size if model_args.vocab_size is not None else config.vocab_size
config.hidden_size = model_args.hidden_size if model_args.hidden_size is not None else config.hidden_size
- config.intermediate_size = (model_args.intermediate_size
- if model_args.intermediate_size is not None
- else config.intermediate_size)
- config.num_hidden_layers = (model_args.num_hidden_layers
- if model_args.num_hidden_layers is not None
- else config.num_hidden_layers)
- config.num_attention_heads = (model_args.num_attention_heads
- if model_args.num_attention_heads is not None
- else config.num_attention_heads)
+ config.intermediate_size = (
+ model_args.intermediate_size if model_args.intermediate_size is not None else config.intermediate_size
+ )
+ config.num_hidden_layers = (
+ model_args.num_hidden_layers if model_args.num_hidden_layers is not None else config.num_hidden_layers
+ )
+ config.num_attention_heads = (
+ model_args.num_attention_heads if model_args.num_attention_heads is not None else config.num_attention_heads
+ )
config.use_flash_attention = model_args.use_flash_attention
config.use_fused_rms_norm = model_args.use_fused_rms_norm
@@ -615,10 +545,7 @@ def main():
if training_args.recompute:
def fn(layer):
- if hasattr(
- layer,
- "enable_recompute") and (layer.enable_recompute is False
- or layer.enable_recompute == 0):
+ if hasattr(layer, "enable_recompute") and (layer.enable_recompute is False or layer.enable_recompute == 0):
layer.enable_recompute = True
model.apply(fn)
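
Review note: the arithmetic in the hunks above (the dataset target sizes in `create_pretrained_dataset` and the vocabulary padding in `main()`) is easy to misread once it is reflowed across lines. The following is a minimal sketch that restates both expressions as standalone functions. The function names `pad_vocab_size` and `target_sample_counts` are hypothetical and do not exist in the repository; only the formulas are taken from the diff.

```python
from typing import List


def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the next multiple of `multiple`.

    Mirrors ((tokenizer.vocab_size - 1) // 128 + 1) * 128 from the diff.
    """
    return ((vocab_size - 1) // multiple + 1) * multiple


def target_sample_counts(
    train_batch_size: int,
    eval_batch_size: int,
    data_parallel_degree: int,
    max_steps: int,
    gradient_accumulation_steps: int,
    eval_iters: int,
    eval_steps: int,
    test_iters: int,
) -> List[int]:
    """Return [train, validation, test] minimum sample counts.

    Mirrors the three products in train_val_test_num_samples.
    """
    train = train_batch_size * data_parallel_degree * max_steps * gradient_accumulation_steps
    # Validation runs once every eval_steps, plus one final pass.
    valid = eval_batch_size * data_parallel_degree * eval_iters * (max_steps // eval_steps + 1)
    test = eval_batch_size * data_parallel_degree * test_iters
    return [train, valid, test]


if __name__ == "__main__":
    print(pad_vocab_size(32001))
    print(target_sample_counts(2, 2, 1, 10000, 8, 10, 1000, 10))
```

Note that the script only applies the padded value when it exceeds the existing `config.vocab_size` (via `max`), and only when `continue_training` is off; the sketch leaves that guard out.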
diff --git a/llm/gpt-3/auto_parallel/run_pretrain_auto_dp2mp2pp2.sh b/llm/auto_parallel/gpt-3/run_pretrain_auto_dp2mp2pp2.sh
similarity index 72%
rename from llm/gpt-3/auto_parallel/run_pretrain_auto_dp2mp2pp2.sh
rename to llm/auto_parallel/gpt-3/run_pretrain_auto_dp2mp2pp2.sh
index 9219cd27e3a3..71578bb81532 100755
--- a/llm/gpt-3/auto_parallel/run_pretrain_auto_dp2mp2pp2.sh
+++ b/llm/auto_parallel/gpt-3/run_pretrain_auto_dp2mp2pp2.sh
@@ -1,3 +1,17 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
export PYTHONPATH="../../../":$PYTHONPATH
export FLAGS_cudnn_deterministic=1
export FLAGS_embedding_deterministic=1
diff --git a/llm/llama/auto_parallel/README.md b/llm/auto_parallel/llama/README.md
similarity index 100%
rename from llm/llama/auto_parallel/README.md
rename to llm/auto_parallel/llama/README.md
diff --git a/llm/llama/auto_parallel/run_llama3.sh b/llm/auto_parallel/llama/run_llama3.sh
similarity index 100%
rename from llm/llama/auto_parallel/run_llama3.sh
rename to llm/auto_parallel/llama/run_llama3.sh
diff --git a/llm/llama/auto_parallel/run_pretrain_auto.py b/llm/auto_parallel/llama/run_pretrain_auto.py
similarity index 100%
rename from llm/llama/auto_parallel/run_pretrain_auto.py
rename to llm/auto_parallel/llama/run_pretrain_auto.py
diff --git a/llm/llama/auto_parallel/run_pretrain_auto.sh b/llm/auto_parallel/llama/run_pretrain_auto.sh
similarity index 100%
rename from llm/llama/auto_parallel/run_pretrain_auto.sh
rename to llm/auto_parallel/llama/run_pretrain_auto.sh
diff --git a/llm/llama/auto_parallel/run_pretrain_auto_static.py b/llm/auto_parallel/llama/run_pretrain_auto_static.py
similarity index 100%
rename from llm/llama/auto_parallel/run_pretrain_auto_static.py
rename to llm/auto_parallel/llama/run_pretrain_auto_static.py
diff --git a/llm/llama/auto_parallel/run_pretrain_auto_static.sh b/llm/auto_parallel/llama/run_pretrain_auto_static.sh
similarity index 100%
rename from llm/llama/auto_parallel/run_pretrain_auto_static.sh
rename to llm/auto_parallel/llama/run_pretrain_auto_static.sh
diff --git a/llm/llama/auto_parallel/run_pretrain_auto_static_sp.sh b/llm/auto_parallel/llama/run_pretrain_auto_static_sp.sh
similarity index 100%
rename from llm/llama/auto_parallel/run_pretrain_auto_static_sp.sh
rename to llm/auto_parallel/llama/run_pretrain_auto_static_sp.sh
diff --git a/llm/llama/auto_parallel/run_pretrain_hand.py b/llm/auto_parallel/llama/run_pretrain_hand.py
similarity index 100%
rename from llm/llama/auto_parallel/run_pretrain_hand.py
rename to llm/auto_parallel/llama/run_pretrain_hand.py
diff --git a/llm/llama/auto_parallel/run_pretrain_hand.sh b/llm/auto_parallel/llama/run_pretrain_hand.sh
similarity index 100%
rename from llm/llama/auto_parallel/run_pretrain_hand.sh
rename to llm/auto_parallel/llama/run_pretrain_hand.sh
diff --git a/llm/qwen/auto_parallel/pretrain_argument_auto_dp2tp2pp2.json b/llm/auto_parallel/qwen/pretrain_argument_auto_dp2tp2pp2.json
similarity index 100%
rename from llm/qwen/auto_parallel/pretrain_argument_auto_dp2tp2pp2.json
rename to llm/auto_parallel/qwen/pretrain_argument_auto_dp2tp2pp2.json
diff --git a/llm/qwen/auto_parallel/run_pretrain_3D_auto.py b/llm/auto_parallel/qwen/run_pretrain_3D_auto.py
similarity index 100%
rename from llm/qwen/auto_parallel/run_pretrain_3D_auto.py
rename to llm/auto_parallel/qwen/run_pretrain_3D_auto.py
diff --git a/llm/qwen/auto_parallel/run_pretrain_3D_auto.sh b/llm/auto_parallel/qwen/run_pretrain_3D_auto.sh
similarity index 100%
rename from llm/qwen/auto_parallel/run_pretrain_3D_auto.sh
rename to llm/auto_parallel/qwen/run_pretrain_3D_auto.sh
diff --git a/llm/baichuan/pretrain-baichuan2_13b-sd8_stage2.json b/llm/baichuan/pretrain-baichuan2_13b-sd8_stage2.json
deleted file mode 100644
index 51d55556a9c1..000000000000
--- a/llm/baichuan/pretrain-baichuan2_13b-sd8_stage2.json
+++ /dev/null
@@ -1,40 +0,0 @@
-{
- "model_name_or_path": "baichuan-inc/Baichuan2-13B-Base",
- "tokenizer_name_or_path": "baichuan-inc/Baichuan2-13B-Base",
- "input_dir": "./data",
- "output_dir": "./checkpoints/baichuan_pretrain_ckpts",
- "per_device_train_batch_size": 1,
- "gradient_accumulation_steps": 8,
- "per_device_eval_batch_size": 2,
- "tensor_parallel_degree": 1,
- "pipeline_parallel_degree": 1,
- "sharding": "stage2",
- "virtual_pp_degree": 1,
- "sequence_parallel": 0,
- "use_flash_attention": true,
- "use_fused_rms_norm": true,
- "use_fused_rope": true,
- "max_seq_length": 4096,
- "learning_rate": 3e-05,
- "min_learning_rate": 3e-06,
- "warmup_steps": 1000,
- "logging_steps": 1,
- "max_steps": 10000,
- "save_steps": 5000,
- "eval_steps": 1000,
- "weight_decay": 0.01,
- "bf16": true,
- "fp16_opt_level": "O2",
- "warmup_ratio": 0.01,
- "max_grad_norm": 1.0,
- "dataloader_num_workers": 1,
- "continue_training": 1,
- "do_train": true,
- "do_eval": true,
- "do_predict": true,
- "disable_tqdm": true,
- "recompute": true,
- "distributed_dataloader": 1,
- "recompute_granularity": "full",
- "save_total_limit": 2
- }
diff --git a/llm/benchmark.sh b/llm/benchmark.sh
deleted file mode 100644
index d49858b42b76..000000000000
--- a/llm/benchmark.sh
+++ /dev/null
@@ -1,36 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-export PYTHONPATH=$(dirname $(pwd)):$PYTHONPATH
-
-export FLAGS_control_flow_use_new_executor=1
-export FLAGS_new_executor_serial_run=1
-export FLAGS_allocator_strategy=naive_best_fit
-export FLAGS_fraction_of_gpu_memory_to_use=0.92
-
-export FLAGS_use_autotune=1
-export FLAGS_cublaslt_exhaustive_search_times=10
-export FLAGS_cache_inference_while_scope=1
-
-
-python predictor.py \
- --model_name_or_path ./llama7b-inference_model_fp16 \
- --dtype float16 \
- --src_length 300 \
- --max_length 100 \
- --output_file "infer.json" \
- --mode "static" \
- --batch_size 1 \
- --benchmark \
- --inference_model
diff --git a/llm/config/baichuan/README.md b/llm/config/baichuan/README.md
new file mode 100644
index 000000000000..98bf760a6caa
--- /dev/null
+++ b/llm/config/baichuan/README.md
@@ -0,0 +1,15 @@
+# Baichuan
+
+## 1. Model Introduction
+
+**Supported model weights:**
+
+| Model |
+| ---------------------------------|
+| baichuan-inc/Baichuan-7B |
+| baichuan-inc/Baichuan-13B-Base |
+| baichuan-inc/Baichuan-13B-Chat |
+| baichuan-inc/Baichuan2-7B-Base |
+| baichuan-inc/Baichuan2-7B-Chat |
+| baichuan-inc/Baichuan2-13B-Base |
+| baichuan-inc/Baichuan2-13B-Chat |
diff --git a/llm/config/baichuan/awq_argument.json b/llm/config/baichuan/awq_argument.json
new file mode 100644
index 000000000000..23c1884ed768
--- /dev/null
+++ b/llm/config/baichuan/awq_argument.json
@@ -0,0 +1,23 @@
+{
+ "model_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
+ "per_device_train_batch_size": 8,
+ "per_device_eval_batch_size": 8,
+ "eval_accumulation_steps":16,
+ "src_length": 1024,
+ "max_length": 2048,
+ "fp16": true,
+ "fp16_opt_level": "O2",
+ "dataset_name_or_path": "./data",
+ "output_dir": "./checkpoints/ptq_ckpts",
+ "do_eval": true,
+ "eval_with_do_generation": false,
+ "do_ptq": true,
+ "quant_type": "weight_only_int4",
+ "weight_quant_method": "groupwise",
+ "ptq_step": 16,
+ "smooth": true,
+ "auto_clip": true,
+ "autoclip_step": 1,
+ "do_awq": true,
+ "unified_checkpoint": true
+ }
\ No newline at end of file
diff --git a/llm/config/baichuan/dpo_argument.json b/llm/config/baichuan/dpo_argument.json
new file mode 100644
index 000000000000..376caef0eda7
--- /dev/null
+++ b/llm/config/baichuan/dpo_argument.json
@@ -0,0 +1,38 @@
+{
+ "model_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
+ "train_dataset_path": "./data/train.jsonl",
+ "dev_dataset_path": "./data/dev.jsonl",
+ "output_dir": "./checkpoints/dpo_ckpts",
+ "per_device_train_batch_size": 1,
+ "gradient_accumulation_steps": 8,
+ "per_device_eval_batch_size": 1,
+ "num_train_epochs": 1,
+ "max_steps": 100,
+ "learning_rate": 1e-06,
+ "warmup_steps": 10,
+ "logging_steps": 1,
+ "evaluation_strategy": "steps",
+ "save_strategy": "steps",
+ "eval_steps": 100,
+ "save_steps": 500,
+ "max_seq_len": 4096,
+ "max_prompt_len": 2048,
+ "bf16": true,
+ "fp16_opt_level": "O2",
+ "do_train": true,
+ "do_eval": true,
+ "disable_tqdm": true,
+ "load_best_model_at_end": true,
+ "tensor_parallel_degree": 8,
+ "sharding_parallel_degree": 1,
+ "sharding": "stage1",
+ "use_flash_attention": true,
+ "recompute": false,
+ "recompute_granularity": "full",
+ "dpo_beta": 0.1,
+ "benchmark": false,
+ "dpo_loss_type": "sigmoid",
+ "dpo_label_smoothing": 0.0,
+ "unified_checkpoint": true,
+ "autotuner_benchmark":false
+ }
diff --git a/llm/chatglm2/gptq_argument.json b/llm/config/baichuan/gptq_argument.json
similarity index 71%
rename from llm/chatglm2/gptq_argument.json
rename to llm/config/baichuan/gptq_argument.json
index 9285e8b628ad..593773a268e2 100644
--- a/llm/chatglm2/gptq_argument.json
+++ b/llm/config/baichuan/gptq_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "./checkpoints/chatglm2_sft_ckpts",
+ "model_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
@@ -8,9 +8,10 @@
"fp16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm2_gptq_ckpts",
+ "output_dir": "./checkpoints/gptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_gptq": true,
+ "unified_checkpoint": true,
"gptq_step": 8
}
\ No newline at end of file
diff --git a/llm/config/baichuan/lora_argument.json b/llm/config/baichuan/lora_argument.json
new file mode 100644
index 000000000000..8d2702551f4b
--- /dev/null
+++ b/llm/config/baichuan/lora_argument.json
@@ -0,0 +1,35 @@
+{
+ "model_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
+ "dataset_name_or_path": "./data",
+ "output_dir": "./checkpoints/lora_ckpts",
+ "per_device_train_batch_size": 4,
+ "gradient_accumulation_steps": 4,
+ "per_device_eval_batch_size": 8,
+ "eval_accumulation_steps":16,
+ "num_train_epochs": 3,
+ "learning_rate": 3e-04,
+ "warmup_steps": 30,
+ "logging_steps": 1,
+ "evaluation_strategy": "epoch",
+ "save_strategy": "epoch",
+ "src_length": 1024,
+ "max_length": 2048,
+ "fp16": true,
+ "fp16_opt_level": "O2",
+ "do_train": true,
+ "do_eval": true,
+ "disable_tqdm": true,
+ "load_best_model_at_end": true,
+ "eval_with_do_generation": false,
+ "metric_for_best_model": "accuracy",
+ "recompute": true,
+ "save_total_limit": 1,
+ "tensor_parallel_degree": 1,
+ "pipeline_parallel_degree": 1,
+ "sharding_parallel_degree": 1,
+ "sharding": "stage1",
+ "lora": true,
+ "zero_padding": false,
+ "unified_checkpoint": true,
+ "use_flash_attention": true
+ }
diff --git a/llm/baichuan/pretrain-baichuan2_7b-tp2sd4_stage2.json b/llm/config/baichuan/pretrain_argument.json
similarity index 90%
rename from llm/baichuan/pretrain-baichuan2_7b-tp2sd4_stage2.json
rename to llm/config/baichuan/pretrain_argument.json
index da31682d6949..aeb17cf475a4 100644
--- a/llm/baichuan/pretrain-baichuan2_7b-tp2sd4_stage2.json
+++ b/llm/config/baichuan/pretrain_argument.json
@@ -2,12 +2,13 @@
"model_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
"tokenizer_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
"input_dir": "./data",
- "output_dir": "./checkpoints/baichuan_pretrain_ckpts",
+ "output_dir": "./checkpoints/pretrain_ckpts",
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 8,
"per_device_eval_batch_size": 2,
"tensor_parallel_degree": 2,
"pipeline_parallel_degree": 1,
+ "sharding_parallel_degree": 4,
"sharding": "stage2",
"virtual_pp_degree": 1,
"sequence_parallel": 0,
@@ -36,5 +37,6 @@
"recompute": false,
"distributed_dataloader": 1,
"recompute_granularity": "full",
+ "unified_checkpoint": true,
"save_total_limit": 2
}
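
Review note: the renamed config drops the `tp2sd4` hint from its filename and instead states `"sharding_parallel_degree": 4` explicitly next to `"tensor_parallel_degree": 2`. A rough sanity check for such a config is sketched below. It assumes the device count factorizes as data × tensor × pipeline × sharding parallel degrees; that factorization and the helper name `implied_data_parallel_degree` are assumptions made for illustration, not something this diff confirms about the trainer.

```python
import json


def implied_data_parallel_degree(config: dict, world_size: int) -> int:
    """Return the data-parallel degree left over after tp, pp and sharding.

    Assumption: world_size == dp * tp * pp * sharding. Missing keys default to 1.
    """
    tp = int(config.get("tensor_parallel_degree", 1))
    pp = int(config.get("pipeline_parallel_degree", 1))
    sd = int(config.get("sharding_parallel_degree", 1))
    group = tp * pp * sd
    if world_size % group != 0:
        raise ValueError(f"world_size={world_size} is not divisible by tp*pp*sharding={group}")
    return world_size // group


# Values copied from the pretrain_argument.json hunk above.
config = json.loads(
    '{"tensor_parallel_degree": 2, "pipeline_parallel_degree": 1, "sharding_parallel_degree": 4}'
)

if __name__ == "__main__":
    print(implied_data_parallel_degree(config, world_size=8))
```

Under that assumption the config fills exactly 8 devices with no extra data parallelism, which matches the `tp2sd4` naming of the original file.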
diff --git a/llm/config/baichuan/ptq_argument.json b/llm/config/baichuan/ptq_argument.json
new file mode 100644
index 000000000000..f15164f44eef
--- /dev/null
+++ b/llm/config/baichuan/ptq_argument.json
@@ -0,0 +1,23 @@
+{
+ "model_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
+ "per_device_train_batch_size": 8,
+ "per_device_eval_batch_size": 8,
+ "eval_accumulation_steps":16,
+ "src_length": 1024,
+ "max_length": 2048,
+ "fp16": true,
+ "fp16_opt_level": "O2",
+ "dataset_name_or_path": "./data",
+ "output_dir": "./checkpoints/ptq_ckpts",
+ "do_eval": true,
+ "eval_with_do_generation": false,
+ "do_ptq": true,
+ "ptq_step": 16,
+ "unified_checkpoint": true,
+ "smooth": true,
+ "smooth_step": 16,
+ "smooth_all_linears": true,
+ "smooth_piecewise_search": true,
+ "smooth_k_piece": 3,
+ "smooth_search_piece": true
+}
\ No newline at end of file
diff --git a/llm/config/baichuan/qlora_argument.json b/llm/config/baichuan/qlora_argument.json
new file mode 100644
index 000000000000..c820bcff63df
--- /dev/null
+++ b/llm/config/baichuan/qlora_argument.json
@@ -0,0 +1,34 @@
+{
+ "model_name_or_path": "baichuan-inc/Baichuan2-7B-Base",
+ "dataset_name_or_path": "./data",
+ "output_dir": "./checkpoints/qlora_ckpts",
+ "per_device_train_batch_size": 4,
+ "gradient_accumulation_steps": 4,
+ "per_device_eval_batch_size": 8,
+ "eval_accumulation_steps":16,
+ "num_train_epochs": 3,
+ "learning_rate": 3e-04,
+ "warmup_steps": 30,
+ "logging_steps": 1,
+ "evaluation_strategy": "epoch",
+ "save_strategy": "epoch",
+ "src_length": 1024,
+ "max_length": 2048,
+ "fp16": true,
+ "fp16_opt_level": "O2",
+ "do_train": true,
+ "do_eval": true,
+ "disable_tqdm": true,
+ "load_best_model_at_end": true,
+ "eval_with_do_generation": false,
+ "metric_for_best_model": "accuracy",
+ "recompute": true,
+ "save_total_limit": 1,
+ "tensor_parallel_degree": 1,
+ "pipeline_parallel_degree": 1,
+ "lora": true,
+ "zero_padding": false,
+ "use_flash_attention": true,
+ "unified_checkpoint": true,
+ "weight_quantize_algo": "nf4"
+ }
\ No newline at end of file
diff --git a/llm/bloom/README.md b/llm/config/bloom/README.md
similarity index 92%
rename from llm/bloom/README.md
rename to llm/config/bloom/README.md
index 2cdeafa66968..52311561818a 100644
--- a/llm/bloom/README.md
+++ b/llm/config/bloom/README.md
@@ -20,6 +20,3 @@ BLOOM is an autoregressive large language model (LLM), trained on large amounts of text data
| bigscience/bloomz-7b1-p3 |
| bigscience/bloomz-7b1 |
| bellegroup/belle-7b-2m |
-
-## 2. Model Fine-tuning
-Please refer to the [LLM full-workflow tool introduction](../README.md)
diff --git a/llm/llama/gptq_argument.json b/llm/config/bloom/gptq_argument.json
similarity index 72%
rename from llm/llama/gptq_argument.json
rename to llm/config/bloom/gptq_argument.json
index 75944f076c29..615286908be0 100644
--- a/llm/llama/gptq_argument.json
+++ b/llm/config/bloom/gptq_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "./checkpoints/llama_sft_ckpts",
+ "model_name_or_path": "bigscience/bloomz-7b1-mt",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
@@ -8,9 +8,10 @@
"fp16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/llama_gptq_ckpts",
+ "output_dir": "./checkpoints/gptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_gptq": true,
+ "unified_checkpoint": true,
"gptq_step": 8
}
\ No newline at end of file
diff --git a/llm/bloom/lora_argument.json b/llm/config/bloom/lora_argument.json
similarity index 91%
rename from llm/bloom/lora_argument.json
rename to llm/config/bloom/lora_argument.json
index 6867ecaeedf2..d36d821a35ce 100644
--- a/llm/bloom/lora_argument.json
+++ b/llm/config/bloom/lora_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "bigscience/bloomz-7b1-mt",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/bloom_lora_ckpts",
+ "output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -28,5 +28,6 @@
"pipeline_parallel_degree": 1,
"lora": true,
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/bloom/pt_argument.json b/llm/config/bloom/pt_argument.json
similarity index 92%
rename from llm/bloom/pt_argument.json
rename to llm/config/bloom/pt_argument.json
index 30d6839369cc..44801b6eb623 100644
--- a/llm/bloom/pt_argument.json
+++ b/llm/config/bloom/pt_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "bigscience/bloomz-7b1-mt",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/bloom_pt_ckpts",
+ "output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -28,5 +28,6 @@
"pipeline_parallel_degree": 1,
"prefix_tuning": true,
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/chatglm2/ptq_argument.json b/llm/config/bloom/ptq_argument.json
similarity index 79%
rename from llm/chatglm2/ptq_argument.json
rename to llm/config/bloom/ptq_argument.json
index 46a57083584a..fff4560700e7 100644
--- a/llm/chatglm2/ptq_argument.json
+++ b/llm/config/bloom/ptq_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "./checkpoints/chatglm2_sft_ckpts",
+ "model_name_or_path": "bigscience/bloomz-7b1-mt",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
@@ -8,7 +8,7 @@
"fp16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm2_ptq_ckpts",
+ "output_dir": "./checkpoints/ptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_ptq": true,
@@ -18,5 +18,6 @@
"smooth_all_linears": true,
"smooth_piecewise_search": true,
"smooth_k_piece": 3,
+ "unified_checkpoint": true,
"smooth_search_piece": true
}
\ No newline at end of file
diff --git a/llm/bloom/sft_argument.json b/llm/config/bloom/sft_argument.json
similarity index 91%
rename from llm/bloom/sft_argument.json
rename to llm/config/bloom/sft_argument.json
index 2c793576b7e0..31b020da30a1 100644
--- a/llm/bloom/sft_argument.json
+++ b/llm/config/bloom/sft_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "bigscience/bloomz-7b1-mt",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/bloom_sft_ckpts",
+ "output_dir": "./checkpoints/sft_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -27,5 +27,6 @@
"tensor_parallel_degree": 4,
"pipeline_parallel_degree": 1,
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/chatglm/README.md b/llm/config/chatglm/README.md
similarity index 92%
rename from llm/chatglm/README.md
rename to llm/config/chatglm/README.md
index 281a7ceea61f..c8cfb4f8b28b 100644
--- a/llm/chatglm/README.md
+++ b/llm/config/chatglm/README.md
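@@ -14,6 +14,3 @@ ChatGLM-6B is an open-source conversational language model that supports Chinese-English bilingual Q&A,
## 2. Model License
Use of the ChatGLM-6B model weights must comply with the [License](../../paddlenlp/transformers/chatglm/LICENSE).
-
-## 3. Model Fine-tuning
-Please refer to the [LLM full-workflow tool introduction](../README.md)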
diff --git a/llm/bloom/gptq_argument.json b/llm/config/chatglm/gptq_argument.json
similarity index 73%
rename from llm/bloom/gptq_argument.json
rename to llm/config/chatglm/gptq_argument.json
index 6a5cb7e882a7..d509f6aed280 100644
--- a/llm/bloom/gptq_argument.json
+++ b/llm/config/chatglm/gptq_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "./checkpoints/bloom_sft_ckpts",
+ "model_name_or_path": "THUDM/chatglm-6b",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
@@ -8,9 +8,10 @@
"fp16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/bloom_gptq_ckpts",
+ "output_dir": "./checkpoints/gptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_gptq": true,
+ "unified_checkpoint": true,
"gptq_step": 8
}
\ No newline at end of file
diff --git a/llm/chatglm/lora_argument.json b/llm/config/chatglm/lora_argument.json
similarity index 91%
rename from llm/chatglm/lora_argument.json
rename to llm/config/chatglm/lora_argument.json
index af49af041d72..11069e723f8f 100644
--- a/llm/chatglm/lora_argument.json
+++ b/llm/config/chatglm/lora_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "THUDM/chatglm-6b",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm_lora_ckpts",
+ "output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -28,5 +28,6 @@
"pipeline_parallel_degree": 1,
"lora": true,
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/chatglm/pt_argument.json b/llm/config/chatglm/pt_argument.json
similarity index 94%
rename from llm/chatglm/pt_argument.json
rename to llm/config/chatglm/pt_argument.json
index 03158f7f127f..54c95fd56744 100644
--- a/llm/chatglm/pt_argument.json
+++ b/llm/config/chatglm/pt_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "THUDM/chatglm-6b",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm_pt_ckpts",
+ "output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
diff --git a/llm/chatglm/ptq_argument.json b/llm/config/chatglm/ptq_argument.json
similarity index 73%
rename from llm/chatglm/ptq_argument.json
rename to llm/config/chatglm/ptq_argument.json
index 63474a9e0a19..64b6e480776b 100644
--- a/llm/chatglm/ptq_argument.json
+++ b/llm/config/chatglm/ptq_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "./checkpoints/llama_sft_ckpts",
+ "model_name_or_path": "THUDM/chatglm-6b",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
@@ -8,9 +8,10 @@
"fp16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/llama_ptq_ckpts",
+ "output_dir": "./checkpoints/ptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_ptq": true,
+ "unified_checkpoint": true,
"ptq_step": 16
}
\ No newline at end of file
diff --git a/llm/chatglm/sft_argument.json b/llm/config/chatglm/sft_argument.json
similarity index 91%
rename from llm/chatglm/sft_argument.json
rename to llm/config/chatglm/sft_argument.json
index 8309f28f1439..73286c3bb5c8 100644
--- a/llm/chatglm/sft_argument.json
+++ b/llm/config/chatglm/sft_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "THUDM/chatglm-6b",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm_sft_ckpts",
+ "output_dir": "./checkpoints/sft_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -27,5 +27,6 @@
"tensor_parallel_degree": 4,
"pipeline_parallel_degree": 1,
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/chatglm2/README.md b/llm/config/chatglm2/README.md
similarity index 91%
rename from llm/chatglm2/README.md
rename to llm/config/chatglm2/README.md
index f04166f5bd50..0929e7b20fac 100644
--- a/llm/chatglm2/README.md
+++ b/llm/config/chatglm2/README.md
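@@ -15,6 +15,3 @@ ChatGLM2-6B is the open-source Chinese-English bilingual conversational model [ChatGLM-6B](https://github.com/TH
Use of the ChatGLM2-6B model weights must comply with the [License](../../paddlenlp/transformers/chatglm_v2/LICENSE).
-
-## 3. Model Fine-tuning
-Please refer to the [LLM full-workflow tool introduction](../README.md)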
diff --git a/llm/chatglm/gptq_argument.json b/llm/config/chatglm2/gptq_argument.json
similarity index 73%
rename from llm/chatglm/gptq_argument.json
rename to llm/config/chatglm2/gptq_argument.json
index 8b1c07742ba8..137f036a0552 100644
--- a/llm/chatglm/gptq_argument.json
+++ b/llm/config/chatglm2/gptq_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "./checkpoints/chatglm_sft_ckpts",
+ "model_name_or_path": "THUDM/chatglm2-6b",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
@@ -8,9 +8,10 @@
"fp16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm_gptq_ckpts",
+ "output_dir": "./checkpoints/gptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_gptq": true,
+ "unified_checkpoint": true,
"gptq_step": 8
}
\ No newline at end of file
diff --git a/llm/chatglm2/lora_argument.json b/llm/config/chatglm2/lora_argument.json
similarity index 91%
rename from llm/chatglm2/lora_argument.json
rename to llm/config/chatglm2/lora_argument.json
index c88636b9bd1d..6e734fc1f2a8 100644
--- a/llm/chatglm2/lora_argument.json
+++ b/llm/config/chatglm2/lora_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "THUDM/chatglm2-6b",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm2_lora_ckpts",
+ "output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -28,5 +28,6 @@
"pipeline_parallel_degree": 1,
"lora": true,
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/chatglm2/pt_argument.json b/llm/config/chatglm2/pt_argument.json
similarity index 94%
rename from llm/chatglm2/pt_argument.json
rename to llm/config/chatglm2/pt_argument.json
index a10f9b4d788c..52a80b837686 100644
--- a/llm/chatglm2/pt_argument.json
+++ b/llm/config/chatglm2/pt_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "THUDM/chatglm2-6b",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm2_pt_ckpts",
+ "output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
diff --git a/llm/bloom/ptq_argument.json b/llm/config/chatglm2/ptq_argument.json
similarity index 79%
rename from llm/bloom/ptq_argument.json
rename to llm/config/chatglm2/ptq_argument.json
index 21a28735ecc1..806c80a3cf63 100644
--- a/llm/bloom/ptq_argument.json
+++ b/llm/config/chatglm2/ptq_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "./checkpoints/bloom_sft_ckpts",
+ "model_name_or_path": "./checkpoints/sft_ckpts",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
@@ -8,11 +8,12 @@
"fp16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/bloom_ptq_ckpts",
+ "output_dir": "./checkpoints/ptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_ptq": true,
"ptq_step": 16,
+ "unified_checkpoint": true,
"smooth": true,
"smooth_step": 16,
"smooth_all_linears": true,
diff --git a/llm/chatglm2/sft_argument.json b/llm/config/chatglm2/sft_argument.json
similarity index 85%
rename from llm/chatglm2/sft_argument.json
rename to llm/config/chatglm2/sft_argument.json
index 8508d9676379..ee2ffb4ee7ae 100644
--- a/llm/chatglm2/sft_argument.json
+++ b/llm/config/chatglm2/sft_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "THUDM/chatglm2-6b",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm2_sft_ckpts",
+ "output_dir": "./checkpoints/sft_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -24,8 +24,9 @@
"metric_for_best_model": "accuracy",
"recompute": true,
"save_total_limit": 1,
- "sharding_parallel_degree": 4,
- "sharding": "stage3",
+ "sharding_parallel_degree": 8,
+ "sharding": "stage2",
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/gemma/README.md b/llm/config/gemma/README.md
similarity index 100%
rename from llm/gemma/README.md
rename to llm/config/gemma/README.md
diff --git a/llm/gemma/sft_argument.json b/llm/config/gemma/sft_argument.json
similarity index 71%
rename from llm/gemma/sft_argument.json
rename to llm/config/gemma/sft_argument.json
index 45a483d7e52a..15d9c3b93807 100644
--- a/llm/gemma/sft_argument.json
+++ b/llm/config/gemma/sft_argument.json
@@ -1,7 +1,7 @@
{
- "model_name_or_path": "google/gemma-2b/",
+ "model_name_or_path": "google/gemma-2b",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/gemma_sft_ckpts",
+ "output_dir": "./checkpoints/sft_ckpts",
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 1,
"per_device_eval_batch_size": 8,
@@ -24,7 +24,11 @@
"metric_for_best_model": "accuracy",
"recompute": true,
"save_total_limit": 1,
- "tensor_parallel_degree": 2,
+ "tensor_parallel_degree": 1,
+ "pipeline_parallel_degree": 1,
+ "sharding_parallel_degree": 8,
+ "sharding": "stage2",
"zero_padding": false,
- "use_flash_attention": false
+ "unified_checkpoint": true,
+ "use_flash_attention": true
}
\ No newline at end of file
diff --git a/llm/config/gpt-3/README.md b/llm/config/gpt-3/README.md
new file mode 100644
index 000000000000..472c2f74cd42
--- /dev/null
+++ b/llm/config/gpt-3/README.md
@@ -0,0 +1,5 @@
+# GPT
+
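+## 1. Model Introduction
+
+GPT-3 is a pretrained language model that can produce human-like language. With 175 billion parameters, it has strong language understanding and generation capabilities. It can perform tasks such as text generation, text summarization, question answering, translation, and reading comprehension. GPT-3 was pretrained on a large corpus that includes a large amount of text from the internet, from which it learns to generate and understand human language. GPT-3 has been highly influential in natural language processing; its ability to hold human-like dialogue and generate text has led to wide use in applications such as intelligent customer service, natural language processing, and game design.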
diff --git a/llm/llama/lora_argument.json b/llm/config/gpt-3/lora_argument.json
similarity index 86%
rename from llm/llama/lora_argument.json
rename to llm/config/gpt-3/lora_argument.json
index 6817215e0c74..1ed0576d951b 100644
--- a/llm/llama/lora_argument.json
+++ b/llm/config/gpt-3/lora_argument.json
@@ -1,7 +1,7 @@
{
- "model_name_or_path": "facebook/llama-7b",
+ "model_name_or_path": "gpt2-medium-en",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/llama_lora_ckpts",
+ "output_dir": "./checkpoints/gpt_lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -28,5 +28,6 @@
"pipeline_parallel_degree": 1,
"lora": true,
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
diff --git a/llm/gpt-3/pretrain-gpt_medium_en-stage2.json b/llm/config/gpt-3/pretrain_argument.json
similarity index 97%
rename from llm/gpt-3/pretrain-gpt_medium_en-stage2.json
rename to llm/config/gpt-3/pretrain_argument.json
index 3d7685a9696d..3959956bd21d 100644
--- a/llm/gpt-3/pretrain-gpt_medium_en-stage2.json
+++ b/llm/config/gpt-3/pretrain_argument.json
@@ -33,6 +33,7 @@
"disable_tqdm": true,
"recompute": false,
"distributed_dataloader": 1,
+ "unified_checkpoint": true,
"recompute_granularity": "full",
"save_total_limit": 2
}
diff --git a/llm/config/gpt-3/sft_argument.json b/llm/config/gpt-3/sft_argument.json
new file mode 100644
index 000000000000..76d50ec28628
--- /dev/null
+++ b/llm/config/gpt-3/sft_argument.json
@@ -0,0 +1,33 @@
+{
+ "model_name_or_path": "gpt2-medium-en",
+ "dataset_name_or_path": "./data",
+ "output_dir": "./checkpoints/sft_ckpts",
+ "per_device_train_batch_size": 4,
+ "gradient_accumulation_steps": 4,
+ "per_device_eval_batch_size": 8,
+ "eval_accumulation_steps":16,
+ "num_train_epochs": 3,
+ "learning_rate": 3e-05,
+ "warmup_steps": 30,
+ "logging_steps": 1,
+ "evaluation_strategy": "epoch",
+ "save_strategy": "epoch",
+ "src_length": 1024,
+ "max_length": 2048,
+ "fp16": true,
+ "fp16_opt_level": "O2",
+ "do_train": true,
+ "do_eval": true,
+ "disable_tqdm": true,
+ "load_best_model_at_end": true,
+ "eval_with_do_generation": false,
+ "metric_for_best_model": "accuracy",
+ "recompute": true,
+ "save_total_limit": 1,
+ "tensor_parallel_degree": 1,
+ "pipeline_parallel_degree": 1,
+ "lora": true,
+ "zero_padding": false,
+ "unified_checkpoint": true,
+ "use_flash_attention": false
+ }
diff --git a/llm/llama/README.md b/llm/config/llama/README.md
similarity index 92%
rename from llm/llama/README.md
rename to llm/config/llama/README.md
index c707c0cd64ac..bda1959533d7 100644
--- a/llm/llama/README.md
+++ b/llm/config/llama/README.md
@@ -16,6 +16,10 @@
| meta-llama/Llama-2-13b-chat |
| meta-llama/Llama-2-70b |
| meta-llama/Llama-2-70b-chat |
+|meta-llama/Meta-Llama-3-8B|
+|meta-llama/Meta-Llama-3-8B-Instruct|
+|meta-llama/Meta-Llama-3-70B|
+|meta-llama/Meta-Llama-3-70B-Instruct|
| ziqingyang/chinese-llama-7b |
| ziqingyang/chinese-llama-13b |
| ziqingyang/chinese-alpaca-7b |
@@ -48,11 +52,3 @@ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat")
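Use of the LLaMA model weights is subject to the [License](../../paddlenlp/transformers/llama/LICENSE).
Use of the Llama2 model weights is subject to the [License](../../paddlenlp/transformers/llama/Llama2.LICENSE).
-
-
-## 3. Pretraining
-
-See the [LLM end-to-end toolkit overview](../README.md).
-
-## 4. Fine-tuning
-See the [LLM end-to-end toolkit overview](../README.md).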
diff --git a/llm/llama/awq_argument.json b/llm/config/llama/awq_argument.json
similarity index 76%
rename from llm/llama/awq_argument.json
rename to llm/config/llama/awq_argument.json
index 21a9bcdb13b3..7ae7f55b678c 100644
--- a/llm/llama/awq_argument.json
+++ b/llm/config/llama/awq_argument.json
@@ -1,14 +1,14 @@
{
- "model_name_or_path": "./checkpoints/llama_sft_ckpts",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
"src_length": 1024,
"max_length": 2048,
- "fp16": true,
+ "bf16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/llama_ptq_ckpts",
+ "output_dir": "./checkpoints/ptq_ckpts",
"do_eval": true,
"eval_with_do_generation": false,
"do_ptq": true,
@@ -18,5 +18,6 @@
"smooth": true,
"auto_clip": true,
"autoclip_step": 1,
+ "unified_checkpoint": true,
"do_awq": true
}
\ No newline at end of file
diff --git a/llm/llama/dpo_argument.json b/llm/config/llama/dpo_argument.json
similarity index 92%
rename from llm/llama/dpo_argument.json
rename to llm/config/llama/dpo_argument.json
index 7aa86b342128..b30fcc86478c 100644
--- a/llm/llama/dpo_argument.json
+++ b/llm/config/llama/dpo_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "meta-llama/Llama-2-7b-chat",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"train_dataset_path": "./data/train.jsonl",
"dev_dataset_path": "./data/dev.jsonl",
"output_dir": "./checkpoints/dpo_ckpts",
@@ -34,5 +34,6 @@
"benchmark": false,
"dpo_loss_type": "sigmoid",
"dpo_label_smoothing": 0.0,
+ "unified_checkpoint": true,
"autotuner_benchmark":false
}
diff --git a/llm/config/llama/gptq_argument.json b/llm/config/llama/gptq_argument.json
new file mode 100644
index 000000000000..bbc2ac60d5a7
--- /dev/null
+++ b/llm/config/llama/gptq_argument.json
@@ -0,0 +1,17 @@
+{
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
+ "per_device_train_batch_size": 8,
+ "per_device_eval_batch_size": 8,
+ "eval_accumulation_steps":16,
+ "src_length": 1024,
+ "max_length": 2048,
+ "bf16": true,
+ "fp16_opt_level": "O2",
+ "dataset_name_or_path": "./data",
+ "output_dir": "./checkpoints/gptq_ckpts",
+ "do_eval": true,
+ "eval_with_do_generation": false,
+ "do_gptq": true,
+ "unified_checkpoint": true,
+ "gptq_step": 8
+ }
\ No newline at end of file
diff --git a/llm/config/llama/lora_argument.json b/llm/config/llama/lora_argument.json
new file mode 100644
index 000000000000..3b4374529880
--- /dev/null
+++ b/llm/config/llama/lora_argument.json
@@ -0,0 +1,35 @@
+{
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
+ "dataset_name_or_path": "./data",
+ "output_dir": "./checkpoints/lora_ckpts",
+ "per_device_train_batch_size": 4,
+ "gradient_accumulation_steps": 4,
+ "per_device_eval_batch_size": 8,
+ "eval_accumulation_steps":16,
+ "num_train_epochs": 3,
+ "learning_rate": 3e-04,
+ "warmup_steps": 30,
+ "logging_steps": 1,
+ "evaluation_strategy": "epoch",
+ "save_strategy": "epoch",
+ "src_length": 1024,
+ "max_length": 2048,
+ "bf16": true,
+ "fp16_opt_level": "O2",
+ "do_train": true,
+ "do_eval": true,
+ "disable_tqdm": true,
+ "load_best_model_at_end": true,
+ "eval_with_do_generation": false,
+ "metric_for_best_model": "accuracy",
+ "recompute": true,
+ "save_total_limit": 1,
+ "tensor_parallel_degree": 1,
+ "pipeline_parallel_degree": 1,
+ "sharding": "stage1",
+ "lora": true,
+ "zero_padding": false,
+ "use_flash_attention": true,
+ "unified_checkpoint": true,
+ "pissa": false
+ }
diff --git a/llm/config/llama/ppo.json b/llm/config/llama/ppo_argument.json
similarity index 100%
rename from llm/config/llama/ppo.json
rename to llm/config/llama/ppo_argument.json
diff --git a/llm/llama/pretrain-llama2_13b-tp2sd4_stage2.json b/llm/config/llama/pretrain_argument.json
similarity index 83%
rename from llm/llama/pretrain-llama2_13b-tp2sd4_stage2.json
rename to llm/config/llama/pretrain_argument.json
index 3dbfd8c1e12c..dff5b322337e 100644
--- a/llm/llama/pretrain-llama2_13b-tp2sd4_stage2.json
+++ b/llm/config/llama/pretrain_argument.json
@@ -1,8 +1,8 @@
{
- "model_name_or_path": "meta-llama/Llama-2-13b",
- "tokenizer_name_or_path": "meta-llama/Llama-2-13b",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
+ "tokenizer_name_or_path": "meta-llama/Meta-Llama-3-8B",
"input_dir": "./data",
- "output_dir": "./checkpoints/llama2_pretrain_ckpts",
+ "output_dir": "./checkpoints/pretrain_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 16,
"per_device_eval_batch_size": 2,
@@ -36,5 +36,6 @@
"recompute": false,
"distributed_dataloader": 1,
"recompute_granularity": "full",
+ "unified_checkpoint": true,
"save_total_limit": 2
}
diff --git a/llm/qwen/pt_argument.json b/llm/config/llama/pt_argument.json
similarity index 85%
rename from llm/qwen/pt_argument.json
rename to llm/config/llama/pt_argument.json
index 3500215eb3da..66c336cc4b87 100644
--- a/llm/qwen/pt_argument.json
+++ b/llm/config/llama/pt_argument.json
@@ -1,7 +1,7 @@
{
- "model_name_or_path": "qwen/qwen-7b",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/qwen_pt_ckpts",
+ "output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -28,5 +28,5 @@
"pipeline_parallel_degree": 1,
"prefix_tuning": true,
"zero_padding": false,
- "use_flash_attention": false
+ "use_flash_attention": true
}
diff --git a/llm/llama/ptq_argument.json b/llm/config/llama/ptq_argument.json
similarity index 83%
rename from llm/llama/ptq_argument.json
rename to llm/config/llama/ptq_argument.json
index 0a64f3818834..79cc82e8d5d7 100644
--- a/llm/llama/ptq_argument.json
+++ b/llm/config/llama/ptq_argument.json
@@ -1,11 +1,11 @@
{
- "model_name_or_path": "./checkpoints/llama_sft_ckpts",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
"src_length": 1024,
"max_length": 2048,
- "fp16": true,
+ "bf16": true,
"fp16_opt_level": "O2",
"dataset_name_or_path": "./data",
"output_dir": "./checkpoints/llama_ptq_ckpts",
@@ -13,6 +13,7 @@
"eval_with_do_generation": false,
"do_ptq": true,
"ptq_step": 16,
+ "unified_checkpoint": true,
"smooth": true,
"smooth_step": 16,
"smooth_all_linears": true,
diff --git a/llm/llama/qlora_argument.json b/llm/config/llama/qlora_argument.json
similarity index 84%
rename from llm/llama/qlora_argument.json
rename to llm/config/llama/qlora_argument.json
index 38775ac03948..30963715d2af 100644
--- a/llm/llama/qlora_argument.json
+++ b/llm/config/llama/qlora_argument.json
@@ -1,7 +1,7 @@
{
- "model_name_or_path": "facebook/llama-7b",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/llama_lora_ckpts",
+ "output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -14,7 +14,7 @@
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
- "fp16": true,
+ "bf16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
@@ -29,5 +29,6 @@
"lora": true,
"zero_padding": false,
"use_flash_attention": false,
+ "unified_checkpoint": true,
"weight_quantize_algo": "nf4"
}
\ No newline at end of file
diff --git a/llm/config/llama/rm.json b/llm/config/llama/rm_argument.json
similarity index 100%
rename from llm/config/llama/rm.json
rename to llm/config/llama/rm_argument.json
diff --git a/llm/llama/sft_argument.json b/llm/config/llama/sft_argument.json
similarity index 68%
rename from llm/llama/sft_argument.json
rename to llm/config/llama/sft_argument.json
index 34b36a3bc023..9af167187555 100644
--- a/llm/llama/sft_argument.json
+++ b/llm/config/llama/sft_argument.json
@@ -1,9 +1,9 @@
{
- "model_name_or_path": "facebook/llama-7b",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"dataset_name_or_path": "./data",
"output_dir": "./checkpoints/llama_sft_ckpts",
- "per_device_train_batch_size": 4,
- "gradient_accumulation_steps": 4,
+ "per_device_train_batch_size": 1,
+ "gradient_accumulation_steps": 2,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
"num_train_epochs": 3,
@@ -14,7 +14,7 @@
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
- "fp16": true,
+ "bf16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
@@ -22,10 +22,13 @@
"load_best_model_at_end": true,
"eval_with_do_generation": false,
"metric_for_best_model": "accuracy",
- "recompute": true,
+ "recompute": false,
"save_total_limit": 1,
- "tensor_parallel_degree": 4,
+ "tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
+ "pipeline_parallel_config": "disable_p2p_cache_shape",
+ "sharding": "stage2",
"zero_padding": false,
+ "unified_checkpoint": true,
"use_flash_attention": false
}
\ No newline at end of file
diff --git a/llm/llama/wint8_lora_argument.json b/llm/config/llama/wint8_lora_argument.json
similarity index 89%
rename from llm/llama/wint8_lora_argument.json
rename to llm/config/llama/wint8_lora_argument.json
index 97d9f96d6419..fbce73a89e50 100644
--- a/llm/llama/wint8_lora_argument.json
+++ b/llm/config/llama/wint8_lora_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "facebook/llama-7b",
+ "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"dataset_name_or_path": "./data",
"output_dir": "./checkpoints/llama_lora_ckpts",
"per_device_train_batch_size": 4,
@@ -14,7 +14,7 @@
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
- "fp16": true,
+ "bf16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
@@ -29,5 +29,6 @@
"lora": true,
"zero_padding": false,
"use_flash_attention": false,
+ "unified_checkpoint": true,
"weight_quantize_algo": "weight_only_int8"
}
\ No newline at end of file
diff --git a/llm/mixtral/lora_argument.json b/llm/config/mixtral/lora_argument.json
similarity index 88%
rename from llm/mixtral/lora_argument.json
rename to llm/config/mixtral/lora_argument.json
index 507c0f76e798..e70bd58a5eb7 100644
--- a/llm/mixtral/lora_argument.json
+++ b/llm/config/mixtral/lora_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/mixtral_lora_ckpts",
+ "output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -28,5 +28,6 @@
"pipeline_parallel_degree": 1,
"lora": true,
"zero_padding": false,
- "use_flash_attention": false
+ "unified_checkpoint": true,
+ "use_flash_attention": true
}
diff --git a/llm/llama/pretrain-ziya_llama_13b-tp2sd4_stage2.json b/llm/config/mixtral/pretrain_argument.json
similarity index 79%
rename from llm/llama/pretrain-ziya_llama_13b-tp2sd4_stage2.json
rename to llm/config/mixtral/pretrain_argument.json
index bd227877bfd2..efd3823fa988 100644
--- a/llm/llama/pretrain-ziya_llama_13b-tp2sd4_stage2.json
+++ b/llm/config/mixtral/pretrain_argument.json
@@ -1,12 +1,12 @@
{
- "model_name_or_path": "idea-ccnl/ziya-llama-13b-v1",
- "tokenizer_name_or_path": "idea-ccnl/ziya-llama-13b-v1",
+ "model_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
+ "tokenizer_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"input_dir": "./data",
- "output_dir": "./checkpoints/ziya_pretrain_ckpts",
+ "output_dir": "./checkpoints/pretrain_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 16,
"per_device_eval_batch_size": 2,
- "tensor_parallel_degree": 2,
+ "tensor_parallel_degree": 8,
"pipeline_parallel_degree": 1,
"sharding": "stage2",
"virtual_pp_degree": 1,
@@ -36,5 +36,6 @@
"recompute": false,
"distributed_dataloader": 1,
"recompute_granularity": "full",
+ "unified_checkpoint": true,
"save_total_limit": 2
}
diff --git a/llm/mixtral/sft_argument.json b/llm/config/mixtral/sft_argument.json
similarity index 74%
rename from llm/mixtral/sft_argument.json
rename to llm/config/mixtral/sft_argument.json
index 3e778b913ffc..b11bb80380a0 100644
--- a/llm/mixtral/sft_argument.json
+++ b/llm/config/mixtral/sft_argument.json
@@ -1,9 +1,9 @@
{
"model_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/mixtral_sft_ckpts",
- "per_device_train_batch_size": 4,
- "gradient_accumulation_steps": 4,
+ "output_dir": "./checkpoints/sft_ckpts",
+ "per_device_train_batch_size": 1,
+ "gradient_accumulation_steps": 16,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
"num_train_epochs": 3,
@@ -26,5 +26,8 @@
"save_total_limit": 1,
"tensor_parallel_degree": 8,
"sharding": "stage2",
- "pipeline_parallel_degree": 1
+ "pipeline_parallel_degree": 1,
+ "zero_padding": false,
+ "unified_checkpoint": true,
+ "use_flash_attention": true
}
diff --git a/llm/opt/README.md b/llm/config/opt/README.md
similarity index 88%
rename from llm/opt/README.md
rename to llm/config/opt/README.md
index 98b3f140fbfb..3b77d6304b14 100644
--- a/llm/opt/README.md
+++ b/llm/config/opt/README.md
@@ -17,6 +17,3 @@
|facebook/opt-66b |
|facebook/opt-iml-1.3b |
|opt-iml-max-1.3b |
-
-## 2. Fine-tuning
-See the [LLM end-to-end toolkit overview](../README.md).
diff --git a/llm/opt/lora_argument.json b/llm/config/opt/lora_argument.json
similarity index 94%
rename from llm/opt/lora_argument.json
rename to llm/config/opt/lora_argument.json
index 75193e47238d..2ddeb5f2a9f8 100644
--- a/llm/opt/lora_argument.json
+++ b/llm/config/opt/lora_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "facebook/opt-125m",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/opt_lora_ckpts",
+ "output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
diff --git a/llm/opt/sft_argument.json b/llm/config/opt/sft_argument.json
similarity index 94%
rename from llm/opt/sft_argument.json
rename to llm/config/opt/sft_argument.json
index 4eed122fa3cb..2b4f03b842bc 100644
--- a/llm/opt/sft_argument.json
+++ b/llm/config/opt/sft_argument.json
@@ -1,7 +1,7 @@
{
"model_name_or_path": "facebook/opt-125m",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/opt_sft_ckpts",
+ "output_dir": "./checkpoints/sft_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
diff --git a/llm/qwen/README.md b/llm/config/qwen/README.md
similarity index 96%
rename from llm/qwen/README.md
rename to llm/config/qwen/README.md
index 22ac37c19e17..ce32fd88d5b5 100644
--- a/llm/qwen/README.md
+++ b/llm/config/qwen/README.md
@@ -55,7 +55,3 @@
| Qwen/Qwen2-72B-Instruct |
| Qwen/Qwen2-57B-A14B |
| Qwen/Qwen2-57B-A14B-Instruct |
-
-
-## 2. Fine-tuning
-See the [LLM end-to-end toolkit overview](../README.md).
diff --git a/llm/qwen/dpo_argument.json b/llm/config/qwen/dpo_argument.json
similarity index 93%
rename from llm/qwen/dpo_argument.json
rename to llm/config/qwen/dpo_argument.json
index 19884cfaefc0..716cdba59da6 100644
--- a/llm/qwen/dpo_argument.json
+++ b/llm/config/qwen/dpo_argument.json
@@ -1,5 +1,5 @@
{
- "model_name_or_path": "qwen/qwen-7b",
+ "model_name_or_path": "Qwen/Qwen2-7B",
"train_dataset_path": "./data/train.jsonl",
"dev_dataset_path": "./data/dev.jsonl",
"output_dir": "./checkpoints/dpo_ckpts",
@@ -32,6 +32,7 @@
"recompute_granularity": "full",
"dpo_beta": 0.1,
"benchmark": false,
+ "unified_checkpoint": true,
"dpo_loss_type": "sigmoid",
"dpo_label_smoothing": 0.0,
"autotuner_benchmark":false
diff --git a/llm/qwen/lora_argument.json b/llm/config/qwen/lora_argument.json
similarity index 82%
rename from llm/qwen/lora_argument.json
rename to llm/config/qwen/lora_argument.json
index 321a2ee3354f..aeb0d5d61f92 100644
--- a/llm/qwen/lora_argument.json
+++ b/llm/config/qwen/lora_argument.json
@@ -1,7 +1,7 @@
{
- "model_name_or_path": "qwen/qwen-7b",
+ "model_name_or_path": "Qwen/Qwen2-7B",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/qwen_lora_ckpts",
+ "output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -27,6 +27,8 @@
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"lora": true,
+ "unified_checkpoint": true,
"zero_padding": false,
- "use_flash_attention": false
+ "use_flash_attention": true,
+ "pissa": false
}
diff --git a/llm/qwen/pretrain_argument_stage2.json b/llm/config/qwen/pretrain_argument.json
similarity index 84%
rename from llm/qwen/pretrain_argument_stage2.json
rename to llm/config/qwen/pretrain_argument.json
index 1345021f3d19..99d37d832874 100644
--- a/llm/qwen/pretrain_argument_stage2.json
+++ b/llm/config/qwen/pretrain_argument.json
@@ -1,8 +1,8 @@
{
- "model_name_or_path": "qwen/qwen-7b",
- "tokenizer_name_or_path": "qwen/qwen-7b",
+ "model_name_or_path": "Qwen/Qwen2-7B",
+ "tokenizer_name_or_path": "Qwen/Qwen2-7B",
"input_dir": "./data",
- "output_dir": "./checkpoints/qwen_pretrain_ckpts",
+ "output_dir": "./checkpoints/pretrain_ckpts",
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 1,
"per_device_eval_batch_size": 2,
@@ -35,5 +35,6 @@
"recompute": true,
"distributed_dataloader": 1,
"recompute_granularity": "full",
+ "unified_checkpoint": true,
"save_total_limit": 2
}
diff --git a/llm/llama/pt_argument.json b/llm/config/qwen/pt_argument.json
similarity index 81%
rename from llm/llama/pt_argument.json
rename to llm/config/qwen/pt_argument.json
index 501e09c47160..b70e4a144c75 100644
--- a/llm/llama/pt_argument.json
+++ b/llm/config/qwen/pt_argument.json
@@ -1,7 +1,7 @@
{
- "model_name_or_path": "facebook/llama-7b",
+ "model_name_or_path": "Qwen/Qwen2-7B",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/llama_pt_ckpts",
+ "output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -14,7 +14,7 @@
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
- "fp16": true,
+ "bf16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
@@ -27,6 +27,5 @@
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"prefix_tuning": true,
- "zero_padding": false,
- "use_flash_attention": false
+ "use_flash_attention": true
}
diff --git a/llm/qwen/sft_argument.json b/llm/config/qwen/sft_argument.json
similarity index 78%
rename from llm/qwen/sft_argument.json
rename to llm/config/qwen/sft_argument.json
index 38daa1d0f293..21b1e0da7f74 100644
--- a/llm/qwen/sft_argument.json
+++ b/llm/config/qwen/sft_argument.json
@@ -1,7 +1,7 @@
{
- "model_name_or_path": "qwen/qwen-7b",
+ "model_name_or_path": "Qwen/Qwen2-7B",
"dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/qwen_sft_ckpts",
+ "output_dir": "./checkpoints/sft_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
@@ -24,8 +24,10 @@
"metric_for_best_model": "accuracy",
"recompute": true,
"save_total_limit": 1,
- "tensor_parallel_degree": 4,
+ "tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
+ "sharding": "stage2",
"zero_padding": false,
- "use_flash_attention": false
+ "unified_checkpoint": true,
+ "use_flash_attention": true
}
diff --git a/llm/docs/chat_template.md b/llm/docs/chat_template.md
index 6c9e699c8468..e8ad37167f26 100644
--- a/llm/docs/chat_template.md
+++ b/llm/docs/chat_template.md
@@ -36,14 +36,14 @@
...
```
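-Next, pass the constructed `chat_template.json` file to the `llm/finetune_generation.py` module:
+Next, pass the constructed `chat_template.json` file to the `llm/run_finetune.py` module:
* Use the model's built-in chat-template
> Not all models support chat-template yet; PaddleNLP is actively adding support. Whether a model supports chat-template can be determined by whether a `chat_template.json` file is downloaded with it.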
```shell
-python finetune_generation.py ... --model_name_or_path qwen/qwen-7b-chat --chat_template qwen/qwen-7b-chat
+python run_finetune.py ... --model_name_or_path qwen/qwen-7b-chat --chat_template qwen/qwen-7b-chat
```
When the `chat_template` argument is identical to the `model_name_or_path` argument, the model's built-in `chat_template.json` file is used by default.
@@ -51,7 +51,7 @@ python finetune_generation.py ... --model_name_or_path qwen/qwen-7b-chat --chat_
* Use a custom chat-template
```shell
-python finetune_generation.py ... --chat_template ./qwen_14b_chat_template.json
+python run_finetune.py ... --chat_template ./qwen_14b_chat_template.json
```
1. When the `chat_template` argument is identical to the `model_name_or_path` argument, the model's built-in `chat_template.json` file is used by default.
diff --git a/llm/docs/finetune.md b/llm/docs/finetune.md
index 79bd7eb84dfe..b590a09739b7 100644
--- a/llm/docs/finetune.md
+++ b/llm/docs/finetune.md
@@ -70,28 +70,21 @@ git clone 代码到本地,即可开始。
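SFT (Supervised Fine-Tuning) full-parameter fine-tuning builds on PaddlePaddle's [4D hybrid distributed parallelism](https://ai.baidu.com/forum/topic/show/987996) and supports easily switching, via the Trainer API, among distributed training strategies such as data parallelism (DP), [tensor parallelism (TP)](https://arxiv.org/abs/1909.08053), and [pipeline parallelism (PP)](https://arxiv.org/abs/1811.06965) (currently supported only for Llama).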
```
-# Tensor-parallel distributed training (common)
-python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_argument.json
-
-# ChatGLM2 and OPT do not currently support tensor parallelism; the Sharding strategy is used by default
-python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./chatglm2/sft_argument.json
-
-# Tensor-parallel & pipeline-parallel distributed training (currently Llama only)
-python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_pp_argument.json
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
```
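1. Setting `zero_padding` to True helps improve training efficiency. We recommend setting `per_device_train_batch_size` to 1, controlling the batch size with `gradient_accumulation_steps`, and adjusting `max_length` as appropriate.
2. Set `use_flash_attention` to True to use FlashAttention.
+3. The SFT API supports 4D parallel strategies, which can be adjusted via `tensor_parallel_degree`, `pipeline_parallel_degree`, `sharding`, and `sharding_parallel_degree`.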
### 2.4 LoRA
```
# Single-GPU training
-python finetune_generation.py ./llama/lora_argument.json
+python run_finetune.py ./config/llama/lora_argument.json
-# Tensor-parallel distributed training (ChatGLM2 and OPT do not support tensor parallelism)
-# Set tensor_parallel_degree to 2 in lora_argument.json
-python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./llama/lora_argument.json
+# Tensor-parallel distributed training
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/lora_argument.json
```
**Note:**
@@ -107,11 +100,10 @@ python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./
```
# Single-GPU training
-python finetune_generation.py ./llama/pt_argument.json
+python run_finetune.py ./config/llama/pt_argument.json
-# Tensor-parallel distributed training (ChatGLM2 and OPT do not support tensor parallelism)
-# Set tensor_parallel_degree to 2 in pt_argument.json
-python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./llama/pt_argument.json
+# Tensor-parallel distributed training
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/pt_argument.json
```
**Note:**
@@ -198,7 +190,7 @@ python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./
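## 4. Merging Distributed-Strategy Parameters
-When training with tensor parallelism (TP) and pipeline parallelism (PP), parameters in intermediate checkpoints are usually stored as multiple TP and PP shards to save the time of merging TP parameters. The provided shard-merging script can be used to merge them.
+**If unified_checkpoint is enabled, parameter merging is not needed.** When training with tensor parallelism (TP) and pipeline parallelism (PP), parameters in intermediate checkpoints are usually stored as multiple TP and PP shards to save the time of merging TP parameters. The provided shard-merging script can be used to merge them.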
```
python merge_tp_and_pp_params.py \
@@ -216,16 +208,18 @@ python merge_tp_and_pp_params.py \
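To facilitate subsequent **compression** and **static-graph inference**, we provide a LoRA parameter merging script that merges the LoRA parameters into the base model and saves the resulting weights.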
```
python merge_lora_params.py \
- --lora_path ./checkpoints/llama_lora_ckpts \
- --merge_lora_model_path ./checkpoints/llama_lora_merge \
+ --model_name_or_path ./checkpoints/sft_ckpts \
+ --lora_path ./checkpoints/lora_ckpts \
+ --output_path ./checkpoints/lora_merge \
--device "gpu" \
- --low_gpu_mem True
+ --safe_serialization True
```
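- `lora_path`: Path to the LoRA parameters and configuration, used to initialize the LoRA parameters. Defaults to None.
+- `model_name_or_path`: Required. Path to the backbone model parameters. Defaults to None.
- `merge_model_path`: Required. Path where the merged parameters are saved. Defaults to None.
- `device`: Runtime device. Defaults to gpu.
-- `low_gpu_mem`: Reduces the GPU memory required during merging. Defaults to False. Enable it if GPU memory is insufficient during merging.
+- `safe_serialization`: Whether to save in safetensors format. Defaults to True.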
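
As a rough illustration of what merging LoRA parameters into the backbone means numerically, the sketch below folds a low-rank update into a weight matrix. It is not the code of `merge_lora_params.py`: the function names, the `[in, out]` weight layout, and the `lora_alpha / r` scaling are assumptions taken from the usual LoRA formulation, and plain Python lists stand in for tensors.

```python
# Illustrative sketch only; NOT the implementation of merge_lora_params.py.
# Assumed shapes: weight [in, out], lora_a [in, r], lora_b [r, out].
# Assumed scaling: lora_alpha / r, as in the usual LoRA formulation.


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_a @ lora_b) as a new nested list."""
    scaling = lora_alpha / r
    delta = matmul(lora_a, lora_b)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # backbone weight, shape [2, 2]
lora_a = [[1.0], [2.0]]            # shape [2, 1], rank r = 1
lora_b = [[0.5, -0.5]]             # shape [1, 2]
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

After such a merge the adapter matrices are no longer needed at inference time, which is presumably why the merged weights are convenient for compression and static-graph export.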
diff --git a/llm/docs/inference.md b/llm/docs/inference.md
index a20e3a32d614..9660778a22ef 100644
--- a/llm/docs/inference.md
+++ b/llm/docs/inference.md
@@ -17,7 +17,7 @@ PaddleNLP 提供了动态图推理和静态图推理两种方式,方便用户
### 1.1 Dynamic-Graph Inference
```shell
# Reference command for dynamic-graph model inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file ./data/dev.json --dtype float16
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file ./data/dev.json --dtype float16
```
For LoRA and PrefixTuning models, simply pass the corresponding lora_path or prefix_path in addition, e.g. `--lora_path ./checkpoints/llama_lora_ckpts` or `--prefix_path ./checkpoints/llama_prefix_ckpts`; see the inference parameter description for details.
@@ -26,9 +26,9 @@ python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file
```shell
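# Reference commands for static-graph model inference; LoRA requires merging parameters first, and Prefix Tuning is not yet supported
# step1: static-graph export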
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --output_path ./inference --dtype float16
# step2: static-graph inference
-python predictor.py --model_name_or_path ./inference --data_file ./data/dev.json --dtype float16 --mode static
+python ./predict/predictor.py --model_name_or_path ./inference --data_file ./data/dev.json --dtype float16 --mode static
```
## 2. High-Performance Model Inference
@@ -86,7 +86,7 @@ git clone https://github.com/PaddlePaddle/PaddleNLP
# Install custom operators on GPU devices
cd ./paddlenlp/csrc && python setup_cuda.py install
# Install custom operators on XPU devices
-cd ./paddlenlp/csrc/xpu/src && sh cmake_build.sh
+cd ./paddlenlp/csrc/xpu/src && sh cmake_build.sh
```
### 2.3 High-Performance Inference with BlockAttention Disabled
@@ -95,16 +95,16 @@ cd ./paddlenlp/csrc/xpu/src && sh cmake_build.sh
```shell
# Reference command for dynamic-graph model inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16
# Reference command for PrefixTuning dynamic-graph inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --export_precache true --prefix_path ./checkpoints/llama_prefix_ckpts
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --export_precache true --prefix_path ./checkpoints/llama_prefix_ckpts
# Reference command for Weight Only Int8 dynamic-graph inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --quant_type weight_only_int8
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --quant_type weight_only_int8
# Reference command for PTQ-A8W8 inference
-python predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --dtype float16
+python ./predict/predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --dtype float16
```
**Note**:
1. LoRA models require parameter merging before inference; for details, see [Merging LoRA Parameters](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/merge_lora_params.py).
@@ -115,16 +115,16 @@ python predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference
**step1: dynamic-to-static conversion**
```shell
# Reference command for dynamic-to-static conversion
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
# Reference command for PrefixTuning dynamic-to-static conversion
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --export_precache true
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --export_precache true
# Reference command for Weight Only Int8 dynamic-to-static conversion
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --quant_type weight_only_int8
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --quant_type weight_only_int8
# Reference command for PTQ-A8W8 dynamic-to-static conversion
-python export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --output_path ./inference --dtype float16
```
**Note**:
1. LoRA models require parameter merging before inference; for details, see [Merging LoRA Parameters](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/merge_lora_params.py).
@@ -135,13 +135,13 @@ python export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --infere
**step2: static-graph inference**
```shell
# Reference command for static-graph inference
-python predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static"
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static"
# Reference command for PrefixTuning static-graph inference
-python predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static" --export_precache true --prefix_path ./checkpoints/llama_prefix_ckpts
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static" --export_precache true --prefix_path ./checkpoints/llama_prefix_ckpts
# Reference command for Weight Only Int8 static-graph inference
-python predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static" --quant_type weight_only_int8
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static" --quant_type weight_only_int8
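# Reference command for PTQ-A8W8 static-graph inference
# The following environment variables enable algorithm selection for int8 matrix multiplication to speed up inference; once enabled, the first run performs the algorithm search and is therefore slower.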
@@ -149,7 +149,7 @@ export FLAGS_use_autotune=1
export FLAGS_cublaslt_exhaustive_search_times=10
export FLAGS_cache_inference_while_scope=1
-python predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static"
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static"
```
**Note**:
1. LoRA models require parameter merging before inference; for details, see [Merging LoRA Parameters](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/merge_lora_params.py).
@@ -164,50 +164,50 @@ python predictor.py --model_name_or_path ./inference --inference_model --quant_
```shell
# Reference command for dynamic-graph model inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --block_attn
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --block_attn
# Reference command for dynamic-graph model inference on XPU devices
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --block_attn --device xpu
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --block_attn --device xpu
# Reference command for Weight Only Int8 dynamic-graph inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --quant_type weight_only_int8 --block_attn
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --quant_type weight_only_int8 --block_attn
# Reference command for PTQ-A8W8 inference
-python predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --dtype float16 --block_attn
+python ./predict/predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --dtype float16 --block_attn
# Reference command for CacheKV dynamic quantization inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --block_attn --cachekv_int8
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16 --block_attn --cachekv_int8
```
#### 2.4.2 Static-Graph Inference
**step1: dynamic-to-static conversion**
```shell
# Reference command for dynamic-to-static conversion
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --block_attn
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --block_attn
# Reference command for dynamic-to-static conversion on XPU devices
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --block_attn --device xpu
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --block_attn --device xpu
# Reference command for Weight Only Int8 dynamic-to-static conversion
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --quant_type weight_only_int8 --block_attn
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --quant_type weight_only_int8 --block_attn
# Reference command for PTQ-A8W8 dynamic-to-static conversion
-python export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --output_path ./inference --dtype float16 --block_attn
+python ./predict/export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --inference_model --output_path ./inference --dtype float16 --block_attn
# Reference command for CacheKV dynamic quantization dynamic-to-static conversion
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --block_attn --cachekv_int8
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16 --block_attn --cachekv_int8
```
**step2: static-graph inference**
```shell
# Reference command for static-graph inference
-python predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --block_attn
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --block_attn
# Reference command for static-graph inference on XPU devices
-python predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --block_attn --device xpu
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --block_attn --device xpu
# Reference command for Weight Only Int8 static-graph inference
-python predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --quant_type weight_only_int8 --block_attn
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --quant_type weight_only_int8 --block_attn
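# Reference command for PTQ-A8W8 static-graph inference
# The following environment variables enable algorithm selection for int8 matrix multiplication to speed up inference; once enabled, the first run performs the algorithm search and is therefore slower.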
@@ -215,10 +215,10 @@ export FLAGS_use_autotune=1
export FLAGS_cublaslt_exhaustive_search_times=10
export FLAGS_cache_inference_while_scope=1
-python predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --block_attn
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --block_attn
# Reference command for CacheKV dynamic int8 quantization static-graph inference
-python predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --cachekv_int8 --block_attn
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static" --cachekv_int8 --block_attn
```
**Note**:
1. Weight Only Int8 inference requires additionally passing `quant_type`.
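
To make the idea behind weight-only int8 more concrete, here is a minimal sketch of symmetric per-channel absmax quantization. It only illustrates the general technique. The kernels selected by `--quant_type weight_only_int8` are not shown in this diff and may differ, and the function names and the one-scale-per-output-column layout are assumptions.

```python
# Generic sketch of symmetric per-channel int8 weight quantization.
# Assumption: one scale per output column of a [in, out] weight matrix.


def quantize_per_channel(weight):
    """Quantize each column of `weight` to integers in [-127, 127]."""
    columns = list(zip(*weight))
    # An all-zero column would give scale 0.0; fall back to 1.0 to avoid dividing by zero.
    scales = [max(abs(v) for v in col) / 127.0 or 1.0 for col in columns]
    quantized = [[int(round(v / s)) for v, s in zip(row, scales)] for row in weight]
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate floating-point weights from int8 values and scales."""
    return [[q * s for q, s in zip(row, scales)] for row in quantized]


weight = [[0.8, -2.0], [-0.1, 0.5]]
quantized, scales = quantize_per_channel(weight)
restored = dequantize(quantized, scales)
print(quantized, scales, restored)
```

Only the weights are stored in int8 here; activations stay in floating point, which is what distinguishes weight-only schemes from A8W8 quantization such as the PTQ variant mentioned above.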
diff --git a/llm/docs/pretrain.rst b/llm/docs/pretrain.rst
index 987e6c53f90d..d0fd203b97e3 100644
--- a/llm/docs/pretrain.rst
+++ b/llm/docs/pretrain.rst
@@ -68,10 +68,10 @@ git clone 代码到本地,即可开始。
cd ../model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -
# llama model pretraining
- python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./llama/pretrain-llama2_7b-tp2sd4_stage2.json
+ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
# Qwen model pretraining
- python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./qwen/pretrain_argument_stage2.json
+ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/qwen/pretrain_argument.json
Note:
diff --git a/llm/docs/quantization.md b/llm/docs/quantization.md
index 101c18f4441a..eadaa77397a2 100644
--- a/llm/docs/quantization.md
+++ b/llm/docs/quantization.md
@@ -58,19 +58,19 @@ git clone 代码到本地,即可开始。
### 2.3 PTQ Quantization
```
-python finetune_generation.py ./llama/ptq_argument.json
+python run_finetune.py ./config/llama/ptq_argument.json
```
### 2.4 GPTQ Quantization
```
-python finetune_generation.py ./llama/gptq_argument.json
+python run_finetune.py ./config/llama/gptq_argument.json
```
### 2.5 AWQ Quantization
```
-python finetune_generation.py ./llama/awq_argument.json
+python run_finetune.py ./config/llama/awq_argument.json
```
### 2.6 Quantization Parameters
diff --git a/llm/ernie-3.5-se/README.md b/llm/experimental/ernie-3.5-se/README.md
similarity index 100%
rename from llm/ernie-3.5-se/README.md
rename to llm/experimental/ernie-3.5-se/README.md
diff --git a/llm/ernie-3.5-se/configuration.py b/llm/experimental/ernie-3.5-se/configuration.py
similarity index 100%
rename from llm/ernie-3.5-se/configuration.py
rename to llm/experimental/ernie-3.5-se/configuration.py
diff --git a/llm/ernie-3.5-se/conversion_utils.py b/llm/experimental/ernie-3.5-se/conversion_utils.py
similarity index 100%
rename from llm/ernie-3.5-se/conversion_utils.py
rename to llm/experimental/ernie-3.5-se/conversion_utils.py
diff --git a/llm/ernie-3.5-se/data.py b/llm/experimental/ernie-3.5-se/data.py
similarity index 100%
rename from llm/ernie-3.5-se/data.py
rename to llm/experimental/ernie-3.5-se/data.py
diff --git a/llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model b/llm/experimental/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model
similarity index 100%
rename from llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model
rename to llm/experimental/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model
diff --git a/llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json b/llm/experimental/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json
similarity index 100%
rename from llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json
rename to llm/experimental/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json
diff --git a/llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json b/llm/experimental/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json
similarity index 100%
rename from llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json
rename to llm/experimental/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json
diff --git a/llm/ernie-3.5-se/ernie_dataset.py b/llm/experimental/ernie-3.5-se/ernie_dataset.py
similarity index 100%
rename from llm/ernie-3.5-se/ernie_dataset.py
rename to llm/experimental/ernie-3.5-se/ernie_dataset.py
diff --git a/llm/ernie-3.5-se/finetune_generation.py b/llm/experimental/ernie-3.5-se/finetune_generation.py
similarity index 100%
rename from llm/ernie-3.5-se/finetune_generation.py
rename to llm/experimental/ernie-3.5-se/finetune_generation.py
diff --git a/llm/ernie-3.5-se/modeling.py b/llm/experimental/ernie-3.5-se/modeling.py
similarity index 100%
rename from llm/ernie-3.5-se/modeling.py
rename to llm/experimental/ernie-3.5-se/modeling.py
diff --git a/llm/ernie-3.5-se/predict_generation.py b/llm/experimental/ernie-3.5-se/predict_generation.py
similarity index 100%
rename from llm/ernie-3.5-se/predict_generation.py
rename to llm/experimental/ernie-3.5-se/predict_generation.py
diff --git a/llm/ernie-3.5-se/run_pretrain.py b/llm/experimental/ernie-3.5-se/run_pretrain.py
similarity index 100%
rename from llm/ernie-3.5-se/run_pretrain.py
rename to llm/experimental/ernie-3.5-se/run_pretrain.py
diff --git a/llm/ernie-3.5-se/run_trainer_stage2.sh b/llm/experimental/ernie-3.5-se/run_trainer_stage2.sh
similarity index 100%
rename from llm/ernie-3.5-se/run_trainer_stage2.sh
rename to llm/experimental/ernie-3.5-se/run_trainer_stage2.sh
diff --git a/llm/ernie-3.5-se/tokenizer.py b/llm/experimental/ernie-3.5-se/tokenizer.py
similarity index 100%
rename from llm/ernie-3.5-se/tokenizer.py
rename to llm/experimental/ernie-3.5-se/tokenizer.py
diff --git a/llm/ernie-3.5-se/utils.py b/llm/experimental/ernie-3.5-se/utils.py
similarity index 100%
rename from llm/ernie-3.5-se/utils.py
rename to llm/experimental/ernie-3.5-se/utils.py
diff --git a/llm/llama/run_sharding_v2.sh b/llm/experimental/scripts/run_sharding_v2.sh
similarity index 100%
rename from llm/llama/run_sharding_v2.sh
rename to llm/experimental/scripts/run_sharding_v2.sh
diff --git a/llm/llama/run_trainer.sh b/llm/experimental/scripts/run_trainer.sh
similarity index 100%
rename from llm/llama/run_trainer.sh
rename to llm/experimental/scripts/run_trainer.sh
diff --git a/llm/llama/run_trainer_tp2cp2.sh b/llm/experimental/scripts/run_trainer_tp2cp2.sh
similarity index 100%
rename from llm/llama/run_trainer_tp2cp2.sh
rename to llm/experimental/scripts/run_trainer_tp2cp2.sh
diff --git a/llm/llama/run_trainer_tp4pp2.sh b/llm/experimental/scripts/run_trainer_tp4pp2.sh
similarity index 100%
rename from llm/llama/run_trainer_tp4pp2.sh
rename to llm/experimental/scripts/run_trainer_tp4pp2.sh
diff --git a/llm/llama/run_trainer_tp4sep2.sh b/llm/experimental/scripts/run_trainer_tp4sep2.sh
similarity index 100%
rename from llm/llama/run_trainer_tp4sep2.sh
rename to llm/experimental/scripts/run_trainer_tp4sep2.sh
diff --git a/llm/fused_layers.py b/llm/fused_layers.py
deleted file mode 120000
index b183f45159cc..000000000000
--- a/llm/fused_layers.py
+++ /dev/null
@@ -1 +0,0 @@
-llama/fused_layers.py
\ No newline at end of file
diff --git a/llm/gemma/sft_argument_7b.json b/llm/gemma/sft_argument_7b.json
deleted file mode 100644
index 16eba55bed9e..000000000000
--- a/llm/gemma/sft_argument_7b.json
+++ /dev/null
@@ -1,32 +0,0 @@
-{
- "model_name_or_path": "google/gemma-7b",
- "dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/gemma_sft_ckpts",
- "per_device_train_batch_size": 8,
- "gradient_accumulation_steps": 1,
- "per_device_eval_batch_size": 8,
- "eval_accumulation_steps":1,
- "num_train_epochs": 3,
- "learning_rate": 3e-06,
- "warmup_steps": 30,
- "logging_steps": 1,
- "evaluation_strategy": "epoch",
- "save_strategy": "epoch",
- "src_length": 512,
- "max_length": 1024,
- "bf16": true,
- "fp16_opt_level": "O2",
- "do_train": true,
- "do_eval": true,
- "do_predict": true,
- "disable_tqdm": true,
- "load_best_model_at_end": true,
- "eval_with_do_generation": false,
- "metric_for_best_model": "accuracy",
- "recompute": true,
- "save_total_limit": 1,
- "tensor_parallel_degree": 8,
- "pipeline_parallel_degree": 1,
- "zero_padding": false,
- "use_flash_attention": false
-}
\ No newline at end of file
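
The parallelism fields in a config like the one above interact with the number of launched GPUs. The sketch below shows the generic arithmetic under stated assumptions: the 8-GPU world size is assumed (the config itself does not state it), the helper names are hypothetical, and the "replicas" count lumps data-parallel and sharding groups together. The trainer's exact accounting is not part of this diff.

```python
# Generic arithmetic sketch; helper names are hypothetical, not PaddleNLP APIs.


def data_replicas(num_gpus, tensor_parallel_degree=1, pipeline_parallel_degree=1):
    """Number of model replicas left after splitting GPUs across TP and PP."""
    model_parallel = tensor_parallel_degree * pipeline_parallel_degree
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP degree * PP degree")
    return num_gpus // model_parallel


def global_batch_size(per_device_train_batch_size, gradient_accumulation_steps, replicas):
    """Samples consumed per optimizer step across all replicas."""
    return per_device_train_batch_size * gradient_accumulation_steps * replicas


# Values from the config above, with an assumed launch on 8 GPUs.
replicas = data_replicas(num_gpus=8, tensor_parallel_degree=8, pipeline_parallel_degree=1)
print(replicas, global_batch_size(8, 1, replicas))
```

With tensor parallelism spanning all 8 GPUs there is a single replica, so the per-device batch size and the accumulation steps alone determine the global batch size.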
diff --git a/llm/gemma/sft_argument_7b_sharding.json b/llm/gemma/sft_argument_7b_sharding.json
deleted file mode 100644
index ca04affdb243..000000000000
--- a/llm/gemma/sft_argument_7b_sharding.json
+++ /dev/null
@@ -1,33 +0,0 @@
-{
- "model_name_or_path": "google/gemma-7b",
- "dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/llama_sft_ckpts",
- "per_device_train_batch_size": 1,
- "gradient_accumulation_steps": 1,
- "per_device_eval_batch_size": 8,
- "eval_accumulation_steps":1,
- "num_train_epochs": 3,
- "learning_rate": 3e-06,
- "warmup_steps": 30,
- "logging_steps": 1,
- "evaluation_strategy": "epoch",
- "save_strategy": "epoch",
- "src_length": 1024,
- "max_length": 2048,
- "fp16": true,
- "fp16_opt_level": "O2",
- "do_train": true,
- "do_eval": true,
- "do_predict": true,
- "disable_tqdm": true,
- "load_best_model_at_end": true,
- "eval_with_do_generation": false,
- "metric_for_best_model": "accuracy",
- "recompute": true,
- "save_total_limit": 1,
- "sharding_parallel_degree": 8,
- "sharding": "stage3",
- "pipeline_parallel_degree": 1,
- "zero_padding": false,
- "use_flash_attention": false
-}
\ No newline at end of file
diff --git a/llm/gemma/sft_argument_sharding.json b/llm/gemma/sft_argument_sharding.json
deleted file mode 100644
index d462645e2235..000000000000
--- a/llm/gemma/sft_argument_sharding.json
+++ /dev/null
@@ -1,31 +0,0 @@
-{
- "model_name_or_path": "google/gemma-2b/",
- "dataset_name_or_path": "./data",
- "output_dir": "./checkpoints/chatglm2_sft_ckpts",
- "per_device_train_batch_size": 1,
- "gradient_accumulation_steps": 1,
- "per_device_eval_batch_size": 1,
- "eval_accumulation_steps":1,
- "num_train_epochs": 3,
- "learning_rate": 3e-05,
- "warmup_steps": 30,
- "logging_steps": 1,
- "evaluation_strategy": "epoch",
- "save_strategy": "epoch",
- "src_length": 512,
- "max_length": 1024,
- "fp16": true,
- "fp16_opt_level": "O2",
- "do_train": true,
- "do_eval": true,
- "disable_tqdm": true,
- "load_best_model_at_end": true,
- "eval_with_do_generation": false,
- "metric_for_best_model": "accuracy",
- "recompute": true,
- "save_total_limit": 1,
- "sharding_parallel_degree": 2,
- "sharding": "stage3",
- "zero_padding": false,
- "use_flash_attention": false
- }
\ No newline at end of file
diff --git a/llm/glm/README.md b/llm/glm/README.md
deleted file mode 100644
index 86bc69d571e6..000000000000
--- a/llm/glm/README.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# GLM
-
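-## 1. Model Introduction
-
-[General Language Model (GLM)](https://arxiv.org/abs/2103.10360) is a general-purpose language model trained with an autoregressive blank-infilling objective; it can be used for a wide range of understanding and generation tasks.
-
-Existing pretraining frameworks include autoencoding models represented by BERT, autoregressive models represented by GPT, and encoder-decoder models represented by T5. However, none of these frameworks fully supports all three major task types: natural language understanding, unconditional generation, and conditional generation. To address this, we propose the General Language Model (GLM), based on an autoregressive blank-infilling task. GLM improves blank-infilling pretraining with 2D positional encoding and arbitrary-order prediction, and outperforms BERT and T5 on natural language understanding tasks. GLM's pretraining is also based on multiple tasks, with varying blank lengths and counts. On natural language understanding, unconditional generation, and conditional generation tasks, GLM outperforms BERT, T5, and GPT models with the same parameter scale and training data size. In addition, GLM achieves state-of-the-art results with 1.25 times the parameters of BERT Large, demonstrating good generalization across different downstream tasks.
-
-
-**Supported model weights:**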
-
-| Model |
-|----------------------------------|
-| THUDM/glm-large-chinese |
-| THUDM/glm-10b-chinese |
-
-## 3. Model Fine-Tuning
-
-### SFT
-
-```
-python -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py \
---model_name_or_path THUDM/glm-large-chinese \
---num_train_epochs 4 \
---learning_rate 3e-5 \
---warmup_ratio 0.06 \
---weight_decay 0.1 \
---label_smoothing 0.1 \
---save_steps 100 \
---logging_steps 1 \
---eval_steps 100 \
---output_dir ./checkpoints/glm-large-chinese \
---src_length 608 \
---tgt_length 160 \
---min_tgt_length 55 \
---length_penalty 0.7 \
---no_repeat_ngram_size 3 \
---num_beams 5 \
---select_topk True \
---per_device_eval_batch_size 2 \
---per_device_train_batch_size 2 \
---max_grad_norm 1.0 \
---lr_scheduler_type linear \
---fp16 \
---fp16_opt_level O2 \
---recompute \
---do_train \
---do_eval
-```
-
-### Single-GPU LoRA Fine-Tuning
-
-```
-python finetune_generation.py \
---model_name_or_path THUDM/glm-large-chinese \
---num_train_epochs 4 \
---learning_rate 3e-5 \
---warmup_ratio 0.06 \
---weight_decay 0.1 \
---label_smoothing 0.1 \
---save_steps 100 \
---logging_steps 1 \
---eval_steps 100 \
---output_dir ./checkpoints/glm-large-chinese \
---src_length 608 \
---tgt_length 160 \
---min_tgt_length 55 \
---length_penalty 0.7 \
---no_repeat_ngram_size 3 \
---num_beams 5 \
---select_topk True \
---per_device_eval_batch_size 2 \
---per_device_train_batch_size 2 \
---max_grad_norm 1.0 \
---lr_scheduler_type linear \
---fp16 \
---fp16_opt_level O2 \
---recompute \
---do_train \
---do_eval \
---lora True
-```
-
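-The parameters are described below:
-
-- `model_name_or_path`: Built-in name of the pretrained model or the directory containing the model. Defaults to `THUDM/glm-large-chinese`.
-- `src_length`: Maximum input length of the context. Defaults to 608.
-- `tgt_length`: Maximum length of the generated text. Defaults to 160.
-- `min_tgt_length`: Minimum length of the generated text. Defaults to 55.
-- `length_penalty`: Length penalty factor used during generation decoding. Defaults to 0.7.
-- `num_beams`: Number of beams used for search. Defaults to 5.
-- `label_smoothing`: Label smoothing factor. Defaults to 0.1.
-- `lr_decay_ratio`: Learning rate decay factor. Defaults to 0.1.
-- `lora`: Whether to use LoRA.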
-
-
-## 3.4 Dynamic-Graph Inference
-
-```
-python predict_generation.py \
- --model_name_or_path THUDM/glm-large-chinese
-```
diff --git a/llm/glm/data.py b/llm/glm/data.py
deleted file mode 100644
index 40f5f3320a64..000000000000
--- a/llm/glm/data.py
+++ /dev/null
@@ -1,67 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import numpy as np
-
-
-def custom_convert_example(example, tokenizer, data_args, is_test=True):
- source = None
- title = None
- target = None
- if "source" in example and "title" in example:
- source = example["source"]
- if "title" in example.keys():
- title = example["title"]
- elif "context" in example and "answer" in example:
- source = example["context"]
- if "answer" in example.keys():
- title = example["answer"]
- else:
- assert False, "Source and title are not in the input dictionary, nor are context and answer."
- if "target" in example.keys():
- target = example["target"]
- elif "question" in example.keys():
- target = example["question"]
- example["text_a"] = "答案:" + title + "," + "上下文:" + source
- example["text_b"] = "在已知答案的前提下,问题:" + target
- inputs = tokenizer.encode(example["text_a"], max_length=data_args.src_length - 1, truncation=True)
- inputs["input_ids"] = inputs["input_ids"][:-1] + [tokenizer.gmask_token_id] + inputs["input_ids"][-1:]
- pad_length = data_args.src_length - len(inputs["input_ids"])
- inputs["input_ids"] = np.array([inputs["input_ids"] + [tokenizer.pad_token_id] * pad_length])
- inputs["attention_mask"] = np.array([inputs["attention_mask"] + [1] + [0] * pad_length])
- sep = inputs["input_ids"].shape[1]
- inputs = tokenizer.build_inputs_for_generation(
- inputs,
- max_gen_length=data_args.tgt_length,
- targets=" " + example["text_b"] if not is_test else None,
- padding="max_length",
- )
-
- for input_name in inputs.keys():
- inputs[input_name] = inputs[input_name].squeeze(0)
- if is_test:
- inputs["position_ids"] = inputs["position_ids"][:, : inputs["input_ids"].shape[-1]]
- labels = tokenizer.encode(
- " " + example["text_b"], add_special_tokens=False, max_length=data_args.tgt_length - 1
- )["input_ids"]
- loss_mask = [0] * sep + [1] * len(labels) + [0] * (data_args.tgt_length - len(labels))
- labels = (
- [0] * sep
- + labels
- + [tokenizer.eop_token_id]
- + [tokenizer.pad_token_id] * (data_args.tgt_length - len(labels) - 1)
- )
- inputs["label_ids"] = labels
- inputs["loss_mask"] = loss_mask
- return inputs
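
The label and loss-mask layout assembled at the end of the removed `custom_convert_example` can be illustrated in isolation. The sketch below reproduces only that list arithmetic. The function name and the token ids in the example are hypothetical, and the length check is an added safeguard that the original code does not contain.

```python
# Stand-alone sketch of the label / loss-mask layout from the removed data.py.
# The function name and the example token ids are hypothetical.


def build_labels_and_loss_mask(sep, label_ids, tgt_length, eop_token_id, pad_token_id):
    """Lay out target tokens after `sep` source positions, padded to `tgt_length`."""
    if len(label_ids) > tgt_length - 1:
        raise ValueError("label_ids must leave room for the end-of-piece token")
    remaining = tgt_length - len(label_ids)
    # Source positions and padding get mask 0; only the target tokens get mask 1.
    loss_mask = [0] * sep + [1] * len(label_ids) + [0] * remaining
    # Labels: zeros over the source, then targets, end-of-piece token, and padding.
    labels = [0] * sep + label_ids + [eop_token_id] + [pad_token_id] * (remaining - 1)
    return labels, loss_mask


labels, loss_mask = build_labels_and_loss_mask(
    sep=3, label_ids=[11, 12], tgt_length=4, eop_token_id=2, pad_token_id=0
)
print(labels)     # [0, 0, 0, 11, 12, 2, 0]
print(loss_mask)  # [0, 0, 0, 1, 1, 0, 0]
```

Both lists have length `sep + tgt_length`. As in the original expressions, the position holding the end-of-piece token has a loss mask of 0.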
diff --git a/llm/glm/finetune_generation.py b/llm/glm/finetune_generation.py
deleted file mode 100644
index e8779d68f3ee..000000000000
--- a/llm/glm/finetune_generation.py
+++ /dev/null
@@ -1,188 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import sys
-from dataclasses import dataclass, field
-from functools import partial
-
-import paddle
-from data import custom_convert_example
-from utils import GLMTrainer
-
-from paddlenlp.data import DefaultDataCollator
-from paddlenlp.datasets import load_dataset
-from paddlenlp.metrics import BLEU, Rouge1, Rouge2, RougeL
-from paddlenlp.peft import LoRAConfig, LoRAModel
-from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint
-from paddlenlp.transformers import AutoModelForConditionalGeneration, AutoTokenizer
-from paddlenlp.utils.log import logger
-
-
-@dataclass
-class DataArgument:
- task_name: str = field(default="dureader_qg", metadata={"help": "The name of task."})
- src_length: int = field(default=608, metadata={"help": "The max length of source text."})
- tgt_length: int = field(default=160, metadata={"help": "The max length of target text."})
- min_tgt_length: int = field(default=55, metadata={"help": "The min length of target text."})
- length_penalty: float = field(default=0.7, metadata={"help": "The length penalty."})
- no_repeat_ngram_size: int = field(default=3, metadata={"help": "The no repeat ngram size."})
- num_beams: int = field(default=5, metadata={"help": "The number of beams."})
- select_topk: bool = field(default=True, metadata={"help": "Whether to select top k tokens for generation."})
- top_p: float = field(
- default=0.0, metadata={"help": "The cumulative probability for top-p-filtering in the 'sampling' strategy."}
- )
- top_k: int = field(
- default=0,
- metadata={
- "help": "The number of highest probability tokens to keep for top-k-filtering in the 'sampling' strategy."
- },
- )
- no_block_position: bool = field(default=False)
-
-
-@dataclass
-class ModelArgument:
- model_name_or_path: str = field(
- default="THUDM/glm-2b", metadata={"help": "Build-in pretrained model name or the path to local model."}
- )
- label_smoothing: float = field(default=0.1, metadata={"help": "The label smoothing parameter."})
- lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"})
- lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"})
-
-
-def main():
- parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments))
- if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
- model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
- else:
- model_args, data_args, training_args = parser.parse_args_into_dataclasses()
-
- training_args.print_config(model_args, "Model")
- training_args.print_config(data_args, "Data")
- setattr(training_args, "label_smoothing", model_args.label_smoothing)
- setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio)
-
- paddle.set_device(training_args.device)
-
- # Log on each process the small summary:
- logger.warning(
- f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, "
- + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}"
- )
-
- # Detecting last checkpoint.
- last_checkpoint = None
- if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
- last_checkpoint = get_last_checkpoint(training_args.output_dir)
- if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1:
- raise ValueError(
- f"Output directory ({training_args.output_dir}) already exists and is not empty. "
- "Use --overwrite_output_dir to overcome."
- )
- elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
- logger.info(
- f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
- "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
- )
-
- dtype = None
- if training_args.fp16_opt_level == "O2":
- if training_args.fp16:
- dtype = "float16"
- if training_args.bf16:
- dtype = "bfloat16"
-
- # Load the pretrained language model.
- model = AutoModelForConditionalGeneration.from_pretrained(
- model_args.model_name_or_path,
- output_predict=True,
- parallel_output=True,
- dtype=dtype, # todo enable set dtype to avoid additional mem usage
- tensor_parallel_degree=training_args.tensor_parallel_degree,
- tensor_parallel_rank=training_args.tensor_parallel_rank,
- )
- if model_args.lora:
- # TODO: hardcode parameters for now. Change after MergedLoRA is introduced
- lora_config = LoRAConfig(
- target_modules=[".*query_key_value.*"],
- r=4,
- lora_alpha=8,
- merge_weights=True,
- tensor_parallel_degree=training_args.tensor_parallel_degree,
- dtype=dtype,
- )
- model = LoRAModel(model, lora_config)
- model.mark_only_lora_as_trainable()
- model.print_trainable_parameters()
-
- tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
-
- # Load the dataset.
- train_ds, dev_ds = load_dataset(data_args.task_name, splits=["train", "dev"])
- trans_func = partial(custom_convert_example, tokenizer=tokenizer, data_args=data_args)
- train_ds = train_ds.map(partial(trans_func, is_test=False))
- test_ds = dev_ds.map(trans_func)
-
- collate_fn = DefaultDataCollator()
-
- def compute_metrics(eval_preds):
- rouge1 = Rouge1()
- rouge2 = Rouge2()
- rougel = RougeL()
- bleu4 = BLEU(n_size=4)
- predictions = [x[x != -100] for x in eval_preds.predictions]
- references = [x[x != -100] for x in eval_preds.label_ids]
-
- # for pred in predictions:
-
- rouge1_score = rouge1.score(predictions, references)
- rouge2_score = rouge2.score(predictions, references)
- for pred, ref in zip(predictions, references):
- rougel.add_inst(pred, [ref])
- bleu4.add_inst(pred, [ref])
- return {
- "rouge1": rouge1_score,
- "rouge2": rouge2_score,
- "rougel": rougel.score(),
- "bleu4": bleu4.score(),
- }
-
- trainer = GLMTrainer(
- model=model,
- args=training_args,
- train_dataset=train_ds,
- eval_dataset=dev_ds,
- tokenizer=tokenizer,
- compute_metrics=compute_metrics,
- do_generation=True,
- data_collator=collate_fn,
- )
- if training_args.fp16_opt_level == "O2":
- trainer.disable_autocast_context_manager()
-
- if training_args.do_train:
- train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
- trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1)
- trainer.log_metrics("train", train_result.metrics)
- trainer.save_metrics("train", train_result.metrics)
- trainer.save_state()
-
- if training_args.do_eval:
- eval_result = trainer.evaluate(test_ds)
- trainer.log_metrics("test", eval_result)
-
-
-if __name__ == "__main__":
- main()
diff --git a/llm/glm/predict_generation.py b/llm/glm/predict_generation.py
deleted file mode 100644
index 41dd6b3459af..000000000000
--- a/llm/glm/predict_generation.py
+++ /dev/null
@@ -1,151 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import paddle
-from paddle.distributed import fleet
-
-from paddlenlp.peft import LoRAConfig, LoRAModel
-from paddlenlp.transformers import (
- AutoConfig,
- AutoModelForConditionalGeneration,
- AutoTokenizer,
-)
-
-
-def parse_arguments():
- import argparse
-
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "--model_name_or_path", default="THUDM/glm-large-chinese", required=True, help="The directory of model."
- )
- parser.add_argument("--lora_path", default=None, help="The directory of LoRA parameters. Default to None")
- parser.add_argument(
- "--merge_tensor_parallel_path", default=None, help="The directory of model to merge tensor parallel parts."
- )
- parser.add_argument("--batch_size", type=int, default=2, help="The batch size of data.")
- parser.add_argument("--src_length", type=int, default=200, help="The maximum length of the source (input) sequence.")
- parser.add_argument("--tgt_length", type=int, default=20, help="The maximum number of tokens to generate.")
- return parser.parse_args()
-
-
-def batchfy_text(texts, batch_size):
- batch_texts = []
- batch_start = 0
- while batch_start < len(texts):
- batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]]
- batch_start += batch_size
- return batch_texts
-
-
-class Predictor(object):
- def __init__(self, args):
- self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
- self.batch_size = args.batch_size
- self.args = args
-
- tensor_parallel_degree = paddle.distributed.get_world_size()
- tensor_parallel_rank = 0
- if tensor_parallel_degree > 1:
- strategy = fleet.DistributedStrategy()
- strategy.hybrid_configs = {
- "dp_degree": 1,
- "mp_degree": tensor_parallel_degree,
- "pp_degree": 1,
- "sharding_degree": 1,
- }
- fleet.init(is_collective=True, strategy=strategy)
- hcg = fleet.get_hybrid_communicate_group()
- tensor_parallel_rank = hcg.get_model_parallel_rank()
-
- if self.args.lora_path is not None:
- lora_config = LoRAConfig.from_pretrained(self.args.lora_path)
- dtype = lora_config.dtype
- else:
- config = AutoConfig.from_pretrained(args.model_name_or_path)
- dtype = config.dtype if config.dtype is not None else "float32"
-
- self.model = AutoModelForConditionalGeneration.from_pretrained(
- args.model_name_or_path,
- tensor_parallel_degree=tensor_parallel_degree,
- tensor_parallel_rank=tensor_parallel_rank,
- dtype=dtype,
- )
- if self.args.lora_path is not None:
- self.model = LoRAModel.from_pretrained(self.model, self.args.lora_path)
- self.model.eval()
-
- def preprocess(self, input_text):
- input_text = [text.strip() + "[gMASK]" for text in input_text]
- inputs = self.tokenizer(
- input_text,
- return_tensors="np",
- add_special_tokens=True,
- padding=True,
- max_length=self.args.src_length,
- truncation=True,
- truncation_side="left",
- )
- inputs = self.tokenizer.build_inputs_for_generation(inputs, max_gen_length=self.args.tgt_length)
- inputs_tensor = {}
- for key, value in inputs.items():
- inputs_tensor[key] = paddle.to_tensor(value)
- return inputs_tensor
-
- def infer(self, inputs):
- result = self.model.generate(
- **inputs,
- decode_strategy="sampling",
- top_k=1,
- max_length=self.args.tgt_length,
- eos_token_id=self.tokenizer.eop_token_id,
- pad_token_id=self.tokenizer.pad_token_id,
- )
- result = result[0]
- return result
-
- def postprocess(self, infer_data):
- result = []
- for x in infer_data.tolist():
- res = self.tokenizer.decode(x, skip_special_tokens=True)
- result.append(res)
- out_dict = {"result": result}
- return out_dict
-
- def predict(self, texts):
- input_map = self.preprocess(texts)
- infer_result = self.infer(input_map)
- output = self.postprocess(infer_result)
- return output
-
-
-if __name__ == "__main__":
- args = parse_arguments()
- predictor = Predictor(args)
- all_texts = [
- "答案:年基准利率4.35%,上下文:从实际看,贷款的基本条件是: 一是中国大陆居民,年龄在60岁以下; 二是有稳定的住址和工作或经营地点; 三是有稳定的收入来源; 四是无不良信用记录,贷款用途不能作为炒股,赌博等行为; 五是具有完全民事行为能力。在已知答案的前提下,问题:",
- "答案:U系列,上下文:U系列是最好的,采用国际顶尖技术(由格力自主研发)双级变频压缩机,提高压缩机运转效率,制冷制热能力更强劲;1赫兹变频技术,使空调相当于一个15 W电灯泡,更加节能省电;送风面积广,风力大;生态风,净化空气。非常不错,现在国美在做活动,可以了解一下。在已知答案的前提下,问题:",
- ]
- batch_texts = batchfy_text(all_texts, args.batch_size)
- for bs, texts in enumerate(batch_texts):
- outputs = predictor.predict(texts)
- for text, result in zip(texts, outputs["result"]):
- print("{}\n{}".format(text, result))
-
- if args.merge_tensor_parallel_path is not None:
- predictor.model.save_pretrained(
- save_dir=args.merge_tensor_parallel_path,
- merge_tensor_parallel=True,
- )
- predictor.tokenizer.save_pretrained(args.merge_tensor_parallel_path)
diff --git a/llm/glm/utils.py b/llm/glm/utils.py
deleted file mode 100644
index d3b9e8919aa7..000000000000
--- a/llm/glm/utils.py
+++ /dev/null
@@ -1,79 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from typing import Any, Dict, List, Optional, Tuple, Union
-
-import numpy as np
-import paddle
-import paddle.nn as nn
-
-from paddlenlp.trainer import Trainer
-
-
-class GLMTrainer(Trainer):
- def __init__(self, do_generation: bool, **kwargs):
- super().__init__(**kwargs)
- self.do_generation = do_generation
-
- def prediction_step(
- self,
- model: nn.Layer,
- inputs: Dict[str, Union[paddle.Tensor, Any]],
- prediction_loss_only: bool,
- ignore_keys: Optional[List[str]] = None,
- ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]:
-
- if not self.do_generation:
- return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys)
-
- model.eval()
- with paddle.no_grad():
- tokens = model.generate(
- input_ids=inputs["input_ids"],
- position_ids=inputs["position_ids"],
- attention_mask=inputs["attention_mask"],
- decode_strategy="sampling",
- top_k=1,
- repetition_penalty=2.0,
- bos_token_id=self.tokenizer.sop_token_id,
- eos_token_id=self.tokenizer.eop_token_id,
- pad_token_id=self.tokenizer.pad_token_id,
- )[0]
- all_preds = []
- for pred_tokens in tokens:
- all_preds.append(pred_tokens[pred_tokens != self.tokenizer.pad_token_id].tolist())
- max_pred_length = max([len(x) for x in all_preds])
- for index, preds in enumerate(all_preds):
- all_preds[index] = preds + [-100] * (max_pred_length - len(preds))
-
- all_labels = []
- for label, mask in zip(inputs["labels"].numpy(), inputs["loss_mask"].numpy()):
- label = label[mask.astype("bool")]
- label = [x for x in label[label != self.tokenizer.pad_token_id]]
- all_labels.append(label)
- max_label_length = max([len(x) for x in all_labels])
- for index, labels in enumerate(all_labels):
- all_labels[index] = labels + [-100] * (max_label_length - len(labels))
-
- return (None, paddle.to_tensor(all_preds), paddle.to_tensor(all_labels))
-
- def log(self, logs: Dict[str, float], **kwargs) -> None:
-
- if self.state.epoch is not None:
- logs["epoch"] = round(self.state.epoch, 4)
-
- if "eval_loss" in logs:
- logs["eval_ppl"] = np.exp(logs["eval_loss"])
- output = {**logs, **{"step": self.state.global_step}}
- self.state.log_history.append(output)
- self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs, **kwargs)
diff --git a/llm/gpt-3/README.md b/llm/gpt-3/README.md
deleted file mode 100644
index a0c387158d43..000000000000
--- a/llm/gpt-3/README.md
+++ /dev/null
@@ -1,205 +0,0 @@
-# GPT
-
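-## 1. Model Introduction
-
-GPT-3 is a pretrained language model that can mimic human language reasoning and expression. It is very large, with 175 billion parameters, which gives it strong language understanding and generation capabilities. The tasks it can perform include text generation, text summarization, question answering, translation, and reading comprehension. GPT-3 was pretrained on a large corpus, including a large amount of text from the internet, and learns to generate and understand human language by analyzing that text. GPT-3 has been highly influential in natural language processing; because it can simulate human dialogue and generate text, it is widely used in areas such as intelligent customer service, natural language processing, and game design.
-
-## 2. Pretraining
-
-For how to prepare the pretraining data, see [here](../../model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md)
-
-To make it easy to run and test this model, this project provides preprocessed training samples covering 100k documents: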
-```shell
-wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
-wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
-```
-
-Put all the preprocessed files into a single folder for use during training:
-
-```
-mkdir data
-mv gpt_en_dataset_300m_ids.npy ./data
-mv gpt_en_dataset_300m_idx.npz ./data
-```
-
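-Notes:
-1. Training requires the develop version of Paddle, and any missing wheel packages must be installed, e.g. `pip install tool_helpers visualdl==2.5.3`.
-2. `use_flash_attention` should only be enabled on A100 machines. A CUDA 11.8 environment is recommended.
-
-Use the script below to continue training from gpt2-medium-en.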
-```shell
-task_name="gpt3_hybrid"
-export PYTHONPATH="../../PaddleNLP/"
-export FLAGS_cudnn_deterministic=True
-log_dir="log"
-rm -rf $log_dir
-
-python -u -m paddle.distributed.launch \
- --gpus "0,1,2,3,4,5,6,7" \
- --log_dir ${log_dir} \
- run_pretrain.py \
- --model_name_or_path gpt2-medium-en \
- --tokenizer_name_or_path gpt2-medium-en \
- --input_dir "./data" \
- --output_dir "output/$task_name" \
- --split 949,50,1 \
- --max_seq_length 1024 \
- --per_device_train_batch_size 1 \
- --per_device_eval_batch_size 1 \
- --tensor_parallel_degree 1 \
- --pipeline_parallel_degree 1 \
- --sequence_parallel 0 \
- --fuse_attention_qkv 0 \
- --use_flash_attention 0 \
- --fp16 \
- --fp16_opt_level "O2" \
- --scale_loss 1024 \
- --learning_rate 0.00001 \
- --min_learning_rate 0.000005 \
- --max_steps 10000 \
- --save_steps 5000 \
- --weight_decay 0.01 \
- --warmup_ratio 0.01 \
- --max_grad_norm 1.0 \
- --logging_steps 1 \
- --continue_training \
- --dataloader_num_workers 1 \
- --sharding "stage2" \
- --eval_steps 1000 \
- --report_to "visualdl" \
- --disable_tqdm true \
- --recompute 1 \
- --gradient_accumulation_steps 2 \
- --do_train \
- --do_eval \
- --device "gpu"
-```
-
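-The parameters are described below:
-
-- `model_name_or_path`: Built-in name of the pretrained model, or the directory containing the model. Defaults to `gpt2-medium-en`.
-- `tokenizer_name_or_path`: Name of the tokenizer, or the directory containing the tokenizer. Defaults to `gpt2-medium-en`.
-- `input_dir`: Directory containing the pretraining data.
-- `output_dir`: Directory where model parameters and logs are saved.
-- `split`: Split ratio of the pretraining data. Defaults to 949,50,1.
-- `max_seq_length`: Maximum sequence length for pretraining. Defaults to 1024.
-- `per_device_train_batch_size`: Training batch size per device. Defaults to 1.
-- `per_device_eval_batch_size`: Evaluation batch size per device. Defaults to 1.
-- `tensor_parallel_degree`: Degree of tensor (model) parallelism.
-- `pipeline_parallel_degree`: Degree of pipeline parallelism.
-- `sequence_parallel`: Sequence parallelism setting. Sequence parallelism can only be used when `tensor_parallel_degree>1`. Note: sequence parallelism is not recommended when the model, the batch_size, and the sequence_length are all small.
-- `fuse_attention_qkv`: Use a fused qkv linear layer in MultiHeadAttention.
-- `use_flash_attention`: Use flash attention. Note that this should only be enabled on A100 machines; a CUDA 11.8 environment is recommended.
-- `fp16`: Use float16 precision for model training and inference.
-- `fp16_opt_level`: float16 training mode. `O2` means pure float16 training.
-- `scale_loss`: Loss scaling factor for float16 training. 1024 is recommended for fine-tuning; a larger value is recommended for pretraining.
-- `learning_rate`: Learning rate for parameter updates.
-- `min_learning_rate`: Minimum learning rate.
-- `max_steps`: Number of training steps.
-- `save_steps`: Number of steps between checkpoint saves.
-- `weight_decay`: Weight decay coefficient.
-- `warmup_ratio`: Warmup ratio.
-- `max_grad_norm`: Gradient clipping coefficient.
-- `logging_steps`: Number of steps between training log outputs.
-- `continue_training`: Whether to continue training the model.
-- `dataloader_num_workers`: Number of dataloader worker processes.
-- `sharding`: Sharding strategy; one of stage1, stage2, stage3.
-- `eval_steps`: Number of steps between model evaluations.
-- `recompute`: Use the recompute strategy; enabling it reduces GPU memory usage during training.
-- `gradient_accumulation_steps`: Number of steps over which gradients are accumulated; can be used to enlarge the batch size. The actual batch_size = per_device_train_batch_size * gradient_accumulation_steps.
-- `do_train`: Whether to train the model.
-- `do_eval`: Whether to evaluate the model.
-- `lora`: Whether to use the LoRA technique.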
-
-