# Support lower memory cards (#9804)

Open · wants to merge 6 commits into base `develop`.

8 changes: 4 additions & 4 deletions README.md
@@ -166,7 +166,7 @@

### Requirements

* python >= 3.8
- * paddlepaddle >= 3.0.0b0
+ * paddlepaddle >= 3.0.0rc0

If you have not yet installed PaddlePaddle, please refer to the [PaddlePaddle website](https://www.paddlepaddle.org.cn/) to install it.

@@ -211,7 +211,7 @@
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
cd .. # change folder to PaddleNLP/llm
# To use use_fused_rms_norm=true, first install fused_ln from slm/model_zoo/gpt-3/external_ops
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json --use_fused_rms_norm false
+python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
```

### LLM SFT Fine-Tuning

@@ -221,7 +221,7 @@
git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP # skip if PaddleNLP has already been cloned or downloaded
mkdir -p llm/data && cd llm/data
wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
cd .. # change folder to PaddleNLP/llm
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
```

For more of the end-to-end LLM workflow, see the [PaddleNLP LLM toolkit](./llm) introduction.

@@ -236,7 +236,7 @@
dataset = load_dataset("ZHUI/alpaca_demo", split="train")
training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", device="gpu")
trainer = SFTTrainer(
    args=training_args,
-   model="Qwen/Qwen2.5-0.5B",
+   model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
71 changes: 56 additions & 15 deletions llm/README.md
@@ -37,6 +37,11 @@

## 🚀 Quick Start 🚀

Before you start, you can install the latest develop build of PaddleNLP:
```shell
pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
```
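
To confirm which build the install picked up, you can print the package version (PaddleNLP exposes a standard `__version__` attribute):

```shell
# Print the installed PaddleNLP version to verify the develop build.
python -c "import paddlenlp; print(paddlenlp.__version__)"
```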

### 1. Pre-training

PaddleNLP builds PaddlePaddle's 4D parallelism strategies into the Trainer API, so users only need to change the Trainer configuration to switch distributed strategies. The LLM toolkit currently provides pre-training for [LLaMA/LLaMA2/LLaMA3](./config/llama), [GPT-3](./config/gpt-3), [Qwen](./config/qwen), [Baichuan/Baichuan2](./config/baichuan), [Mixtral](./config/mixtral) and other models, with support for more models being added continuously.

@@ -73,19 +78,30 @@
mkdir data
Expand Down Expand Up @@ -73,19 +78,30 @@ mkdir data
mv llama_openwebtext_100k.bin ./data
mv llama_openwebtext_100k.idx ./data
```
Single-GPU training:
```shell
# trains within 16 GB of GPU memory
python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
```
- This configuration trains within 16 GB of GPU memory; enabling use_flash_attention, use_fused_rms_norm, and recompute saves further memory.
- If those switches cannot be used, or memory is still insufficient, enable `offload_optim`, which brings usage down to about 11 GB: `python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json --offload_optim 1`. A combined sketch is shown below.
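
Putting the memory-saving switches together, a sketch of a low-memory single-GPU launch; it assumes all four options can be overridden from the command line in the same way as `--offload_optim` above, and that `use_fused_rms_norm` additionally requires the custom operators described in the notes below:

```shell
# Sketch: flags assumed to be CLI-overridable on top of the JSON config.
python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json \
    --use_flash_attention true \
    --use_fused_rms_norm true \
    --recompute true \
    --offload_optim 1
```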

High-performance multi-GPU and multi-node training:
```shell
# compile the custom operators (optional)
cd ../slm/model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -

-# pre-training reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# multi-GPU pre-training reference:
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# multi-node training reference (uses about 45 GB of GPU memory):
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" --master=192.168.1.1:8090 --nnodes=2 run_pretrain.py ./config/llama/pretrain_argument.json
```
- For more detailed distributed launch commands, see [here](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.6/api/paddle/distributed/launch_cn.html#launch).

Notes:

1. Training on the paddle develop build is recommended; you may need to install missing wheels such as `pip install fast_dataindex visualdl==2.5.3`.
-2. `use_flash_attention` must be enabled on A100 machines; a CUDA 11.8 environment is recommended.
+2. `use_flash_attention` must be enabled on A100-or-newer machines; a CUDA 11.8-or-newer environment is recommended.
3. `use_fused_rms_norm` requires installing the custom operators. If the operator still cannot be found after installation, additionally set PYTHONPATH (see the sketch after this list).
4. `continue_training` means training resumes from an existing pre-trained model. A 7B model starts from a loss of roughly 2.xx, whereas a randomly initialized model's loss descends from about 11.x.
5. For multi-node training, if every machine reads the training data from the same location (for example a mounted shared disk), pass `--share_folder true` so the global rank-0 card builds the cached data. Otherwise, by default the rank-0 card of each machine builds its cache independently, …
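
For note 3, a minimal sketch of exposing the compiled custom operators via PYTHONPATH, assuming they were built under `slm/model_zoo/gpt-3/external_ops` as shown earlier (adjust the relative path to your checkout):

```shell
# Assumed build location of fused_ln and the other custom ops; adjust to your tree.
export PYTHONPATH=$PYTHONPATH:../slm/model_zoo/gpt-3/external_ops
python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json --use_fused_rms_norm true
```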
@@ -125,29 +141,45 @@
PaddleNLP supports SFT, PEFT and other fine-tuning strategies for multiple mainstream LLMs, providing a unified …
For convenient testing, we also provide the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) demo dataset, which can be used directly:

```shell
# run in the PaddleNLP/llm directory
wget https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz
tar -xvf alpaca_demo.gz
```

#### 2.2 Full-Parameter Fine-Tuning: SFT

Single GPU:
```bash
# needs about 12 GB of GPU memory
python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
# Single-GPU best practice within 16 GB of GPU memory; see the switches enabled in:
# ./config/qwen/sft_argument_0p5b_best.json
```

Multi-GPU:
```bash
-# SFT launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+# SFT launch command reference; needs about 45 GB of GPU memory
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_finetune.py ./config/qwen/sft_argument.json
```

#### 2.3 LoRA

LoRA launch command reference:
```bash
-# LoRA launch command reference
-python run_finetune.py ./config/llama/lora_argument.json
+# needs about 9 GB of GPU memory
+python run_finetune.py ./config/qwen/lora_argument_0p5b.json
+# needs about 29 GB of GPU memory
+python run_finetune.py ./config/qwen/lora_argument.json
```
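
Before deployment, LoRA adapters are typically merged back into the base weights. A sketch under stated assumptions: the script path `tools/merge_lora_params.py` and its flags are assumptions here, not a confirmed interface, so check the [LLM fine-tuning tutorial](./docs/finetune.md) for the supported merge utility:

```shell
# Hypothetical merge step; script name and flags are assumptions.
python tools/merge_lora_params.py \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --lora_path ./checkpoints/lora_ckpts \
    --output_path ./checkpoints/lora_merged
```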

#### 2.4 Prefix Tuning

Prefix Tuning launch command reference:
```bash
-# Prefix Tuning launch command reference
-python run_finetune.py ./config/llama/pt_argument.json
+# needs about 10 GB of GPU memory
+python run_finetune.py ./config/qwen/pt_argument_0p5b.json
+# needs about 30 GB of GPU memory
+python run_finetune.py ./config/qwen/pt_argument.json
```

Besides LoRA and Prefix Tuning, various other fine-tuning algorithms are supported, including LoKr, VeRA, MoRA, ReFT, rsLoRA, LoRA+, PiSSA, and MoSLoRA. For more fine-tuning documentation, training details, and results, see the [LLM fine-tuning tutorial](./docs/finetune.md).

@@ -192,18 +224,26 @@
tar -zxvf ultrafeedback_binarized.tar.gz

##### Full-Parameter DPO


```bash
-# DPO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+# DPO launch command reference: 8-GPU training, needs roughly 40 GB of GPU memory
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+
+# single-GPU training, needs roughly 26 GB of GPU memory
+python -u ./alignment/dpo/run_dpo.py ./config/qwen/dpo_argument_0p5b.json
```
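
DPO consumes preference pairs from `./data/train.jsonl`. A hypothetical one-record sketch; the `src`/`tgt`/`response`/`sort` field names follow the chosen-versus-rejected pattern and are assumptions here, so consult the [DPO documentation](./docs/dpo.md) for the authoritative schema:

```json
{"src": ["Write a haiku about autumn."], "tgt": [], "response": ["Crisp leaves drift and fall, / golden light on quiet paths, / the year exhales slow.", "Autumn is a season that comes after summer."], "sort": [1, 0]}
```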

##### LoRA DPO

```bash
# DPO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json

# needs about 52 GB of GPU memory
python -u ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
```

For more DPO technical details and usage instructions, see the [DPO documentation](./docs/dpo.md).

#### 3.2 KTO

@@ -240,13 +280,13 @@
tar -zxvf ultrafeedback_binarized.tar.gz

```bash
# KTO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
```
##### LoRA KTO

```bash
# KTO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
```

#### 3.3 RLHF
@@ -362,7 +402,8 @@
python ./predict/predictor.py --model_name_or_path ./inference --inference_model

Service-based deployment script:

```shell
# single GPU; multi-GPU inference can be launched with paddle.distributed.launch
python ./predict/flask_server.py \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--port 8010 \
    # ...
```
2 changes: 1 addition & 1 deletion llm/config/llama/dpo_argument.json
@@ -1,5 +1,5 @@
{
"model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
"train_dataset_path": "./data/train.jsonl",
"dev_dataset_path": "./data/dev.jsonl",
"output_dir": "./checkpoints/dpo_ckpts",
2 changes: 1 addition & 1 deletion llm/config/llama/pretrain_argument.json
@@ -28,7 +28,7 @@
"warmup_ratio": 0.01,
"max_grad_norm": 1.0,
"dataloader_num_workers": 1,
"continue_training": 1,
"continue_training": 0,
"do_train": true,
"do_eval": true,
"do_predict": true,
39 changes: 39 additions & 0 deletions llm/config/qwen/dpo_argument_0p5b.json
@@ -0,0 +1,39 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
"train_dataset_path": "./data/train.jsonl",
"dev_dataset_path": "./data/dev.jsonl",
"output_dir": "./checkpoints/dpo_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 8,
"per_device_eval_batch_size": 1,
"num_train_epochs": 1,
"max_steps": 100,
"learning_rate": 1e-06,
"warmup_steps": 10,
"logging_steps": 1,
"evaluation_strategy": "steps",
"save_strategy": "steps",
"eval_steps": 100,
"save_steps": 500,
"max_seq_len": 2048,
"max_prompt_len": 1024,
"fp16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"tensor_parallel_degree": 1,
"sharding": "stage1",
"use_flash_attention": false,
"flash_mask": false,
"recompute": true,
"recompute_granularity": "full",
"benchmark": false,
"unified_checkpoint": true,
"autotuner_benchmark":false,
"beta": 0.1,
"loss_type": "sigmoid",
"greedy_zero_padding": false,
"label_smoothing": 0.0
}
```
2 changes: 1 addition & 1 deletion llm/config/qwen/lora_argument.json
@@ -4,7 +4,7 @@
"output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
"per_device_eval_batch_size": 4,
"eval_accumulation_steps":16,
"num_train_epochs": 3,
"learning_rate": 3e-04,
34 changes: 34 additions & 0 deletions llm/config/qwen/lora_argument_0p5b.json
@@ -0,0 +1,34 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
"dataset_name_or_path": "./data",
"output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 8,
"per_device_eval_batch_size": 2,
"eval_accumulation_steps": 32,
"num_train_epochs": 3,
"learning_rate": 3e-04,
"warmup_steps": 30,
"logging_steps": 1,
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
"fp16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"eval_with_do_generation": false,
"metric_for_best_model": "accuracy",
"recompute": true,
"save_total_limit": 1,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"lora": true,
"unified_checkpoint": true,
"zero_padding": false,
"use_flash_attention": false,
"pissa": false
}
```
40 changes: 40 additions & 0 deletions llm/config/qwen/pretrain_argument_0p5b.json
@@ -0,0 +1,40 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B",
"tokenizer_name_or_path": "Qwen/Qwen2.5-0.5B",
"input_dir": "./data",
"output_dir": "./checkpoints/pretrain_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 1,
"per_device_eval_batch_size": 2,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"sharding": "stage2",
"virtual_pp_degree": 1,
"sequence_parallel": 0,
"use_flash_attention": false,
"use_fused_rms_norm": false,
"max_seq_length": 1024,
"learning_rate": 3e-05,
"min_learning_rate": 3e-06,
"warmup_steps": 30,
"logging_steps": 1,
"max_steps": 10000,
"save_steps": 5000,
"eval_steps": 1000,
"weight_decay": 0.01,
"fp16": true,
"fp16_opt_level": "O2",
"warmup_ratio": 0.01,
"max_grad_norm": 1.0,
"dataloader_num_workers": 1,
"continue_training": 0,
"do_train": true,
"do_eval": true,
"do_predict": true,
"disable_tqdm": true,
"recompute": false,
"distributed_dataloader": 1,
"recompute_granularity": "full",
"unified_checkpoint": true,
"save_total_limit": 2
}
```
4 changes: 2 additions & 2 deletions llm/config/qwen/pt_argument.json
@@ -4,8 +4,8 @@
"output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
"per_device_eval_batch_size": 4,
"eval_accumulation_steps": 32,
"num_train_epochs": 3,
"learning_rate": 3e-02,
"warmup_steps": 30,
31 changes: 31 additions & 0 deletions llm/config/qwen/pt_argument_0p5b.json
@@ -0,0 +1,31 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
"dataset_name_or_path": "./data",
"output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 8,
"per_device_eval_batch_size": 4,
"eval_accumulation_steps": 32,
"num_train_epochs": 3,
"learning_rate": 3e-02,
"warmup_steps": 30,
"logging_steps": 1,
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
"fp16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"eval_with_do_generation": false,
"metric_for_best_model": "accuracy",
"recompute": true,
"save_total_limit": 1,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"prefix_tuning": true,
"use_flash_attention": false
}
```