# Support lower memory cards (#9804)

Open · wants to merge 6 commits into base `develop`.

8 changes: 4 additions & 4 deletions README.md
@@ -166,7 +166,7 @@

### Requirements

* python >= 3.8
- * paddlepaddle >= 3.0.0b0
+ * paddlepaddle >= 3.0.0rc0

If you have not yet installed PaddlePaddle, please refer to the [PaddlePaddle website](https://www.paddlepaddle.org.cn/) to install it.

@@ -211,7 +211,7 @@
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
cd .. # change folder to PaddleNLP/llm
# To use use_fused_rms_norm=true, first install fused_ln from slm/model_zoo/gpt-3/external_ops
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json --use_fused_rms_norm false
+python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
```

### LLM SFT Fine-Tuning

@@ -221,7 +221,7 @@
git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP # skip if PaddleNLP has already been cloned or downloaded
mkdir -p llm/data && cd llm/data
wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
cd .. # change folder to PaddleNLP/llm
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
```

For more of the end-to-end LLM workflow, see the [PaddleNLP LLM toolkit](./llm) introduction.

@@ -236,7 +236,7 @@
dataset = load_dataset("ZHUI/alpaca_demo", split="train")
training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", device="gpu")
trainer = SFTTrainer(
    args=training_args,
-   model="Qwen/Qwen2.5-0.5B",
+   model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
71 changes: 56 additions & 15 deletions llm/README.md
@@ -37,6 +37,11 @@

## 🚀 Quick Start 🚀

Before you start, you can install the latest develop build of PaddleNLP:
```shell
pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
```
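
To confirm which build the install picked up, you can print the package version (PaddleNLP exposes a standard `__version__` attribute):

```shell
# Print the installed PaddleNLP version to verify the develop build.
python -c "import paddlenlp; print(paddlenlp.__version__)"
```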

### 1. Pre-training

PaddleNLP builds PaddlePaddle's 4D parallelism strategies into the Trainer API, so users only need to change the Trainer configuration to switch distributed strategies. The LLM toolkit currently provides pre-training for [LLaMA/LLaMA2/LLaMA3](./config/llama), [GPT-3](./config/gpt-3), [Qwen](./config/qwen), [Baichuan/Baichuan2](./config/baichuan), [Mixtral](./config/mixtral) and other models, with support for more models being added continuously.

@@ -73,19 +78,30 @@
mkdir data
Expand Down Expand Up @@ -73,19 +78,30 @@ mkdir data
mv llama_openwebtext_100k.bin ./data
mv llama_openwebtext_100k.idx ./data
```
Single-GPU training:
```shell
# trains within 16 GB of GPU memory
python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
```
- This configuration trains within 16 GB of GPU memory; enabling use_flash_attention, use_fused_rms_norm, and recompute saves further memory.
- If those switches cannot be used, or memory is still insufficient, enable `offload_optim`, which brings usage down to about 11 GB: `python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json --offload_optim 1`. A combined sketch is shown below.
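
Putting the memory-saving switches together, a sketch of a low-memory single-GPU launch; it assumes all four options can be overridden from the command line in the same way as `--offload_optim` above, and that `use_fused_rms_norm` additionally requires the custom operators described in the notes below:

```shell
# Sketch: flags assumed to be CLI-overridable on top of the JSON config.
python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json \
    --use_flash_attention true \
    --use_fused_rms_norm true \
    --recompute true \
    --offload_optim 1
```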

High-performance multi-GPU and multi-node training:
```shell
# compile the custom operators (optional)
cd ../slm/model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -

-# pre-training reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# multi-GPU pre-training reference:
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# multi-node training reference (uses about 45 GB of GPU memory):
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" --master=192.168.1.1:8090 --nnodes=2 run_pretrain.py ./config/llama/pretrain_argument.json
```
- For more detailed distributed launch commands, see [here](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.6/api/paddle/distributed/launch_cn.html#launch).

Notes:

1. Training on the paddle develop build is recommended; you may need to install missing wheels such as `pip install fast_dataindex visualdl==2.5.3`.
-2. `use_flash_attention` must be enabled on A100 machines; a CUDA 11.8 environment is recommended.
+2. `use_flash_attention` must be enabled on A100-or-newer machines; a CUDA 11.8-or-newer environment is recommended.
3. `use_fused_rms_norm` requires installing the custom operators. If the operator still cannot be found after installation, additionally set PYTHONPATH (see the sketch after this list).
4. `continue_training` means training resumes from an existing pre-trained model. A 7B model starts from a loss of roughly 2.xx, whereas a randomly initialized model's loss descends from about 11.x.
5. For multi-node training, if every machine reads the training data from the same location (for example a mounted shared disk), pass `--share_folder true` so the global rank-0 card builds the cached data. Otherwise, by default the rank-0 card of each machine builds its cache independently, …
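
For note 3, a minimal sketch of exposing the compiled custom operators via PYTHONPATH, assuming they were built under `slm/model_zoo/gpt-3/external_ops` as shown earlier (adjust the relative path to your checkout):

```shell
# Assumed build location of fused_ln and the other custom ops; adjust to your tree.
export PYTHONPATH=$PYTHONPATH:../slm/model_zoo/gpt-3/external_ops
python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json --use_fused_rms_norm true
```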
@@ -125,29 +141,45 @@
PaddleNLP supports SFT, PEFT and other fine-tuning strategies for multiple mainstream LLMs, providing a unified …
For convenient testing, we also provide the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) demo dataset, which can be used directly:

```shell
# run in the PaddleNLP/llm directory
wget https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz
tar -xvf alpaca_demo.gz
```

#### 2.2 Full-Parameter Fine-Tuning: SFT

Single GPU:
```bash
# needs about 12 GB of GPU memory
python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
# Single-GPU best practice within 16 GB of GPU memory; see the switches enabled in:
# ./config/qwen/sft_argument_0p5b_best.json
```

Multi-GPU:
```bash
-# SFT launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+# SFT launch command reference; needs about 45 GB of GPU memory
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_finetune.py ./config/qwen/sft_argument.json
```

#### 2.3 LoRA

LoRA launch command reference:
```bash
-# LoRA launch command reference
-python run_finetune.py ./config/llama/lora_argument.json
+# needs about 9 GB of GPU memory
+python run_finetune.py ./config/qwen/lora_argument_0p5b.json
+# needs about 29 GB of GPU memory
+python run_finetune.py ./config/qwen/lora_argument.json
```
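
Before deployment, LoRA adapters are typically merged back into the base weights. A sketch under stated assumptions: the script path `tools/merge_lora_params.py` and its flags are assumptions here, not a confirmed interface, so check the [LLM fine-tuning tutorial](./docs/finetune.md) for the supported merge utility:

```shell
# Hypothetical merge step; script name and flags are assumptions.
python tools/merge_lora_params.py \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --lora_path ./checkpoints/lora_ckpts \
    --output_path ./checkpoints/lora_merged
```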

#### 2.4 Prefix Tuning

Prefix Tuning launch command reference:
```bash
-# Prefix Tuning launch command reference
-python run_finetune.py ./config/llama/pt_argument.json
+# needs about 10 GB of GPU memory
+python run_finetune.py ./config/qwen/pt_argument_0p5b.json
+# needs about 30 GB of GPU memory
+python run_finetune.py ./config/qwen/pt_argument.json
```

Besides LoRA and Prefix Tuning, various other fine-tuning algorithms are supported, including LoKr, VeRA, MoRA, ReFT, rsLoRA, LoRA+, PiSSA, and MoSLoRA. For more fine-tuning documentation, training details, and results, see the [LLM fine-tuning tutorial](./docs/finetune.md).

@@ -192,18 +224,26 @@
tar -zxvf ultrafeedback_binarized.tar.gz

##### Full-Parameter DPO


```bash
-# DPO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+# DPO launch command reference: 8-GPU training, needs roughly 40 GB of GPU memory
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+
+# single-GPU training, needs roughly 26 GB of GPU memory
+python -u ./alignment/dpo/run_dpo.py ./config/qwen/dpo_argument_0p5b.json
```
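
DPO consumes preference pairs from `./data/train.jsonl`. A hypothetical one-record sketch; the `src`/`tgt`/`response`/`sort` field names follow the chosen-versus-rejected pattern and are assumptions here, so consult the [DPO documentation](./docs/dpo.md) for the authoritative schema:

```json
{"src": ["Write a haiku about autumn."], "tgt": [], "response": ["Crisp leaves drift and fall, / golden light on quiet paths, / the year exhales slow.", "Autumn is a season that comes after summer."], "sort": [1, 0]}
```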

##### LoRA DPO

```bash
# DPO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json

# needs about 52 GB of GPU memory
python -u ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
```

For more DPO technical details and usage instructions, see the [DPO documentation](./docs/dpo.md).

#### 3.2 KTO

@@ -240,13 +280,13 @@
tar -zxvf ultrafeedback_binarized.tar.gz

```bash
# KTO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
```
##### LoRA KTO

```bash
# KTO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
```

#### 3.3 RLHF
@@ -362,7 +402,8 @@
python ./predict/predictor.py --model_name_or_path ./inference --inference_model

Service-based deployment script:

```shell
# single GPU; multi-GPU inference can be launched with paddle.distributed.launch
python ./predict/flask_server.py \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--port 8010 \
    # ...
```
2 changes: 1 addition & 1 deletion llm/config/llama/dpo_argument.json
@@ -1,5 +1,5 @@
{
"model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
"train_dataset_path": "./data/train.jsonl",
"dev_dataset_path": "./data/dev.jsonl",
"output_dir": "./checkpoints/dpo_ckpts",
2 changes: 1 addition & 1 deletion llm/config/llama/pretrain_argument.json
@@ -28,7 +28,7 @@
"warmup_ratio": 0.01,
"max_grad_norm": 1.0,
"dataloader_num_workers": 1,
"continue_training": 1,
"continue_training": 0,
"do_train": true,
"do_eval": true,
"do_predict": true,
39 changes: 39 additions & 0 deletions llm/config/qwen/dpo_argument_0p5b.json
@@ -0,0 +1,39 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
"train_dataset_path": "./data/train.jsonl",
"dev_dataset_path": "./data/dev.jsonl",
"output_dir": "./checkpoints/dpo_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 8,
"per_device_eval_batch_size": 1,
"num_train_epochs": 1,
"max_steps": 100,
"learning_rate": 1e-06,
"warmup_steps": 10,
"logging_steps": 1,
"evaluation_strategy": "steps",
"save_strategy": "steps",
"eval_steps": 100,
"save_steps": 500,
"max_seq_len": 2048,
"max_prompt_len": 1024,
"fp16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"tensor_parallel_degree": 1,
"sharding": "stage1",
"use_flash_attention": false,
"flash_mask": false,
"recompute": true,
"recompute_granularity": "full",
"benchmark": false,
"unified_checkpoint": true,
"autotuner_benchmark":false,
"beta": 0.1,
"loss_type": "sigmoid",
"greedy_zero_padding": false,
"label_smoothing": 0.0
}
```
2 changes: 1 addition & 1 deletion llm/config/qwen/lora_argument.json
@@ -4,7 +4,7 @@
"output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
"per_device_eval_batch_size": 4,
"eval_accumulation_steps":16,
"num_train_epochs": 3,
"learning_rate": 3e-04,
34 changes: 34 additions & 0 deletions llm/config/qwen/lora_argument_0p5b.json
@@ -0,0 +1,34 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
"dataset_name_or_path": "./data",
"output_dir": "./checkpoints/lora_ckpts",
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 8,
"per_device_eval_batch_size": 2,
"eval_accumulation_steps": 32,
"num_train_epochs": 3,
"learning_rate": 3e-04,
"warmup_steps": 30,
"logging_steps": 1,
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
"fp16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"eval_with_do_generation": false,
"metric_for_best_model": "accuracy",
"recompute": true,
"save_total_limit": 1,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"lora": true,
"unified_checkpoint": true,
"zero_padding": false,
"use_flash_attention": false,
"pissa": false
}
```
40 changes: 40 additions & 0 deletions llm/config/qwen/pretrain_argument_0p5b.json
@@ -0,0 +1,40 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B",
"tokenizer_name_or_path": "Qwen/Qwen2.5-0.5B",
"input_dir": "./data",
"output_dir": "./checkpoints/pretrain_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 1,
"per_device_eval_batch_size": 2,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"sharding": "stage2",
"virtual_pp_degree": 1,
"sequence_parallel": 0,
"use_flash_attention": false,
"use_fused_rms_norm": false,
"max_seq_length": 1024,
"learning_rate": 3e-05,
"min_learning_rate": 3e-06,
"warmup_steps": 30,
"logging_steps": 1,
"max_steps": 10000,
"save_steps": 5000,
"eval_steps": 1000,
"weight_decay": 0.01,
"fp16": true,
"fp16_opt_level": "O2",
"warmup_ratio": 0.01,
"max_grad_norm": 1.0,
"dataloader_num_workers": 1,
"continue_training": 0,
"do_train": true,
"do_eval": true,
"do_predict": true,
"disable_tqdm": true,
"recompute": false,
"distributed_dataloader": 1,
"recompute_granularity": "full",
"unified_checkpoint": true,
"save_total_limit": 2
}
```
4 changes: 2 additions & 2 deletions llm/config/qwen/pt_argument.json
@@ -4,8 +4,8 @@
"output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 4,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
"per_device_eval_batch_size": 4,
"eval_accumulation_steps": 32,
"num_train_epochs": 3,
"learning_rate": 3e-02,
"warmup_steps": 30,
31 changes: 31 additions & 0 deletions llm/config/qwen/pt_argument_0p5b.json
@@ -0,0 +1,31 @@
```json
{
"model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
"dataset_name_or_path": "./data",
"output_dir": "./checkpoints/pt_ckpts",
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 8,
"per_device_eval_batch_size": 4,
"eval_accumulation_steps": 32,
"num_train_epochs": 3,
"learning_rate": 3e-02,
"warmup_steps": 30,
"logging_steps": 1,
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
"fp16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"eval_with_do_generation": false,
"metric_for_best_model": "accuracy",
"recompute": true,
"save_total_limit": 1,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"prefix_tuning": true,
"use_flash_attention": false
}
```