
Full-parameter fine-tuning of baichuan2-7b-chat #4316

Closed
1 task done
leezy18 opened this issue Jun 17, 2024 · 2 comments · May be fixed by #6010
Labels
solved This problem has been already solved

Comments

@leezy18

leezy18 commented Jun 17, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.2.dev0
  • Platform: Linux-3.10.0-1160.88.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.19
  • PyTorch version: 2.3.0+cu121 (GPU)
  • Transformers version: 4.41.2
  • Datasets version: 2.20.0
  • Accelerate version: 0.31.0
  • PEFT version: 0.11.1
  • TRL version: 0.9.4
  • GPU type: NVIDIA RTX A6000 (3 GPUs)
  • DeepSpeed version: 0.14.0
  • vLLM version: 0.4.3

Reproduction

My llama3_full_sft.yaml:

### model
model_name_or_path: /datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/Baichuan2-7B-Chat

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: entity
template: baichuan2
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/baichuan2-7b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

Only the parameter in the train section was changed to bf16.

Running the command
CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train examples/full_multi_gpu/llama3_full_sft.yaml

produces the following error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/launcher.py", line 8, in
[rank1]: launch()
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/launcher.py", line 4, in launch
[rank1]: run_exp()
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 33, in run_sft
[rank1]: dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/data/loader.py", line 154, in get_dataset
[rank1]: with training_args.main_process_first(desc="load dataset"):
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/contextlib.py", line 113, in enter
[rank1]: return next(self.gen)
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/transformers/training_args.py", line 2332, in main_process_first
[rank1]: dist.barrier()
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3683, in barrier
[rank1]: work = default_pg.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: Error while creating shared memory segment /dev/shm/nccl-x6hG9U (size 9637888)
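As the error message itself suggests, rerunning with NCCL debug logging enabled shows which system call failed; a minimal sketch that simply reuses the command above with the NCCL_DEBUG environment variable set:

NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train examples/full_multi_gpu/llama3_full_sft.yaml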

Expected behavior

The model should train successfully.

Others

No response

The github-actions bot added the pending (This problem is yet to be addressed) label on Jun 17, 2024
@hiyouga
Owner

hiyouga commented Jun 17, 2024

Device failure.

hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label on Jun 17, 2024
hiyouga closed this as completed on Jun 17, 2024
@XYZliang

XYZliang commented Nov 9, 2024

I ran into the same problem and resolved it with some help from GPT. If you are hitting this inside a container, try increasing the container's shared memory limit, which defaults to only 64 MB. You can check how much shared memory the host provides with
df -h /dev/shm
and then pass --shm-size=XXg when creating the container to enlarge the shared memory available inside it, which fixes the problem. This only shows up under heavier workloads (I hit it when LoRA fine-tuning a 33B model and when full fine-tuning a 7B model, but not when LoRA fine-tuning a 7B model). I will follow up by adding this parameter to the official docker-compose.
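For reference, a minimal sketch of the workaround described above; the image name and the 16g value are illustrative assumptions, not values taken from this thread:

# check how much shared memory the host exposes (inside a container the default is often only 64 MB)
df -h /dev/shm

# recreate the container with a larger /dev/shm, sized to the host and the workload
docker run --gpus all --shm-size=16g -it your-training-image bash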

XYZliang added a commit to XYZliang/LLaMA-Factory that referenced this issue Nov 13, 2024
…mory allocation for large-scale model fine-tuning tasks.

This pull request increases the shm_size parameter in docker-compose.yml to 16GB. The goal is to enhance the LLaMA-Factory framework’s performance for large model fine-tuning tasks by providing sufficient shared memory for efficient data loading and parallel processing.

This PR also addresses the issue discussed in [this comment](hiyouga#4316 (comment)) regarding the shared memory limit error.
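In docker-compose terms, the change amounts to setting shm_size on the training service; a minimal sketch, with the service name and image tag as placeholders rather than the repository's actual compose file:

services:
  llamafactory:                    # service name assumed for illustration
    image: llamafactory:latest     # placeholder image tag
    shm_size: "16gb"               # enlarge /dev/shm so NCCL and the data loaders have enough shared memory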