
Full-parameter fine-tuning of baichuan2-7b-chat #4316

Closed
1 task done
leezy18 opened this issue Jun 17, 2024 · 2 comments · May be fixed by #6010
Labels
solved This problem has been already solved

Comments

@leezy18

leezy18 commented Jun 17, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.2.dev0
  • Platform: Linux-3.10.0-1160.88.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.19
  • PyTorch version: 2.3.0+cu121 (GPU)
  • Transformers version: 4.41.2
  • Datasets version: 2.20.0
  • Accelerate version: 0.31.0
  • PEFT version: 0.11.1
  • TRL version: 0.9.4
  • GPU type: NVIDIA RTX A6000 (3 GPUs)
  • DeepSpeed version: 0.14.0
  • vLLM version: 0.4.3

Reproduction

My llama3_full_sft.yaml:

### model
model_name_or_path: /datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/Baichuan2-7B-Chat

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: entity
template: baichuan2
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/baichuan2-7b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

Only the parameter in the train section was changed to bf16.

Running the command
CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train examples/full_multi_gpu/llama3_full_sft.yaml

produces the following error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/launcher.py", line 8, in
[rank1]: launch()
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/launcher.py", line 4, in launch
[rank1]: run_exp()
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 33, in run_sft
[rank1]: dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank1]: File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/data/loader.py", line 154, in get_dataset
[rank1]: with training_args.main_process_first(desc="load dataset"):
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/contextlib.py", line 113, in enter
[rank1]: return next(self.gen)
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/transformers/training_args.py", line 2332, in main_process_first
[rank1]: dist.barrier()
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3683, in barrier
[rank1]: work = default_pg.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: Error while creating shared memory segment /dev/shm/nccl-x6hG9U (size 9637888)
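As the error message itself suggests, rerunning with NCCL debug logging enabled shows which system call failed; a minimal sketch that simply reuses the command above with the NCCL_DEBUG environment variable set:

NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train examples/full_multi_gpu/llama3_full_sft.yaml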

Expected behavior

The model should train successfully.

Others

No response

The github-actions bot added the pending (This problem is yet to be addressed) label on Jun 17, 2024
@hiyouga
Owner

hiyouga commented Jun 17, 2024

Device failure.

hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label on Jun 17, 2024
hiyouga closed this as completed on Jun 17, 2024
@XYZliang

XYZliang commented Nov 9, 2024

I ran into the same problem and resolved it with some help from GPT. If you are hitting this inside a container, try increasing the container's shared memory limit, which defaults to only 64 MB. You can check how much shared memory the host provides with
df -h /dev/shm
and then pass --shm-size=XXg when creating the container to enlarge the shared memory available inside it, which fixes the problem. This only shows up under heavier workloads (I hit it when LoRA fine-tuning a 33B model and when full fine-tuning a 7B model, but not when LoRA fine-tuning a 7B model). I will follow up by adding this parameter to the official docker-compose.
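For reference, a minimal sketch of the workaround described above; the image name and the 16g value are illustrative assumptions, not values taken from this thread:

# check how much shared memory the host exposes (inside a container the default is often only 64 MB)
df -h /dev/shm

# recreate the container with a larger /dev/shm, sized to the host and the workload
docker run --gpus all --shm-size=16g -it your-training-image bash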

XYZliang added a commit to XYZliang/LLaMA-Factory that referenced this issue Nov 13, 2024
…mory allocation for large-scale model fine-tuning tasks.

This pull request increases the shm_size parameter in docker-compose.yml to 16GB. The goal is to enhance the LLaMA-Factory framework’s performance for large model fine-tuning tasks by providing sufficient shared memory for efficient data loading and parallel processing.

This PR also addresses the issue discussed in [this comment](hiyouga#4316 (comment)) regarding the shared memory limit error.
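In docker-compose terms, the change amounts to setting shm_size on the training service; a minimal sketch, with the service name and image tag as placeholders rather than the repository's actual compose file:

services:
  llamafactory:                    # service name assumed for illustration
    image: llamafactory:latest     # placeholder image tag
    shm_size: "16gb"               # enlarge /dev/shm so NCCL and the data loaders have enough shared memory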