Linked pull request: "…mory allocation for large-scale model fine-tuning tasks."

This pull request increases the `shm_size` parameter in `docker-compose.yml` to 16GB, giving the LLaMA-Factory framework enough shared memory for efficient data loading and parallel processing during large-model fine-tuning.

This PR also addresses the shared-memory-limit error discussed in [this comment](hiyouga#4316 (comment)).
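For reference, a minimal sketch of what the change might look like, assuming a compose file with a single training service (the service name and image are placeholders; only the `shm_size` value comes from the PR description):

```yaml
services:
  llamafactory:                   # placeholder service name
    image: llamafactory:latest    # placeholder image
    shm_size: "16gb"              # raised from Docker's 64MB default so NCCL
                                  # can create its segments under /dev/shm
```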
Reminder
System Info
llamafactory version: 0.8.2.dev0

Reproduction
My llama3_full_sft.yaml is unchanged except that the precision parameter in the train section is set to bf16.
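For context, a minimal sketch of what that modified train block might look like (every field except `bf16` is an illustrative assumption, not copied from the reporter's file):

```yaml
### train                        # section-comment style used by LLaMA-Factory example configs
per_device_train_batch_size: 1   # assumed value, not from the report
gradient_accumulation_steps: 2   # assumed value, not from the report
learning_rate: 1.0e-5            # assumed value, not from the report
num_train_epochs: 3.0            # assumed value, not from the report
bf16: true                       # the one change the report describes
```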
Run:

```sh
CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train examples/full_multi_gpu/llama3_full_sft.yaml
```

The following error appears:
```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/launcher.py", line 8, in <module>
[rank1]:     launch()
[rank1]:   File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/launcher.py", line 4, in launch
[rank1]:     run_exp()
[rank1]:   File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
[rank1]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]:   File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 33, in run_sft
[rank1]:     dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank1]:   File "/datasata0/cloud-wuzhengyuan/lzy/baichuan2_finetune/LLaMA-Factory/src/llamafactory/data/loader.py", line 154, in get_dataset
[rank1]:     with training_args.main_process_first(desc="load dataset"):
[rank1]:   File "/root/miniconda3/envs/LLaMA/lib/python3.8/contextlib.py", line 113, in __enter__
[rank1]:     return next(self.gen)
[rank1]:   File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/transformers/training_args.py", line 2332, in main_process_first
[rank1]:     dist.barrier()
[rank1]:   File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/LLaMA/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3683, in barrier
[rank1]:     work = default_pg.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: Error while creating shared memory segment /dev/shm/nccl-x6hG9U (size 9637888)
```
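For anyone hitting the same `ncclSystemError`: the last line points at `/dev/shm` being too small for NCCL's shared-memory transport, which is common inside Docker containers left at the 64MB default. A quick diagnostic sketch, assuming the same environment as above (only `df`, the standard `NCCL_DEBUG` variable, and Docker's documented `--shm-size` flag are used):

```sh
# Check how much shared memory the process can actually use; the failed
# segment above was only ~9.6MB, but multi-GPU jobs create several.
df -h /dev/shm

# Re-run with NCCL's own logging enabled, as the error message suggests,
# to confirm which transport step fails.
NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train examples/full_multi_gpu/llama3_full_sft.yaml

# For a one-off container (outside docker-compose), shared memory can be
# raised at startup; this mirrors the shm_size change in the PR above.
docker run --shm-size=16g ...
```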
Expected behavior
The model is expected to train successfully.
Others
No response