
PPO Example Script Accelerator error: initialize your accelerator via accelerator = Accelerator() #2377

hitzkrieg opened this issue Nov 21, 2024 · 4 comments
Labels
🐛 bug Something isn't working 🙋 help from community wanted Open invitation for community members to contribute 🏋 PPO Related to PPO

Comments

@hitzkrieg

System Info

  • Platform: Linux-5.10.227-219.884.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.15
  • PyTorch version: 2.5.1
  • CUDA device(s): Tesla T4, Tesla T4, Tesla T4, Tesla T4
  • Transformers version: 4.46.3
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • Datasets version: 3.1.0
  • HF Hub version: 0.26.2
  • TRL version: 0.12.1
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.15.4
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: not installed

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

# Running the code on ml.g4dn.12xlarge instance 

# Setup environment 
conda create -n py_310_conda python="3.10" -y
conda activate py_310_conda
pip install trl 
pip install deepspeed

# Clone the repo
git clone https://github.com/huggingface/trl.git
cd trl 

# To resolve an NVJIT link error (irrelevant to this issue)
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH

# Change num_processes to 4 in examples/accelerate_configs/deepspeed_zero2.yaml
# Launch the example script 
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/ppo/ppo_tldr.py \
    --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
    --dataset_test_split validation \
    --output_dir models/minimal/ppo_tldr \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --total_episodes 1000000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --local_rollout_forward_batch_size 16 \
    --missing_eos_penalty 1.0 \
    --stop_token eos \
    --eval_strategy steps \
    --eval_steps 100

outputs:

(py_310_conda) -sh-4.2$ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml examples/scripts/ppo/ppo_tldr.py --dataset_name trl-internal-testing/tldr-preference-sft-trl-style --dataset_test_split validation --output_dir models/minimal/ppo_tldr --learning_rate 3e-6 --per_device_train_batch_size 16 --gradient_accumulation_steps 4 --total_episodes 1000000 --model_name_or_path EleutherAI/pythia-1b-deduped --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr --local_rollout_forward_batch_size 16 --missing_eos_penalty 1.0 --stop_token eos
[2024-11-21 01:56:22,808] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] 
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] *****************************************
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] *****************************************
[2024-11-21 01:56:27,920] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:27,963] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:27,967] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:27,972] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:29,627] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-21 01:56:29,648] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-21 01:56:29,649] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-21 01:56:29,649] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-11-21 01:56:29,651] [INFO] [comm.py:652:init_distributed] cdb=None
[rank1]:[W1121 01:56:31.240210284 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W1121 01:56:31.281175904 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1121 01:56:31.363231178 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W1121 01:56:32.470602356 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/examples/scripts/ppo/ppo_tldr.py", line 165, in <module>
[rank0]:     trainer = PPOTrainer(
[rank0]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/trl/trainer/ppo_trainer.py", line 179, in __init__
[rank0]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank0]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank0]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank0]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/state.py", line 887, in __init__
[rank3]:     trainer = PPOTrainer(
[rank3]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/trl/trainer/ppo_trainer.py", line 179, in __init__
[rank3]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank3]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank3]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank3]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/state.py", line 887, in __init__
[rank3]:     raise ValueError(
[rank3]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/examples/scripts/ppo/ppo_tldr.py", line 165, in <module>
[rank2]:     trainer = PPOTrainer(
[rank2]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/trl/trainer/ppo_trainer.py", line 179, in __init__
[rank2]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank2]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank2]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank2]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/state.py", line 887, in __init__
[rank2]:     raise ValueError(
[rank2]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank0]:[W1121 01:56:32.261786959 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W1121 01:56:34.039000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 6618 closing signal SIGTERM
W1121 01:56:34.039000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 6619 closing signal SIGTERM
W1121 01:56:34.040000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 6620 closing signal SIGTERM
E1121 01:56:34.254000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 6617) of binary: /home/ec2-user/anaconda3/envs/py_310_conda/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/py_310_conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
    deepspeed_launcher(args)
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples/scripts/ppo/ppo_tldr.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_01:56:34
  host      : ip-10-10-10-83.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 6617)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
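
For reference (this note and the snippet below are not part of the original report): the traceback shows the `ValueError` being raised from inside `Accelerator.__init__` itself, while it looks up DeepSpeed plugins on `AcceleratorState()`, during the call `PPOTrainer` makes at trl/trainer/ppo_trainer.py line 179. A minimal sketch, assuming a hypothetical file name accelerator_repro.py, for checking whether that construction fails on its own under the same launcher and ZeRO-2 config, independent of anything else the example script does:

# accelerator_repro.py -- hypothetical isolation script, not from the original report.
# It only mirrors the failing call from the traceback above: constructing an Accelerator
# with gradient accumulation under `accelerate launch` + DeepSpeed, then printing its state.
from accelerate import Accelerator


def main():
    # Same keyword argument PPOTrainer passes; 4 matches --gradient_accumulation_steps above.
    accelerator = Accelerator(gradient_accumulation_steps=4)
    accelerator.print(accelerator.state)


if __name__ == "__main__":
    main()

Launched with the same prefix (accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml accelerator_repro.py), this would help separate a TRL-side initialization-order problem from a plain accelerate/DeepSpeed version incompatibility.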


Expected behavior

The example script should launch PPO training without errors. Instead, construction of `PPOTrainer` fails with the `ValueError` above, asking to initialize the accelerator via `accelerator = Accelerator()`.

Checklist

- [X] I have checked that my issue isn't already filed (see [open issues](https://github.com/huggingface/trl/issues?q=is%3Aissue))
- [X] I have included my system information
- [X] Any code provided is minimal, complete, and reproducible ([more on MREs](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [X] Any code provided is properly formatted in code blocks, (no screenshot, [more on code blocks](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [X] Any traceback provided is complete
@ccs96307
Contributor

I encountered this issue previously and temporarily worked around it by downgrading accelerate to 0.34.2. Here are the versions I used (a pinned install command is sketched after the list):

  • accelerate==0.34.2
  • torch==2.5.1
  • transformers==4.46.2
  • deepspeed==0.15.4
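
A minimal sketch of that pin, assuming pip inside the same conda environment as in the reproduction steps (the versions are taken from the list above; compatibility beyond this exact combination is not guaranteed):

# Pin the combination reported to work around the Accelerator error (hypothetical one-liner)
pip install "accelerate==0.34.2" "torch==2.5.1" "transformers==4.46.2" "deepspeed==0.15.4"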

@qgallouedec qgallouedec added 🐛 bug Something isn't working 🙋 help from community wanted Open invitation for community members to contribute 🏋 PPO Related to PPO labels Dec 13, 2024
@lzy37ld

lzy37ld commented Dec 15, 2024

Same issue as above.

@IQIUM

IQIUM commented Dec 31, 2024

@lzy37ld
I fixed this problem by downgrading torch to 2.4.0; some other older versions might also work.

  • accelerate==0.34.2
  • torch==2.4.0
  • transformers==4.46.2
  • deepspeed==0.15.4

@Superskyyy
Contributor

This issue still persists; I had to downgrade to make ZeRO-2/3 work.
