
PPO Example Script Accelerator error: initialize your accelerator via accelerator = Accelerator() #2377

hitzkrieg opened this issue Nov 21, 2024 · 4 comments
Labels
🐛 bug Something isn't working 🙋 help from community wanted Open invitation for community members to contribute 🏋 PPO Related to PPO

Comments

@hitzkrieg

System Info

  • Platform: Linux-5.10.227-219.884.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.15
  • PyTorch version: 2.5.1
  • CUDA device(s): Tesla T4, Tesla T4, Tesla T4, Tesla T4
  • Transformers version: 4.46.3
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • Datasets version: 3.1.0
  • HF Hub version: 0.26.2
  • TRL version: 0.12.1
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.15.4
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: not installed

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

# Running the code on ml.g4dn.12xlarge instance 

# Setup environment 
conda create -n py_310_conda python="3.10" -y
conda activate py_310_conda
pip install trl 
pip install deepspeed

# Clone the repo
git clone https://github.com/huggingface/trl.git
cd trl 

# To resolve an NVJIT link error (irrelevant to this issue)
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH

# Change num_processes to 4 in examples/accelerate_configs/deepspeed_zero2.yaml
# Launch the example script 
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/ppo/ppo_tldr.py \
    --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
    --dataset_test_split validation \
    --output_dir models/minimal/ppo_tldr \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --total_episodes 1000000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --local_rollout_forward_batch_size 16 \
    --missing_eos_penalty 1.0 \
    --stop_token eos \
    --eval_strategy steps \
    --eval_steps 100

outputs:

(py_310_conda) -sh-4.2$ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml examples/scripts/ppo/ppo_tldr.py --dataset_name trl-internal-testing/tldr-preference-sft-trl-style --dataset_test_split validation --output_dir models/minimal/ppo_tldr --learning_rate 3e-6 --per_device_train_batch_size 16 --gradient_accumulation_steps 4 --total_episodes 1000000 --model_name_or_path EleutherAI/pythia-1b-deduped --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr --local_rollout_forward_batch_size 16 --missing_eos_penalty 1.0 --stop_token eos
[2024-11-21 01:56:22,808] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] 
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] *****************************************
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1121 01:56:24.612000 6458 site-packages/torch/distributed/run.py:793] *****************************************
[2024-11-21 01:56:27,920] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:27,963] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:27,967] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:27,972] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-21 01:56:29,627] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-21 01:56:29,648] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-21 01:56:29,649] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-21 01:56:29,649] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-11-21 01:56:29,651] [INFO] [comm.py:652:init_distributed] cdb=None
[rank1]:[W1121 01:56:31.240210284 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W1121 01:56:31.281175904 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1121 01:56:31.363231178 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W1121 01:56:32.470602356 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/examples/scripts/ppo/ppo_tldr.py", line 165, in <module>
[rank0]:     trainer = PPOTrainer(
[rank0]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/trl/trainer/ppo_trainer.py", line 179, in __init__
[rank0]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank0]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank0]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank0]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/state.py", line 887, in __init__
[rank3]:     trainer = PPOTrainer(
[rank3]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/trl/trainer/ppo_trainer.py", line 179, in __init__
[rank3]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank3]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank3]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank3]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/state.py", line 887, in __init__
[rank3]:     raise ValueError(
[rank3]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/examples/scripts/ppo/ppo_tldr.py", line 165, in <module>
[rank2]:     trainer = PPOTrainer(
[rank2]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/home/ec2-user/SageMaker/trl_nov_20/trl/trl/trainer/ppo_trainer.py", line 179, in __init__
[rank2]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank2]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank2]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank2]:   File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/state.py", line 887, in __init__
[rank2]:     raise ValueError(
[rank2]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank0]:[W1121 01:56:32.261786959 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W1121 01:56:34.039000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 6618 closing signal SIGTERM
W1121 01:56:34.039000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 6619 closing signal SIGTERM
W1121 01:56:34.040000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 6620 closing signal SIGTERM
E1121 01:56:34.254000 6458 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 6617) of binary: /home/ec2-user/anaconda3/envs/py_310_conda/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/py_310_conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
    deepspeed_launcher(args)
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/py_310_conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples/scripts/ppo/ppo_tldr.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_01:56:34
  host      : ip-10-10-10-83.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 6617)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
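
For reference (this note and the snippet below are not part of the original report): the traceback shows the `ValueError` being raised from inside `Accelerator.__init__` itself, while it looks up DeepSpeed plugins on `AcceleratorState()`, during the call `PPOTrainer` makes at trl/trainer/ppo_trainer.py line 179. A minimal sketch, assuming a hypothetical file name accelerator_repro.py, for checking whether that construction fails on its own under the same launcher and ZeRO-2 config, independent of anything else the example script does:

# accelerator_repro.py -- hypothetical isolation script, not from the original report.
# It only mirrors the failing call from the traceback above: constructing an Accelerator
# with gradient accumulation under `accelerate launch` + DeepSpeed, then printing its state.
from accelerate import Accelerator


def main():
    # Same keyword argument PPOTrainer passes; 4 matches --gradient_accumulation_steps above.
    accelerator = Accelerator(gradient_accumulation_steps=4)
    accelerator.print(accelerator.state)


if __name__ == "__main__":
    main()

Launched with the same prefix (accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml accelerator_repro.py), this would help separate a TRL-side initialization-order problem from a plain accelerate/DeepSpeed version incompatibility.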


Expected behavior

The example script should launch PPO training without errors. Instead, construction of `PPOTrainer` fails with the `ValueError` above, asking to initialize the accelerator via `accelerator = Accelerator()`.

Checklist

- [X] I have checked that my issue isn't already filed (see [open issues](https://github.com/huggingface/trl/issues?q=is%3Aissue))
- [X] I have included my system information
- [X] Any code provided is minimal, complete, and reproducible ([more on MREs](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [X] Any code provided is properly formatted in code blocks, (no screenshot, [more on code blocks](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [X] Any traceback provided is complete
@ccs96307
Contributor

I encountered this issue previously and temporarily worked around it by downgrading accelerate to 0.34.2. Here are the versions I used (a pinned install command is sketched after the list):

  • accelerate==0.34.2
  • torch==2.5.1
  • transformers==4.46.2
  • deepspeed==0.15.4
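
A minimal sketch of that pin, assuming pip inside the same conda environment as in the reproduction steps (the versions are taken from the list above; compatibility beyond this exact combination is not guaranteed):

# Pin the combination reported to work around the Accelerator error (hypothetical one-liner)
pip install "accelerate==0.34.2" "torch==2.5.1" "transformers==4.46.2" "deepspeed==0.15.4"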

@qgallouedec qgallouedec added 🐛 bug Something isn't working 🙋 help from community wanted Open invitation for community members to contribute 🏋 PPO Related to PPO labels Dec 13, 2024
@lzy37ld

lzy37ld commented Dec 15, 2024

Same issue as above.

@IQIUM

IQIUM commented Dec 31, 2024

@lzy37ld
I fixed this problem by downgrading torch to 2.4.0; some other older versions might also work.

  • accelerate==0.34.2
  • torch==2.4.0
  • transformers==4.46.2
  • deepspeed==0.15.4

@Superskyyy
Contributor

This issue still persists; I had to downgrade to make ZeRO-2/3 work.
