
Conversation

younesbelkada (Contributor) commented Dec 27, 2022

What does this PR do?

This PR integrates trl with accelerate so that models can be trained with PPOTrainer using the tools provided by the library. This enables users to train their models in mixed precision, with Data Parallelism, etc., in a very simple manner.
Users should design their own training scripts, based on the example scripts provided in examples/scripts, and run them using accelerate launch xxx.py.

This PR also integrates the Data Parallelism paradigm, enabling users to benefit from multi-GPU training if they want to speed up training.
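For illustration, a rough sketch of what such a user script could look like is shown below; everything outside the step call quoted later in this thread (e.g. build_dataset, compute_rewards, and the exact PPOTrainer constructor) is a placeholder rather than the final API introduced here:

```python
# ppo_example.py -- hedged sketch, launched with: accelerate launch ppo_example.py
# Placeholders (not the final API): build_dataset, compute_rewards, the PPOTrainer
# constructor arguments, and the generation settings.
from transformers import AutoTokenizer

def main():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    dataset = build_dataset(tokenizer)                      # placeholder: tokenized queries
    ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset)

    for batch in ppo_trainer.dataloader:                    # batches sharded per process by accelerate
        query_tensors = batch["input_ids"]

        # generate responses with the current policy
        response_tensors = [ppo_trainer.generate(q) for q in query_tensors]

        # compute one scalar reward per (query, response) pair, e.g. a sentiment score
        rewards = compute_rewards(query_tensors, response_tensors)

        # one PPO optimization step
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

if __name__ == "__main__":
    main()
```

Launched this way, accelerate handles device placement, mixed precision, and (with several processes) data parallelism without changes to the script.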

TODOs

  • add example script
  • Improve API
  • document code
  • document example (check accelerate examples)
  • Input safety checkers (very important)

DeepSpeed tests (check where it works)

  • zero-0
  • zero-1
  • zero-2
  • zero-3

#### Run PPO step
t = time.time()
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, timing, batch, rewards, t0, t, logs)
younesbelkada (Contributor, Author) commented Dec 27, 2022

To improve, we probably want a better way to log the stats
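One possible direction (purely illustrative, not the interface merged here) would be to let the trainer keep its own timing bookkeeping so the caller only passes what it actually computed:

```python
# Hypothetical simplification: timings are recorded inside `step`,
# so logging no longer needs timing, t0, t, or logs from the caller.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
```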

Comment on lines 272 to 275
if isinstance(v, torch.Tensor) and k != 'objective/kl':
    # tensor_list = [torch.zeros_like(v) for _ in range(self.accelerator.num_processes)]
    dist.all_reduce(v, dist.ReduceOp.SUM)
    v /= self.accelerator.num_processes
younesbelkada (Contributor, Author)

For me, in a DP setup each GPU will need to have its own replica of objective/kl since this is used to update the kl_ctl object above. That is why I preferred not to include it in the all_reduce operation, but I just wanted to confirm.
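For context, a hedged sketch of the consumer of that statistic (attribute and method names are assumptions based on the discussion, not a quote of the code):

```python
# Each process keeps its own adaptive KL controller and feeds it the *local*,
# un-reduced objective/kl value; averaging that key across processes would
# change how the KL coefficient adapts. `kl_ctl.update` is assumed here.
kl = stats["objective/kl"]                       # per-process KL estimate
self.kl_ctl.update(kl, self.config.batch_size)   # adapt the KL penalty coefficient
```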

lvwerra (Member) left a comment

Thanks for this @younesbelkada. My main comments are about DP. I think if we don't wrap the step inputs (queries/responses) in a dataloader we don't achieve proper DP. But maybe I am wrong?
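For reference, a minimal sketch of the dataloader route (dataset and collator names are placeholders): once the dataloader is prepared by accelerate, each process only sees its own shard of the batches, which is what gives proper DP:

```python
# Hedged sketch: wrap the step inputs in a dataloader and let accelerate shard it.
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(dataset, batch_size=16, collate_fn=collate_fn)  # placeholders
dataloader = accelerator.prepare(dataloader)  # adds a per-process sampler/shard

for batch in dataloader:
    # each process now handles a different subset of queries/responses
    ...
```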

model (torch.model): Hugging Face transformer GPT2 model with value head
ref_model (torch.model): Hugging Face transformer GPT2 reference model used for KL penalty
tokenizer (tokenizer): Hugging Face tokenizer
ppo_params (dict or None): PPO parameters for training. Can include following keys:
Member

We should replace **config (=ppo_params) with explicit kwargs or set up TrainingArguments like in transformers.
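Purely as an illustration (field names and defaults are hypothetical, not the PPOConfig that was eventually added), an explicit config object could look like this:

```python
# Hypothetical sketch of an explicit config replacing a loose **ppo_params dict.
from dataclasses import dataclass

@dataclass
class PPOConfig:
    learning_rate: float = 1.41e-5
    batch_size: int = 256
    mini_batch_size: int = 16
    ppo_epochs: int = 4
    init_kl_coef: float = 0.2
    target_kl: float = 6.0

config = PPOConfig(batch_size=128, learning_rate=1e-5)
# trainer = PPOTrainer(config, model, ref_model, tokenizer)  # instead of PPOTrainer(**ppo_params)
```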

Member

Can be a follow up PR btw

younesbelkada (Contributor, Author) commented Dec 29, 2022

wandb run (multi-GPU) after the latest commit: https://wandb.ai/distill-bloom/trl/runs/1mps4h09?workspace=user-younesbelkada

lvwerra (Member) left a comment

I think we are pretty close - a few open questions and minor changes :)

stats (dict[str, Any]):
    a dictionary of stats with the tensors gathered.
"""
import torch.distributed as dist
Member

what do you think?

# In a distributed setup, only logging needs to be performed on the main process
# check: https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
# or: https://discuss.pytorch.org/t/use-distributed-data-parallel-correctly/82500/11
self.is_distributed = self.accelerator.distributed_type == "MULTI_GPU"
Member

If we can use accelerate's gather method we can probably get rid of this?
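To make that concrete, a hedged sketch of the gather-based variant (the method name is hypothetical, and it assumes the stats values are scalar tensors as in the snippet discussed above):

```python
import torch

def _average_stats_across_processes(self, stats):
    # accelerator.gather returns the input unchanged on a single process,
    # so no explicit is_distributed / MULTI_GPU check is needed.
    for k, v in stats.items():
        if isinstance(v, torch.Tensor) and k != "objective/kl":
            gathered = self.accelerator.gather(v.reshape(1))  # shape: (num_processes,)
            stats[k] = gathered.mean()                        # average across processes
    return stats
```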


#### Compute sentiment score
t = time.time()
texts = [q + r for q,r in zip(batch['query'], batch['response'])]
Member

With the remove-columns method inside the trainer, the query shouldn't be there anymore? Since we don't pass the data through the model internally, we don't need to remove the columns?


@younesbelkada younesbelkada marked this pull request as ready for review December 29, 2022 17:27
HuggingFaceDocBuilderDev commented Dec 29, 2022

The documentation is not available anymore as the PR was closed or merged.

@younesbelkada younesbelkada mentioned this pull request Dec 29, 2022
26 tasks
@younesbelkada younesbelkada merged commit b127900 into huggingface:master Dec 30, 2022
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
* working v1

* add `accelerate` on requirements

* add `accelerate` on `setup.py`

* add `datasets` on `setup.py`

* small updates

- add docstring on most functions
- correct logging

* rm unneeded file

* replace with `generate`

* Update trl/trainer/accelerate_ppo.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* correct return

* add dataloader support

* add `wandb` to `setup.py`

* refactor

- remove `_build_dataset` method
- change name to `PPOTrainer`

* test

* fix test

* rename file

* refactor

* remove unneeded device assignment

* fix correct device assignment

* standardize docstrings

* add `wandb` on `dev`

* fix slow convergence

- random init seems to converge much faster

* oops

* revert fix

* revert patch

* remove unneeded reshape

* add input safety checker

* refactor

- added comments on example
- fixes CI test
- rewards should be a list of tensors
- clearer error messages
- remove build model method
- refactor log stats method

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* refactor

- added `PPOConfig` class
- docstring on `LengthSampler`
- fix test
- gather rewards when logging
- unwrap model when calling generate

* some refactor

* remove unneeded hack

* adapt dataset

* fix test

* remove rollout

* remove timing

* remove `shuffle=True`

* remove `LengthSampler` from trainer

* refactor

* remove text length sampler args from config

* change collate_fn

* fix silent bug

* rename

* move file

* refactor base trainer

* fix collate

* final bug

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>