
support reward modeling and ppo #2093

Merged
hjh0119 merged 19 commits into modelscope:main from rm on Oct 10, 2024

Conversation

hjh0119 (Collaborator) commented Sep 22, 2024

PR type

  • New Feature

PR information

  • support reward modeling for LLM and MLLM (a rough sketch of the training objective follows below)
  • support PPO for LLM
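
For context, a minimal sketch of the pairwise (Bradley-Terry) objective that reward modeling typically optimizes; this is illustrative only, not taken from this PR's implementation, and all names are hypothetical:

    import torch
    import torch.nn.functional as F

    def reward_modeling_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
        # Push the scalar reward of the chosen response above that of the
        # rejected response: -log(sigmoid(r_chosen - r_rejected)).
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Hypothetical usage: a reward model is an LLM/MLLM backbone with a scalar
    # value head; `chosen`/`rejected` are its scores on a preference pair.
    chosen = torch.tensor([1.3, 0.2])
    rejected = torch.tensor([0.7, 0.9])
    loss = reward_modeling_loss(chosen, rejected)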

@hjh0119 hjh0119 changed the title from "[WIP] support reward modeling" to "[WIP] support reward modeling and ppo" on Oct 8, 2024
@@ -98,7 +99,7 @@ def __init__(self,
             optimizers=optimizers,
             preprocess_logits_for_metrics=preprocess_logits_for_metrics,
             **kwargs)
-        if not self.label_names:
+        if not hasattr(self, 'label_names') or not self.label_names:
hjh0119 (Collaborator, Author) commented:
PPOv2Trainer does not have the label_names attribute.
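
A minimal illustration of the failure this guards against, assuming a trainer mixin whose __init__ runs both for trainers that define label_names and for PPOv2Trainer, which does not (the class below is a hypothetical sketch, not the PR's code):

    class TrainerMixinSketch:
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # PPOv2Trainer never sets `label_names`, so reading
            # `self.label_names` directly would raise AttributeError;
            # guard with hasattr() before falling back to a default.
            if not hasattr(self, 'label_names') or not self.label_names:
                self.label_names = ['labels']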

@hjh0119 hjh0119 changed the title from "[WIP] support reward modeling and ppo" to "support reward modeling and ppo" on Oct 8, 2024
@@ -98,7 +98,8 @@ def prepare_model(model, args: SftArguments):
     if args.resume_from_checkpoint is None:
         handle_target_modules(model, args)
         handle_modules_to_save(model, args)
-        if args.init_lora_weights and args.init_lora_weights.lower() in ('true', 'false'):
+        if args.init_lora_weights and isinstance(args.init_lora_weights,
+                                                 str) and args.init_lora_weights.lower() in ('true', 'false'):
hjh0119 (Collaborator, Author) commented:
Avoid errors when these arguments come from reward_model_args, where init_lora_weights may not be a string.
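
To see why the isinstance check is needed: init_lora_weights may arrive as a bool rather than a string, and calling .lower() on a bool raises AttributeError. A small sketch under that assumption (the value and the final conversion are hypothetical; the PR diff does not show the if-body):

    init_lora_weights = True  # bool rather than str, as it can appear in reward_model_args

    # Without the isinstance() guard, `init_lora_weights.lower()` would raise:
    #   AttributeError: 'bool' object has no attribute 'lower'
    if init_lora_weights and isinstance(init_lora_weights, str) \
            and init_lora_weights.lower() in ('true', 'false'):
        init_lora_weights = init_lora_weights.lower() == 'true'  # hypothetical body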

@@ -1196,6 +1196,10 @@ def _init_training_args(self) -> None:
         if 'accelerator_config' in parameters:
             kwargs['accelerator_config'] = {'dispatch_batches': False}

         metric_for_best_model = 'rouge-l' if self.predict_with_generate else 'loss'
+        if hasattr(self, 'rlhf_type') and self.rlhf_type == 'ppo':
+            metric_for_best_model = None
hjh0119 (Collaborator, Author) commented:
The PPO training metrics include no metrics that start with "eval", so metric_for_best_model is set to None here.
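
For reference, transformers' Trainer prefixes metric_for_best_model with "eval_" when tracking the best checkpoint, and a missing "eval_*" key raises a KeyError; a simplified sketch of that lookup (not the actual transformers code):

    metrics = {'objective/kl': 0.3, 'loss': 1.2}  # typical PPO logs: no 'eval_*' keys

    metric_for_best_model = 'loss'
    key = metric_for_best_model
    if not key.startswith('eval_'):
        key = 'eval_' + key       # the Trainer would look up 'eval_loss'
    print(key in metrics)         # False -> best-checkpoint tracking would fail,
                                  # hence metric_for_best_model = None for PPO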

@hjh0119 hjh0119 merged commit ad87774 into modelscope:main Oct 10, 2024
2 checks passed
@hjh0119 hjh0119 deleted the rm branch October 10, 2024 02:35