
support reward modeling and ppo #2093

Merged
hjh0119 merged 19 commits into modelscope:main from rm on Oct 10, 2024

Conversation

hjh0119 (Collaborator) commented Sep 22, 2024

PR type

  • New Feature

PR information

  • support reward modeling for LLM and MLLM (a rough sketch of the training objective follows below)
  • support PPO for LLM
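
For context, a minimal sketch of the pairwise (Bradley-Terry) objective that reward modeling typically optimizes; this is illustrative only, not taken from this PR's implementation, and all names are hypothetical:

    import torch
    import torch.nn.functional as F

    def reward_modeling_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
        # Push the scalar reward of the chosen response above that of the
        # rejected response: -log(sigmoid(r_chosen - r_rejected)).
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Hypothetical usage: a reward model is an LLM/MLLM backbone with a scalar
    # value head; `chosen`/`rejected` are its scores on a preference pair.
    chosen = torch.tensor([1.3, 0.2])
    rejected = torch.tensor([0.7, 0.9])
    loss = reward_modeling_loss(chosen, rejected)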

@hjh0119 hjh0119 changed the title from "[WIP] support reward modeling" to "[WIP] support reward modeling and ppo" on Oct 8, 2024
@@ -98,7 +99,7 @@ def __init__(self,
             optimizers=optimizers,
             preprocess_logits_for_metrics=preprocess_logits_for_metrics,
             **kwargs)
-        if not self.label_names:
+        if not hasattr(self, 'label_names') or not self.label_names:
hjh0119 (Collaborator, Author) commented:
PPOv2Trainer does not have the label_names attribute.
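
A minimal illustration of the failure this guards against, assuming a trainer mixin whose __init__ runs both for trainers that define label_names and for PPOv2Trainer, which does not (the class below is a hypothetical sketch, not the PR's code):

    class TrainerMixinSketch:
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # PPOv2Trainer never sets `label_names`, so reading
            # `self.label_names` directly would raise AttributeError;
            # guard with hasattr() before falling back to a default.
            if not hasattr(self, 'label_names') or not self.label_names:
                self.label_names = ['labels']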

@hjh0119 hjh0119 changed the title from "[WIP] support reward modeling and ppo" to "support reward modeling and ppo" on Oct 8, 2024
@@ -98,7 +98,8 @@ def prepare_model(model, args: SftArguments):
     if args.resume_from_checkpoint is None:
         handle_target_modules(model, args)
         handle_modules_to_save(model, args)
-        if args.init_lora_weights and args.init_lora_weights.lower() in ('true', 'false'):
+        if args.init_lora_weights and isinstance(args.init_lora_weights,
+                                                 str) and args.init_lora_weights.lower() in ('true', 'false'):
hjh0119 (Collaborator, Author) commented:
Avoid errors when these arguments come from reward_model_args, where init_lora_weights may not be a string.
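
To see why the isinstance check is needed: init_lora_weights may arrive as a bool rather than a string, and calling .lower() on a bool raises AttributeError. A small sketch under that assumption (the value and the final conversion are hypothetical; the PR diff does not show the if-body):

    init_lora_weights = True  # bool rather than str, as it can appear in reward_model_args

    # Without the isinstance() guard, `init_lora_weights.lower()` would raise:
    #   AttributeError: 'bool' object has no attribute 'lower'
    if init_lora_weights and isinstance(init_lora_weights, str) \
            and init_lora_weights.lower() in ('true', 'false'):
        init_lora_weights = init_lora_weights.lower() == 'true'  # hypothetical body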

@@ -1196,6 +1196,10 @@ def _init_training_args(self) -> None:
         if 'accelerator_config' in parameters:
             kwargs['accelerator_config'] = {'dispatch_batches': False}

         metric_for_best_model = 'rouge-l' if self.predict_with_generate else 'loss'
+        if hasattr(self, 'rlhf_type') and self.rlhf_type == 'ppo':
+            metric_for_best_model = None
hjh0119 (Collaborator, Author) commented:
The PPO training metrics include no metrics that start with "eval", so metric_for_best_model is set to None here.
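
For reference, transformers' Trainer prefixes metric_for_best_model with "eval_" when tracking the best checkpoint, and a missing "eval_*" key raises a KeyError; a simplified sketch of that lookup (not the actual transformers code):

    metrics = {'objective/kl': 0.3, 'loss': 1.2}  # typical PPO logs: no 'eval_*' keys

    metric_for_best_model = 'loss'
    key = metric_for_best_model
    if not key.startswith('eval_'):
        key = 'eval_' + key       # the Trainer would look up 'eval_loss'
    print(key in metrics)         # False -> best-checkpoint tracking would fail,
                                  # hence metric_for_best_model = None for PPO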

@hjh0119 hjh0119 merged commit ad87774 into modelscope:main Oct 10, 2024
2 checks passed
@hjh0119 hjh0119 deleted the rm branch October 10, 2024 02:35