Odds Ratio Preference Optimization

Odds Ratio Preference Optimization (ORPO) for RWKV-LM-RLHF

Introduction

This document explains the overview of Odds Ratio Preference Optimization in RWKV-LM-RLHF, including dataset preparation and training methods.

1. Understanding ORPO

Technical Details

Odds Ratio Preference Optimization is a training method that aims to improve language model responses by learning from paired examples of preferred and non-preferred outputs. The method works by:

  1. Calculating probability ratios between chosen and rejected responses
  2. Optimizing the model to maximize the likelihood of preferred responses
  3. Simultaneously minimizing the likelihood of non-preferred responses
  4. Using a comparative learning approach to enhance response quality
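
The core of this can be written down compactly. Below is a minimal PyTorch sketch of the odds-ratio term, not the repository's exact implementation: it assumes you already have per-token-averaged log-probabilities for the chosen and rejected responses, and the function name and default alpha are illustrative only.

import torch
import torch.nn.functional as F

def orpo_odds_ratio_term(chosen_logps, rejected_logps, alpha=0.01):
    # chosen_logps / rejected_logps: per-token-averaged log-probabilities of
    # the chosen and rejected responses, shape (batch,); values are < 0.
    # odds(y|x) = p / (1 - p), computed in log space for numerical stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # reward the model when the odds of the chosen response exceed those of the rejected one
    return -alpha * F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

The total training loss adds this term to the usual SFT cross-entropy on the chosen responses, which is why ORPO does not need a separate reference model.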

Simple Explanation

Think of ORPO like teaching a student by showing them both good and bad examples. Instead of just saying "this is correct," we show them "this is better than that" comparisons. This helps the model learn not just what to do, but also what to avoid, making its responses more natural and appropriate.

2. Training Environment Requirements

Hardware Requirements

  • Recommended GPU: 24GB VRAM (RTX3090, RTX4090, AMD MI100)
    • ORPO requires approximately 2x the compute power and VRAM compared to standard SFT
    • A 16GB GPU might work, but this is untested

Software Requirements

  • Operating System: Ubuntu 22.04 or 24.04
  • Note: Disabling Wayland is recommended for training stability

Implementation Guide

1. Model Preparation

RWKV-LM-RLHF supports:

  • RWKV v6 (Finch)
  • RWKV v7 (Goose)

Note: Checkpoints are not cross-compatible between the two versions

How to try

Reference Model

For this guide, we use the RWKV v6 1.6B model: Download Link

2. Dataset Preparation

Format Requirements

The dataset should be structured as a CSV file containing three columns:

  • Prompt
  • Chosen (preferred response)
  • Reject (non-preferred response)

Example CSV Format

prompt,chosen,reject
who are you?,i'm RWKV what's up?,i'm an AI Assistant. how can i help you?
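
If you build this CSV programmatically, Python's csv module handles quoting for you when prompts or responses contain commas or quotation marks. A small sketch; the output path and example row are placeholders:

import csv

rows = [
    {"prompt": "who are you?",
     "chosen": "i'm RWKV, what's up?",
     "reject": "i'm an AI Assistant. how can i help you?"},
]

with open("example/ORPO/input_csv/my_orpo_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "chosen", "reject"])
    writer.writeheader()    # writes the header row: prompt,chosen,reject
    writer.writerows(rows)  # fields containing commas or quotes are escaped automatically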

Dataset Examples

Sample Datasets

RWKV-LM-RLHF provides two types of sample datasets:

  1. Complete Dataset (with reject responses)
    Contains Prompt, Chosen, and Reject responses
    View Sample

  2. Base Dataset (without reject responses)
    Contains only Prompt and Chosen responses
    View Sample

Generating Reject Responses

The repository includes a utility to automatically generate reject responses using the base model. This is useful when you only have preferred responses available.

Command Structure

python rlhf_generate_reject_csv.py --load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth' \
 --input_csv 'example/ORPO/input_csv/rlhf_example_dataset.csv' \
 --output_csv 'example/ORPO/output_csv/rlhf_example_dataset_withreject.csv' \
 --strategy 'cuda fp16' 

Parameters Explanation

  • --load_model: Path to the model used for generating reject responses
  • --input_csv: Path to input CSV file
    • Must contain columns: prompt, chosen, reject (reject can be empty)
  • --output_csv: Path where the generated dataset will be saved
  • --strategy: Model inference strategy
    • Default: 'cuda fp16'
    • For larger models (e.g., 14B): Use 'cuda fp16i8' to fit within VRAM constraints

Note: Ensure your input CSV maintains the required column structure (prompt, chosen, reject) even if the reject column is empty.
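
If your source data has only prompt and chosen columns, you can add the empty reject column before running the generator. A minimal pandas sketch; the input and output file names are placeholders:

import pandas as pd

df = pd.read_csv("my_prompt_chosen_only.csv")    # columns: prompt, chosen
df["reject"] = ""                                # required third column, left empty
df.to_csv("example/ORPO/input_csv/my_orpo_dataset.csv", index=False)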

3. Data Preprocessing

First, we need to tokenize the dataset:

python rlhf_generate_save.py --load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth' \
 --input_csv 'example/ORPO/output_csv/rlhf_example_dataset_withreject.csv' \
 --output_save 'example/ORPO/output_save/rlhf_example_dataset.save' \
 --target_pair_count 60

Parameters Explanation

  • --load_model: Path to the base model
  • --input_csv: Path to CSV file containing prompt, chosen, and reject columns
  • --output_save: Path for saving processed dataset (used for training)
  • --target_pair_count: Number of pairs to process (recommended: 2x the number of pairs in the CSV; see the sketch below)
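
To follow the 2x recommendation, you can count the rows in your CSV and double the result. A small sketch using the example dataset path from above:

import csv

with open("example/ORPO/output_csv/rlhf_example_dataset_withreject.csv", newline="", encoding="utf-8") as f:
    pair_count = sum(1 for _ in csv.DictReader(f))  # one row per prompt/chosen/reject pair

print(f"pairs in CSV: {pair_count}, suggested --target_pair_count: {pair_count * 2}")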

4. Training Configuration

Launch training with the following command:

python train.py --load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth' \
 --wandb "RWKV-LM-RLHF 1B6-RLHF ORPO" --proj_dir "myfolder/Outputs/1B6-ORPO"\
 --infctx 0 \
 --vocab_size 65536 --ctx_len 2048 \
 --epoch_steps 1000 --epoch_count 1000 --epoch_begin 0 --epoch_save 1 \
 --micro_bsz 1 --n_layer 24 --n_embd 2048 \
 --lr_init 5e-6 --lr_final 1e-6 \
 --warmup_steps 100 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
 --accelerator gpu --devices 1 --precision bf16 \
 --grad_cp 1 --my_testing "x060" \
 --strategy deepspeed_stage_2_offload \
 --layer_profile 'layerprofile/24_TEST.csv' \
 --quant 1 \
 --quant_mode 'nf4' \
 --gpu_arch 'cuda' \
 --orpo 1 \
 --orpo_alpha 0.01 \
 --rlhf_train_file 'example/ORPO/output_save/rlhf_example_dataset.save' \
 --rlhf_max_corpus_len 1024 \
 --accumulate_grad_batches 16

Key Training Parameters

Essential Parameters

  • --load_model: Path to base model
  • --ctx_len: CUDA kernel context length (set to 2x rlhf_max_corpus_len)
  • --lr_init: Initial learning rate (should be very low for RLHF)
  • --lr_final: Final learning rate (typically 1/5 of initial rate)
  • --orpo_alpha: ORPO Ratio (0-1.0, recommended: 0.0001 to 0.01)
  • --layer_profile: Detailed training strategy configuration (see the LayerProfile page)
  • --rlhf_max_corpus_len: Maximum context length, applied to both Prompt + Chosen and Prompt + Reject sequences

Monitoring Training Progress

  1. Setup: Log in to Weights & Biases (Wandb) for monitoring
  2. Key Metric: Watch the Pref-Percentage (see the sketch after this list)
    • Should start around 0.5
    • Gradually increase towards 1.0
    • Steady increase indicates successful training
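
The wiki does not define Pref-Percentage formally; a reasonable reading is the fraction of pairs in a batch (or logging window) for which the model currently assigns a higher length-averaged log-probability to the chosen response than to the rejected one. A minimal sketch under that assumption:

import torch

def preference_percentage(chosen_logps, rejected_logps):
    # chosen_logps / rejected_logps: per-token-averaged log-probabilities, shape (batch,)
    # a model with no learned preference hovers around 0.5
    return (chosen_logps > rejected_logps).float().mean().item()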

Training Success Indicators

A successful ORPO training typically shows:

  • Preference percentage starting at ~0.5
  • Gradual, consistent increase
  • Movement toward 1.0
  • No sudden jumps or unstable behavior

Note: The layer profile configuration details can be found in the referenced documentation.