Odds Ratio Preference Optimization
This document provides an overview of Odds Ratio Preference Optimization (ORPO) in RWKV-LM-RLHF, including dataset preparation and training methods.
Odds Ratio Preference Optimization is a training method that aims to improve language model responses by learning from paired examples of preferred and non-preferred outputs. The method works by:
- Calculating probability ratios between chosen and rejected responses
- Optimizing the model to maximize the likelihood of preferred responses
- Simultaneously minimizing the likelihood of non-preferred responses
- Using a comparative learning approach to enhance response quality
Think of ORPO like teaching a student by showing them both good and bad examples. Instead of just saying "this is correct," we show them "this is better than that" comparisons. This helps the model learn not just what to do, but also what to avoid, making its responses more natural and appropriate.
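To make the idea concrete, below is a minimal PyTorch sketch of an ORPO-style loss. It is an illustration of the technique, not the code used in this repository; the function and variable names are hypothetical, and `alpha` plays the role of the `--orpo_alpha` ratio used later in the training command.

```python
import torch
import torch.nn.functional as F

def orpo_style_loss(chosen_logps, rejected_logps, sft_loss, alpha=0.01):
    """Illustrative ORPO-style objective (names are hypothetical).

    chosen_logps / rejected_logps: length-normalized (mean per-token)
    log-probabilities of the chosen / rejected responses, shape (batch,).
    sft_loss: the usual cross-entropy loss on the chosen responses.
    """
    # Clamp away from 0 so log1p(-exp(logp)) stays finite.
    chosen_logps = chosen_logps.clamp(max=-1e-6)
    rejected_logps = rejected_logps.clamp(max=-1e-6)

    # log odds(y|x) = log p - log(1 - p), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    log_odds_ratio = log_odds_chosen - log_odds_rejected

    # Push the odds of the chosen response above the odds of the rejected one.
    ratio_loss = -F.logsigmoid(log_odds_ratio).mean()

    # Fraction of pairs where the chosen response already wins
    # (the kind of quantity the Pref-Percentage metric tracks).
    pref_percentage = (log_odds_ratio > 0).float().mean()

    return sft_loss + alpha * ratio_loss, pref_percentage
```

The key point is that the odds-ratio term is added on top of the ordinary SFT loss, so the model keeps learning the chosen responses while being pushed away from the rejected ones.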
- Recommended GPU: 24GB VRAM (RTX3090, RTX4090, AMD MI100)
  - ORPO requires approximately 2x the compute power and VRAM of standard SFT
  - 16GB GPUs might work, but this is untested
- Operating System: Ubuntu 22.04 or 24.04
  - Note: Disabling Wayland is recommended for training stability
RWKV-LM-RLHF supports:
- RWKV v6 (Finch)
- RWKV v7 (Goose)
Note: Currently no cross-compatibility between versions
For this guide, we use the RWKV v6 1.6B model: Download Link
The dataset should be structured as a CSV file containing three columns:
- Prompt
- Chosen (preferred response)
- Reject (non-preferred response)
prompt,chosen,reject
who are you?,i'm RWKV whats up?,i'm an AI Assistant. how can i help you?
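If you prefer to build the CSV programmatically, the same sample pair can be written with pandas. This is only a sketch; the output path is an example, not a file the repository expects.

```python
import pandas as pd

# Build an ORPO pair dataset with the required column names.
pairs = [
    {
        "prompt": "who are you?",
        "chosen": "i'm RWKV whats up?",
        "reject": "i'm an AI Assistant. how can i help you?",
    },
]
df = pd.DataFrame(pairs, columns=["prompt", "chosen", "reject"])
df.to_csv("example/ORPO/input_csv/my_orpo_dataset.csv", index=False)
```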
RWKV-LM-RLHF provides two types of sample datasets:
- Complete Dataset (with reject responses): contains Prompt, Chosen, and Reject responses. View Sample
- Base Dataset (without reject responses): contains only Prompt and Chosen responses. View Sample
The repository includes a utility to automatically generate reject responses using the base model. This is useful when you only have preferred responses available.
python rlhf_generate_reject_csv.py --load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth' \
--input_csv 'example/ORPO/input_csv/rlhf_example_dataset.csv' \
--output_csv 'example/ORPO/output_csv/rlhf_example_dataset_withreject.csv' \
--strategy 'cuda fp16'
- `--load_model`: Path to the model used for generating reject responses
- `--input_csv`: Path to the input CSV file; must contain the columns prompt, chosen, reject (reject can be empty)
- `--output_csv`: Path where the generated dataset will be saved
- `--strategy`: Model inference strategy
  - Default: 'cuda fp16'
  - For larger models (e.g., 14B): use 'cuda fp16i8' to fit within VRAM constraints
Note: Ensure your input CSV maintains the required column structure (prompt, chosen, reject) even if the reject column is empty.
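If your CSV only has prompt and chosen columns, a quick way to add the empty reject column is shown below. This is a hypothetical helper, not part of the repository; adjust the path to your own file.

```python
import pandas as pd

csv_path = "example/ORPO/input_csv/rlhf_example_dataset.csv"  # example path
df = pd.read_csv(csv_path)

# Add an empty 'reject' column if it is missing; the generator script fills it in.
if "reject" not in df.columns:
    df["reject"] = ""
df.to_csv(csv_path, index=False)
```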
First, we need to tokenize the dataset:
python rlhf_generate_save.py --load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth' \
--input_csv 'example/ORPO/output_csv/rlhf_example_dataset_withreject.csv' \
--output_save 'example/ORPO/output_save/rlhf_example_dataset.save' \
--target_pair_count 60
- `--load_model`: Path to the base model
- `--input_csv`: Path to the CSV file containing prompt, chosen, and reject columns
- `--output_save`: Path for saving the processed dataset (used for training)
- `--target_pair_count`: Number of pairs to process (recommended: 2x the number of pairs in the CSV; see the sketch below)
Launch training with the following command:
python train.py --load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth' \
--wandb "RWKV-LM-RLHF 1B6-RLHF ORPO" --proj_dir "myfolder/Outputs/1B6-ORPO"\
--infctx 0 \
--vocab_size 65536 --ctx_len 2048 \
--epoch_steps 1000 --epoch_count 1000 --epoch_begin 0 --epoch_save 1 \
--micro_bsz 1 --n_layer 24 --n_embd 2048 \
--lr_init 5e-6 --lr_final 1e-6 \
--warmup_steps 100 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 \
--grad_cp 1 --my_testing "x060" \
--strategy deepspeed_stage_2_offload \
--layer_profile 'layerprofile/24_TEST.csv' \
--quant 1 \
--quant_mode 'nf4' \
--gpu_arch 'cuda' \
--orpo 1 \
--orpo_alpha 0.01 \
--rlhf_train_file 'example/ORPO/output_save/rlhf_example_dataset.save' \
--rlhf_max_corpus_len 1024 \
--accumulate_grad_batches 16
- `--load_model`: Path to the base model
- `--ctx_len`: CUDA kernel context length (set to 2x `--rlhf_max_corpus_len`; e.g., 2048 for a max corpus length of 1024)
- `--lr_init`: Initial learning rate (should be very low for RLHF)
- `--lr_final`: Final learning rate (typically 1/5 of the initial rate)
- `--orpo_alpha`: ORPO ratio (0 to 1.0; recommended: 0.0001 to 0.01)
- `--layer_profile`: Detailed training strategy configuration; check LayerProfile
- `--rlhf_max_corpus_len`: Maximum context length (Prompt + Chosen and Prompt + Reject)
- Setup: Log in to Weights & Biases (wandb) for monitoring
- Key Metric: Watch the Pref-Percentage
  - Should start around 0.5
  - Should gradually increase towards 1.0
  - A steady increase indicates successful training
A successful ORPO training typically shows:
- Preference percentage starting at ~0.5
- Gradual, consistent increase
- Movement toward 1.0
- No sudden jumps or unstable behavior
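If you want to chart this yourself, the snippet below shows the general wandb logging pattern for a preference metric. The project, run, and metric names are placeholders (the actual keys logged by train.py may differ), and the values here are dummies standing in for a healthy training curve.

```python
import wandb

# Placeholder project/metric names; match them to what your run actually logs.
run = wandb.init(project="RWKV-LM-RLHF", name="1B6-ORPO", mode="offline")

# Dummy values illustrating the expected trend: start near 0.5, rise toward 1.0.
for step, pref_percentage in enumerate([0.50, 0.55, 0.63, 0.71]):
    run.log({"pref_percentage": pref_percentage}, step=step)

run.finish()
```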
Note: The layer profile configuration details can be found in the referenced documentation.