
RWKV-LM-RLHF Official Wiki

Overview

RWKV-LM-RLHF is an advanced training implementation designed to maximize the potential of RWKV. This Wiki provides clear documentation to help newcomers understand the project.

Our goal is to further explore RWKV's capabilities and develop a true Apache-2.0 model independent of Big Tech LLMs.

Architecture

RWKV-LM-RLHF primarily functions as a PEFT (Parameter-Efficient Fine-Tuning) trainer for the RWKV language model. To work around PEFT's inherent limitations, the repo supports hybrid training approaches, including:

  • Full-parameter layers
  • PEFT (Bone, LoRA)
  • State tuning
  • Highly flexible per-layer configuration (see LayerProfile and the sketch after this list)
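The idea is that each layer can be assigned its own training mode. The snippet below is only a minimal illustration of such a per-layer mapping, assuming hypothetical mode names and keys; the actual LayerProfile format is defined on its own wiki page.

```python
# Illustrative only: keys, layer indices, and mode names are hypothetical,
# not the repo's actual LayerProfile schema.
layer_profile = {
    # train embedding and output head with full parameters
    "emb":  {"mode": "full"},
    "head": {"mode": "full"},
    # tune only the recurrent state of the first blocks
    **{f"blocks.{i}": {"mode": "state"} for i in range(0, 4)},
    # apply a low-rank adapter (e.g. LoRA or Bone) to the remaining blocks
    **{f"blocks.{i}": {"mode": "lora", "rank": 32} for i in range(4, 28)},
}
```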

Features

RLHF Implementation

1. DPO (Direct Preference Optimization) How to Use

A method that directly optimizes language models based on human preferences without requiring reward modeling or reinforcement learning.
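As a rough illustration of the objective (not the repo's exact implementation), a minimal DPO loss in PyTorch looks like this, assuming per-sequence log-probabilities have already been computed for the policy and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log-probabilities summed over
    response tokens; `beta` controls deviation from the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # maximize the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```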

2. ORPO (Odds Ratio Preference Optimization) How to Use

A preference optimization technique that uses odds ratios to improve model alignment with human preferences.
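Unlike DPO, ORPO needs no reference model: it adds an odds-ratio penalty to the ordinary SFT loss. The sketch below shows the general form of the objective (Hong et al., 2024); variable names and the length-normalization convention are assumptions, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    """Odds Ratio Preference Optimization loss (sketch).

    `chosen_logps` / `rejected_logps` are length-normalized (mean per-token)
    log-probabilities of each response; `chosen_nll` is the usual SFT
    cross-entropy on the chosen response.
    """
    # log odds(y|x) = log(p / (1 - p)), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # combine the supervised term with the odds-ratio preference term
    return chosen_nll + lam * ratio_loss
```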

Training Methods

3. Pre-Instruct Tuning How to Use

Initial phase of instruction tuning that prepares the model for more specific instruction following.

4. Instruct Tuning How to Use

Fine-tuning process that teaches the model to follow specific instructions and generate appropriate responses.
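In practice this usually means computing the language-model loss only on response tokens while masking the instruction prompt. The snippet below is a generic sketch of that masking, assuming a hypothetical `prompt_len` field; it is not the repo's actual dataset format:

```python
import torch
import torch.nn.functional as F

def instruct_loss(logits, targets, prompt_len):
    """Cross-entropy over response tokens only (generic sketch).

    `logits`: (T, vocab) model outputs, `targets`: (T,) next-token ids,
    `prompt_len`: number of instruction tokens excluded from the loss.
    """
    loss = F.cross_entropy(logits, targets, reduction="none")
    # train only on the response portion of the sequence
    mask = torch.arange(targets.size(0), device=targets.device) >= prompt_len
    return (loss * mask).sum() / mask.sum()
```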

5. Compressed Top-k Distillation

Knowledge distillation technique that (see the sketch after this list):

  • Captures the top-k logits of the teacher model offline
  • Trains the student model with a KL-divergence objective against those logits
  • Optimizes knowledge transfer while keeping storage and compute efficient
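A minimal version of the training-side objective is sketched below, assuming the top-k teacher logits and their vocabulary indices were dumped to disk beforehand; the storage/compression format and function names are illustrative, not the repo's actual API:

```python
import torch.nn.functional as F

def topk_distill_loss(student_logits, topk_values, topk_indices, temperature=1.0):
    """Distillation against top-k teacher logits captured offline (sketch).

    `student_logits`: (T, vocab); `topk_values` / `topk_indices`: (T, k)
    teacher logits and their vocabulary positions, loaded from disk.
    """
    # renormalize the teacher distribution over its top-k entries
    teacher_probs = F.softmax(topk_values / temperature, dim=-1)
    # student log-probs gathered at the same top-k vocabulary positions
    student_logps = F.log_softmax(student_logits / temperature, dim=-1)
    student_topk_logps = student_logps.gather(-1, topk_indices)
    # cross-entropy between the renormalized teacher top-k and the student
    return -(teacher_probs * student_topk_logps).sum(-1).mean()
```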

Current Status

RWKV-LM-RLHF is under active development. We welcome contributions from the community to enhance and improve the project and, together, advance RWKV's capabilities.

Note: Detailed documentation for each feature can be found in their respective pages.

© 2025 OpenMOSE