
Layer Profile in RWKV-LM-RLHF

Overview

RWKV-LM-RLHF enables flexible customization of the training parameters of each model layer, so that learning outcomes can be maximized under limited computational resources (VRAM). This includes per-layer control over layer freezing, full-parameter training, PEFT methods, and learning rates.

Example Layer Profile Configuration

| Layer | Mode   | Rank | Alpha | Dropout | Weight_lr_init | Weight_lr_final | Weight_decay | State_lr_init | State_lr_final |
|-------|--------|------|-------|---------|----------------|-----------------|--------------|---------------|----------------|
| emb   | freeze | 0    | 0     | 0.01    | 0.000001       | 0.0000001       | 0.01         | 0.05          | 0.01           |
| 0     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 1     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 2     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 3     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 4     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 5     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 6     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 7     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 8     | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 9     | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 10    | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 11    | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| head  | full   | 0    | 32    | 0.01    | 0.00001        | 0.000001        | 0.01         | 0.05          | 0.01           |

Configuration Details

In this example configuration:

  • The embedding layer is frozen
  • Layers 0-3 use PEFT (Bone, rank 512)
  • Layers 4-7 use PEFT (Bone, rank 1024)
  • Layers 8-11 and the Head layer use full-parameter tuning (a parsing sketch follows this list)
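
As an illustration only, here is a minimal Python sketch of how a profile like the table above could be read from a CSV file. The file name, the CSV column headers, and the LayerProfile dataclass are assumptions made for this sketch, not necessarily the loader or on-disk format actually used by RWKV-LM-RLHF.

```python
import csv
from dataclasses import dataclass

# Minimal sketch, not the repository's actual loader: the CSV layout, file name,
# and the LayerProfile dataclass below are assumptions mirroring the table above.

@dataclass
class LayerProfile:
    layer: str              # "emb", "0" ... "11", or "head"
    mode: str               # "freeze", "bone", "lora", or "full"
    rank: int               # PEFT rank (ignored for freeze/full)
    alpha: int
    dropout: float
    weight_lr_init: float
    weight_lr_final: float
    weight_decay: float
    state_lr_init: float
    state_lr_final: float

def load_layer_profiles(path: str) -> dict[str, LayerProfile]:
    """Read one LayerProfile per row, keyed by the Layer column."""
    profiles: dict[str, LayerProfile] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            profiles[row["Layer"]] = LayerProfile(
                layer=row["Layer"],
                mode=row["Mode"],
                rank=int(row["Rank"]),
                alpha=int(row["Alpha"]),
                dropout=float(row["Dropout"]),
                weight_lr_init=float(row["Weight_lr_init"]),
                weight_lr_final=float(row["Weight_lr_final"]),
                weight_decay=float(row["Weight_decay"]),
                state_lr_init=float(row["State_lr_init"]),
                state_lr_final=float(row["State_lr_final"]),
            )
    return profiles

# Usage (hypothetical file name):
# profiles = load_layer_profiles("layer_profile.csv")
# For the table above, profiles["4"].mode == "bone" and profiles["4"].rank == 1024.
```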

Training Features

Dual Training Support

RWKV-LM-RLHF supports training weights and states simultaneously, and the learning rates for weights and for states can each be configured independently per layer.
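
The sketch below shows one way this per-layer, weight-versus-state split could be wired into a PyTorch optimizer: one parameter group per parameter, with the learning rate and weight decay picked from the layer profile. The parameter-name patterns ("blocks.N.", the "state" substring) and the profiles mapping carried over from the previous sketch are assumptions, not the project's actual optimizer setup.

```python
import torch

# Sketch only: assumes the profiles mapping from the previous example and
# placeholder parameter names; RWKV-LM-RLHF's real optimizer setup may differ.

def build_param_groups(model: torch.nn.Module, profiles) -> list:
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Map the parameter to a profile row: "blocks.<n>.*" -> "<n>", otherwise emb/head.
        if name.startswith("blocks."):
            layer_key = name.split(".")[1]
        elif "emb" in name:
            layer_key = "emb"
        else:
            layer_key = "head"
        prof = profiles[layer_key]
        # Assumption: trainable state tensors can be recognized by name.
        is_state = "state" in name
        groups.append({
            "params": [param],
            "lr": prof.state_lr_init if is_state else prof.weight_lr_init,
            "weight_decay": 0.0 if is_state else prof.weight_decay,
        })
    return groups

# optimizer = torch.optim.AdamW(build_param_groups(model, profiles))
```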

Empirical Findings

  • Higher training resolution (e.g. higher PEFT rank or full-parameter tuning) in layers closer to the Head typically yields better learning results
  • For RWKV x060, there have been reports of improved translation performance when fine-tuning only around layer 17 (see the sketch below)
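
For concreteness, the hypothetical snippet below expresses that "tune only around layer 17" idea in the same illustrative terms as the earlier sketches: freeze everything except a few blocks near layer 17. The layer count and learning rate are placeholder values, not figures from the report.

```python
# Hypothetical illustration of "fine-tune only around layer 17": freeze every layer
# except blocks 16-18. NUM_LAYERS and the learning rate are placeholders.
NUM_LAYERS = 24
TUNED = {16, 17, 18}

profile = {"emb": {"mode": "freeze"}, "head": {"mode": "freeze"}}
for i in range(NUM_LAYERS):
    profile[str(i)] = {
        "mode": "full" if i in TUNED else "freeze",
        "weight_lr_init": 1e-4 if i in TUNED else 0.0,
    }
```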

Recommendations

Parameter Selection

  • Configure settings based on the balance between learning speed and VRAM constraints
  • For PEFT applications, Bone is recommended as the primary choice (refer to Jl-er's paper for details)
  • LoRA is effective for scenarios with significantly limited datasets
    • Experiment with LoRA on approximately four layers near the Head layer (a minimal sketch follows this list)
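
A minimal sketch of that last suggestion, assuming a PyTorch model whose layers are exposed as model.blocks with ordinary nn.Linear projections: wrap the Linear layers of the last four blocks in a small LoRA adapter and freeze everything else. The LoRALinear class, the module names, and the model.blocks attribute are illustrative assumptions, not the repository's PEFT implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32, dropout: float = 0.01):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.lora_A, std=0.02)  # B stays zero, so training starts at the base output
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return self.base(x) + lora_out * self.scaling

def apply_lora_to_last_blocks(model: nn.Module, num_blocks: int = 4, rank: int = 16):
    """Freeze the whole model, then add LoRA adapters to Linear layers in the last blocks.

    Assumes the model exposes its layers as `model.blocks` (hypothetical attribute).
    """
    for p in model.parameters():
        p.requires_grad_(False)
    for block in list(model.blocks)[-num_blocks:]:
        for name, module in list(block.named_modules()):
            if isinstance(module, nn.Linear):
                parent = block.get_submodule(name.rsplit(".", 1)[0]) if "." in name else block
                setattr(parent, name.rsplit(".", 1)[-1], LoRALinear(module, rank=rank))
```

Because lora_B starts at zero, the adapted model initially reproduces the frozen base model, and only the low-rank updates in the last few blocks are trained.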

These configurations should be adjusted based on your specific requirements and resource constraints.