# Layer Profile
RWKV-LM-RLHF lets you customize training parameters for each model layer, so you can get the most out of limited computational resources (VRAM). Per-layer controls include layer freezing, full-parameter training, PEFT methods, and learning rates.
Layer | Mode | Rank | Alpha | Dropout | Weight_lr_init | Weight_lr_final | Weight_decay | State_lr_init | State_lr_final |
---|---|---|---|---|---|---|---|---|---|
emb | freeze | 0 | 0 | 0.01 | 0.000001 | 0.0000001 | 0.01 | 0.05 | 0.01 |
0 | bone | 512 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
1 | bone | 512 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
2 | bone | 512 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
3 | bone | 512 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
4 | bone | 1024 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
5 | bone | 1024 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
6 | bone | 1024 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
7 | bone | 1024 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
8 | full | 0 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
9 | full | 0 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
10 | full | 0 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
11 | full | 0 | 32 | 0.01 | 0.0001 | 0.00001 | 0.01 | 0.05 | 0.01 |
head | full | 0 | 32 | 0.01 | 0.00001 | 0.000001 | 0.01 | 0.05 | 0.01 |
In this example configuration:
- The embedding layer is frozen
- Layers 0-3 use PEFT (Bone, rank 512)
- Layers 4-7 use PEFT (Bone, rank 1024)
- Layers 8 through the head layer use full-parameter tuning (see the profile sketch below)
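One way to express such a profile in code is as a per-layer record that the training script can consume. The sketch below is illustrative only: the actual file format, column names, and loader used by RWKV-LM-RLHF may differ, and the path `layer_profile.csv` is a hypothetical example.

```python
import csv

# Hypothetical per-layer profile mirroring the table above.
# The column names and CSV format are assumptions, not the repository's actual schema.
FIELDS = ["layer", "mode", "rank", "alpha", "dropout",
          "weight_lr_init", "weight_lr_final", "weight_decay",
          "state_lr_init", "state_lr_final"]

profile = (
    [{"layer": "emb", "mode": "freeze", "rank": 0, "alpha": 0, "dropout": 0.01,
      "weight_lr_init": 1e-6, "weight_lr_final": 1e-7, "weight_decay": 0.01,
      "state_lr_init": 0.05, "state_lr_final": 0.01}]
    + [{"layer": i, "mode": "bone", "rank": 512 if i < 4 else 1024, "alpha": 32,
        "dropout": 0.01, "weight_lr_init": 1e-4, "weight_lr_final": 1e-5,
        "weight_decay": 0.01, "state_lr_init": 0.05, "state_lr_final": 0.01}
       for i in range(8)]
    + [{"layer": i, "mode": "full", "rank": 0, "alpha": 32, "dropout": 0.01,
        "weight_lr_init": 1e-4, "weight_lr_final": 1e-5, "weight_decay": 0.01,
        "state_lr_init": 0.05, "state_lr_final": 0.01}
       for i in range(8, 12)]
    + [{"layer": "head", "mode": "full", "rank": 0, "alpha": 32, "dropout": 0.01,
        "weight_lr_init": 1e-5, "weight_lr_final": 1e-6, "weight_decay": 0.01,
        "state_lr_init": 0.05, "state_lr_final": 0.01}]
)

# Write the profile to a CSV file (hypothetical filename).
with open("layer_profile.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(profile)
```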
RWKV-LM-RLHF supports training weights and states simultaneously, and the learning rate for each can be configured independently per layer.
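Independent per-layer learning rates map naturally onto PyTorch optimizer parameter groups. The following is a minimal sketch of that idea, not the repository's actual trainer: it assumes the model exposes its layers under names like `blocks.{n}.` and its trainable state under names containing `state`, which are assumptions made for illustration.

```python
import torch

def build_param_groups(model, profile):
    """Group parameters so each layer gets its own weight/state learning rate.

    `profile` maps a layer key ("emb", 0..n, "head") to a dict with
    "mode", "weight_lr_init", "state_lr_init", and "weight_decay".
    The parameter-name matching below is illustrative; real RWKV
    checkpoints may use different names.
    """
    groups = []
    for name, param in model.named_parameters():
        # Resolve which layer this parameter belongs to (assumed naming scheme).
        if name.startswith("emb"):
            key = "emb"
        elif name.startswith("head"):
            key = "head"
        elif name.startswith("blocks."):
            key = int(name.split(".")[1])
        else:
            key = "head"  # fallback for anything unrecognized

        cfg = profile[key]
        if cfg["mode"] == "freeze":
            param.requires_grad = False
            continue

        # State parameters get their own learning rate; everything else
        # uses the per-layer weight learning rate.
        lr = cfg["state_lr_init"] if "state" in name else cfg["weight_lr_init"]
        groups.append({"params": [param], "lr": lr,
                       "weight_decay": cfg["weight_decay"]})
    return groups

# Usage sketch:
# optimizer = torch.optim.AdamW(build_param_groups(model, profile))
```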
- Higher training resolution (full tuning or larger PEFT ranks) in layers closer to the head typically yields better results
- For RWKV x060, there have been reports of improved translation performance when fine-tuning only around layer 17
- Choose settings based on the trade-off between learning speed and VRAM constraints
- For PEFT, Bone is recommended as the primary choice (refer to Jl-er's paper for details)
- LoRA is effective when the dataset is very limited
- Experiment with LoRA on roughly four layers near the head (a minimal sketch follows this list)
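As a concrete illustration of the last two points, the sketch below wraps the linear projections of the final four blocks in a small LoRA adapter. It is a generic LoRA sketch, not the PEFT code shipped with RWKV-LM-RLHF; module names such as `model.blocks` are assumptions about the model layout.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update (generic LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the original weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def apply_lora_to_last_blocks(model, n_blocks: int = 4, rank: int = 16):
    """Wrap nn.Linear modules in the last `n_blocks` blocks with LoRA adapters.

    Assumes the model stores its layers in `model.blocks` (an nn.ModuleList);
    adjust the attribute names to match your checkpoint.
    """
    for block in model.blocks[-n_blocks:]:
        # Snapshot the module list first so newly inserted adapters are not revisited.
        for name, module in list(block.named_modules()):
            if isinstance(module, nn.Linear):
                parent = block
                *path, leaf = name.split(".")
                for part in path:
                    parent = getattr(parent, part)
                setattr(parent, leaf, LoRALinear(module, rank=rank))
```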
These configurations should be adjusted based on your specific requirements and resource constraints.