
Layer Profile in RWKV-LM-RLHF

Overview

RWKV-LM-RLHF enables flexible customization of the training parameters of each model layer, so that learning outcomes can be maximized under limited computational resources (VRAM). This includes per-layer control over layer freezing, full-parameter training, PEFT methods, and learning rates.

Example Layer Profile Configuration

| Layer | Mode   | Rank | Alpha | Dropout | Weight_lr_init | Weight_lr_final | Weight_decay | State_lr_init | State_lr_final |
|-------|--------|------|-------|---------|----------------|-----------------|--------------|---------------|----------------|
| emb   | freeze | 0    | 0     | 0.01    | 0.000001       | 0.0000001       | 0.01         | 0.05          | 0.01           |
| 0     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 1     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 2     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 3     | bone   | 512  | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 4     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 5     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 6     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 7     | bone   | 1024 | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 8     | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 9     | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 10    | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| 11    | full   | 0    | 32    | 0.01    | 0.0001         | 0.00001         | 0.01         | 0.05          | 0.01           |
| head  | full   | 0    | 32    | 0.01    | 0.00001        | 0.000001        | 0.01         | 0.05          | 0.01           |

Configuration Details

In this example configuration:

  • The embedding layer is frozen
  • Layers 0-3 use PEFT (Bone, rank 512)
  • Layers 4-7 use PEFT (Bone, rank 1024)
  • Layers 8-11 and the Head layer use full-parameter tuning (a parsing sketch follows this list)
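
As an illustration only, here is a minimal Python sketch of how a profile like the table above could be read from a CSV file. The file name, the CSV column headers, and the LayerProfile dataclass are assumptions made for this sketch, not necessarily the loader or on-disk format actually used by RWKV-LM-RLHF.

```python
import csv
from dataclasses import dataclass

# Minimal sketch, not the repository's actual loader: the CSV layout, file name,
# and the LayerProfile dataclass below are assumptions mirroring the table above.

@dataclass
class LayerProfile:
    layer: str              # "emb", "0" ... "11", or "head"
    mode: str               # "freeze", "bone", "lora", or "full"
    rank: int               # PEFT rank (ignored for freeze/full)
    alpha: int
    dropout: float
    weight_lr_init: float
    weight_lr_final: float
    weight_decay: float
    state_lr_init: float
    state_lr_final: float

def load_layer_profiles(path: str) -> dict[str, LayerProfile]:
    """Read one LayerProfile per row, keyed by the Layer column."""
    profiles: dict[str, LayerProfile] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            profiles[row["Layer"]] = LayerProfile(
                layer=row["Layer"],
                mode=row["Mode"],
                rank=int(row["Rank"]),
                alpha=int(row["Alpha"]),
                dropout=float(row["Dropout"]),
                weight_lr_init=float(row["Weight_lr_init"]),
                weight_lr_final=float(row["Weight_lr_final"]),
                weight_decay=float(row["Weight_decay"]),
                state_lr_init=float(row["State_lr_init"]),
                state_lr_final=float(row["State_lr_final"]),
            )
    return profiles

# Usage (hypothetical file name):
# profiles = load_layer_profiles("layer_profile.csv")
# For the table above, profiles["4"].mode == "bone" and profiles["4"].rank == 1024.
```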

Training Features

Dual Training Support

RWKV-LM-RLHF supports training weights and states simultaneously, and the learning rates for weights and for states can each be configured independently per layer.
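
The sketch below shows one way this per-layer, weight-versus-state split could be wired into a PyTorch optimizer: one parameter group per parameter, with the learning rate and weight decay picked from the layer profile. The parameter-name patterns ("blocks.N.", the "state" substring) and the profiles mapping carried over from the previous sketch are assumptions, not the project's actual optimizer setup.

```python
import torch

# Sketch only: assumes the profiles mapping from the previous example and
# placeholder parameter names; RWKV-LM-RLHF's real optimizer setup may differ.

def build_param_groups(model: torch.nn.Module, profiles) -> list:
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Map the parameter to a profile row: "blocks.<n>.*" -> "<n>", otherwise emb/head.
        if name.startswith("blocks."):
            layer_key = name.split(".")[1]
        elif "emb" in name:
            layer_key = "emb"
        else:
            layer_key = "head"
        prof = profiles[layer_key]
        # Assumption: trainable state tensors can be recognized by name.
        is_state = "state" in name
        groups.append({
            "params": [param],
            "lr": prof.state_lr_init if is_state else prof.weight_lr_init,
            "weight_decay": 0.0 if is_state else prof.weight_decay,
        })
    return groups

# optimizer = torch.optim.AdamW(build_param_groups(model, profiles))
```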

Empirical Findings

  • Higher training resolution (e.g. higher PEFT rank or full-parameter tuning) in layers closer to the Head typically yields better learning results
  • For RWKV x060, there have been reports of improved translation performance when fine-tuning only around layer 17 (see the sketch below)
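
For concreteness, the hypothetical snippet below expresses that "tune only around layer 17" idea in the same illustrative terms as the earlier sketches: freeze everything except a few blocks near layer 17. The layer count and learning rate are placeholder values, not figures from the report.

```python
# Hypothetical illustration of "fine-tune only around layer 17": freeze every layer
# except blocks 16-18. NUM_LAYERS and the learning rate are placeholders.
NUM_LAYERS = 24
TUNED = {16, 17, 18}

profile = {"emb": {"mode": "freeze"}, "head": {"mode": "freeze"}}
for i in range(NUM_LAYERS):
    profile[str(i)] = {
        "mode": "full" if i in TUNED else "freeze",
        "weight_lr_init": 1e-4 if i in TUNED else 0.0,
    }
```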

Recommendations

Parameter Selection

  • Configure settings based on the balance between learning speed and VRAM constraints
  • For PEFT applications, Bone is recommended as the primary choice (refer to Jl-er's paper for details)
  • LoRA is effective for scenarios with significantly limited datasets
    • Experiment with LoRA on approximately four layers near the Head layer (a minimal sketch follows this list)
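
A minimal sketch of that last suggestion, assuming a PyTorch model whose layers are exposed as model.blocks with ordinary nn.Linear projections: wrap the Linear layers of the last four blocks in a small LoRA adapter and freeze everything else. The LoRALinear class, the module names, and the model.blocks attribute are illustrative assumptions, not the repository's PEFT implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32, dropout: float = 0.01):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.lora_A, std=0.02)  # B stays zero, so training starts at the base output
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return self.base(x) + lora_out * self.scaling

def apply_lora_to_last_blocks(model: nn.Module, num_blocks: int = 4, rank: int = 16):
    """Freeze the whole model, then add LoRA adapters to Linear layers in the last blocks.

    Assumes the model exposes its layers as `model.blocks` (hypothetical attribute).
    """
    for p in model.parameters():
        p.requires_grad_(False)
    for block in list(model.blocks)[-num_blocks:]:
        for name, module in list(block.named_modules()):
            if isinstance(module, nn.Linear):
                parent = block.get_submodule(name.rsplit(".", 1)[0]) if "." in name else block
                setattr(parent, name.rsplit(".", 1)[-1], LoRALinear(module, rank=rank))
```

Because lora_B starts at zero, the adapted model initially reproduces the frozen base model, and only the low-rank updates in the last few blocks are trained.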

These configurations should be adjusted based on your specific requirements and resource constraints.