Pre-Instruct Tuning
Pre-Instruct Tuning is a crucial step in training RWKV-LM-RLHF models to understand and follow instructions effectively. This process helps the model develop basic instruction-following capabilities before more advanced training stages.
Think of Pre-Instruct Tuning as teaching a student the basic rules of conversation. Just as students learn to raise their hands before speaking in class, the model learns the basic patterns of responding to questions and following instructions.
- Recommended: 24GB VRAM GPU
- Compatible GPUs: RTX3090, RTX4090, AMD MI100
- Possible compatibility with 16GB GPUs (unverified)
- Confirmed working on Ubuntu 22.04 and 24.04
- Recommendation: Disable Wayland for training stability
- RWKV v6 (Finch)
- RWKV v7 (Goose)
- Note: Currently no cross-compatibility between versions
Using the RWKV v6 1.6B model as a reference:
- Download link: RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth (a download sketch follows below)
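If you prefer fetching the checkpoint from Python, the sketch below uses huggingface_hub. The repository id BlinkDL/rwkv-6-world and the target directory are assumptions, not part of this page; adjust them to wherever you keep your models.

```python
# Minimal download sketch. Assumption: the checkpoint is hosted in the
# BlinkDL/rwkv-6-world repository on Hugging Face (repo id is not stated on this page).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="BlinkDL/rwkv-6-world",
    filename="RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth",
    local_dir="myfolder/models",  # matches the model path used in the commands below
)
print(path)
```

With the base model in place, prepare the SFT dataset as described next.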
- Structure: Pairs of Instruct and Output
- File format: JSONL
- Example format (see the generation sketch after this list):
  {"text":"User: Gooday! how are you?\n\n\x17Assistant: Gooday! I'm RWKV Assistant how can i help you?\n\n\x17"}
- End tokens:
  - Basic end token: '\n\n'
  - Recommended end token: '\n\n\x17'
  - Rationale: prevents confusion with the double line breaks that commonly appear in markdown
- Recommended: 100k+ instruction-response pairs
- Benefits: helps prevent overfitting
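A minimal generation sketch, assuming the record layout shown above: each line is a JSON object with a single "text" field, and both turns end with the recommended end token. The output file name and the sample pairs are placeholders.

```python
import json

END = "\n\n\x17"  # recommended end token (double newline + 0x17 control character)

# Placeholder instruction-response pairs; a real dataset should contain 100k+ of these.
pairs = [
    ("Gooday! how are you?", "Gooday! I'm RWKV Assistant how can i help you?"),
    ("What is Pre-Instruct Tuning?", "A first SFT stage that teaches basic instruction following."),
]

# Placeholder file name inside the folder used by the conversion command below.
with open("example/SFT/output_jsonl/sample.jsonl", "w", encoding="utf-8") as f:
    for instruct, output in pairs:
        text = f"User: {instruct}{END}Assistant: {output}{END}"
        # json.dumps escapes the 0x17 byte as \u0017; it decodes back to the same character.
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```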
Convert folders containing JSONL datasets using the following command:
python sft_generate_h5.py --input_folder 'example/SFT/output_jsonl' \
--output_parquet 'example/SFT/output_h5/sftdataset.h5' \
--load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth'
- `--input_folder`: Directory containing JSONL files
- `--output_parquet`: Output path for the tokenized dataset (single file)
- `--load_model`: Model path (currently required)
Note: HDF5 (via h5py) is used as the dataset container because of its excellent random-access performance and low RAM usage.
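As a quick sanity check, you can open the generated file with h5py and list its contents; rows are read lazily, so this works even for files much larger than RAM. The internal dataset names depend on the converter, so the sketch below does not assume any particular key.

```python
import h5py

# Path produced by sft_generate_h5.py above; adjust to your own output location.
with h5py.File("example/SFT/output_h5/sftdataset.h5", "r") as f:
    def show(name, obj):
        # Print every dataset stored in the container together with its shape and dtype.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)

    f.visititems(show)
```

Once the dataset looks correct, start Pre-Instruct training with the command below.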
python train.py --load_model "myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth" \
--wandb "RWKV-LM-RLHF 1B6 SFT Pre-Instruct" --proj_dir "myfolder/Outputs/1b6-pre-instruct" \
--infctx 0 \
--vocab_size 65536 --ctx_len 4096 \
--epoch_steps 200 --epoch_count 200 --epoch_begin 0 --epoch_save 1 \
--micro_bsz 1 --n_layer 24 --n_embd 2048 \
--lr_init 2e-5 --lr_final 1e-6 \
--warmup_steps 100 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision 'bf16' \
--grad_cp 1 --my_testing "x060" \
--strategy deepspeed_stage_2_offload \
--layer_profile 'layerprofile/24_TEST.csv' \
--quant 1 \
--quant_mode 'nf4' \
--gpu_arch 'cuda' \
--limited_lora 0 \
--sft 1 \
--smoothing 0.005 \
--random_mode 1 \
--optim '' \
--train_data_file 'example/SFT/output_h5/sftdataset.h5' \
--infctx_dataset_multiplier 8 \
--accumulate_grad_batches 16
- `--load_model`: Source model path
- `--ctx_len`: Training context length
- `--layer_profile`: Layer configuration file (see the LayerProfile page)
- `--smoothing`: Logits averaging coefficient (recommended range: 0.001-0.05). Higher values improve transfer performance but may mix languages. See the sketch after this list.
- `--random_mode`: Enables random dataset selection
- `--train_data_file`: Dataset path
- `--infctx_dataset_multiplier`: Number of dataset items to load per training step
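For intuition about the smoothing coefficient, here is a minimal sketch of label-smoothed cross-entropy on dummy logits. Whether train.py applies `--smoothing` exactly like PyTorch's built-in label smoothing is an assumption; the shapes and values below are placeholders.

```python
import torch
import torch.nn.functional as F

vocab_size = 65536   # matches --vocab_size
smoothing = 0.005    # matches --smoothing

# Dummy next-token logits and targets standing in for one training step.
logits = torch.randn(8, vocab_size)           # (tokens, vocab)
targets = torch.randint(0, vocab_size, (8,))  # (tokens,)

# Label smoothing mixes the one-hot target with a uniform distribution:
# target_dist = (1 - smoothing) * one_hot + smoothing / vocab_size
loss = F.cross_entropy(logits, targets, label_smoothing=smoothing)
print(loss.item())
```

A few practical notes on this stage: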
- Highly effective for large datasets
- Convergence time: ~3 days for 800k pairs on a single RTX4090
- Enhanced multi-turn performance through `--infctx_dataset_multiplier` (see the conceptual sketch after this list)
- Effectively creates multi-turn learning scenarios
- Significantly improves multi-turn conversation capabilities
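A conceptual sketch of that last point: if several single-turn samples are drawn per step and packed into one context window, the resulting sequence resembles a multi-turn dialogue. Whether the trainer packs items exactly this way is an assumption; the snippet only illustrates the idea.

```python
END = "\n\n\x17"  # recommended end token

# Three single-turn samples drawn from the dataset (placeholder content).
samples = [
    "User: What is RWKV?" + END + "Assistant: An RNN-style language model." + END,
    "User: How much VRAM do I need?" + END + "Assistant: 24GB is recommended." + END,
    "User: Thanks!" + END + "Assistant: You're welcome!" + END,
]

# With --infctx_dataset_multiplier 8, roughly eight such items are drawn per step;
# concatenating them up to --ctx_len yields a pseudo multi-turn conversation.
packed = "".join(samples)
print(packed)
```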