Pre-Instruct Tuning
Pre-Instruct Tuning is a crucial step in training RWKV-LM-RLHF models to understand and follow instructions effectively. This process helps the model develop basic instruction-following capabilities before more advanced training stages.
Think of Pre-Instruct Tuning as teaching a student the basic rules of conversation. Just as students learn to raise their hands before speaking in class, the model learns the basic patterns of responding to questions and following instructions.
- Recommended: 24GB VRAM GPU
- Compatible GPUs: RTX3090, RTX4090, AMD MI100
- Possible compatibility with 16GB GPUs (unverified)
- Confirmed working on Ubuntu 22.04 and 24.04
- Recommendation: Disable Wayland for training stability
- RWKV v6 (Finch)
- RWKV v7 (Goose)
- Note: Currently no cross-compatibility between versions
Using the RWKV v6 1.6B model as a reference:
- Download link: RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth (a download sketch follows below)
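If you prefer fetching the checkpoint from Python, the sketch below uses huggingface_hub. The repository id BlinkDL/rwkv-6-world and the target directory are assumptions, not part of this page; adjust them to wherever you keep your models.

```python
# Minimal download sketch. Assumption: the checkpoint is hosted in the
# BlinkDL/rwkv-6-world repository on Hugging Face (repo id is not stated on this page).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="BlinkDL/rwkv-6-world",
    filename="RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth",
    local_dir="myfolder/models",  # matches the model path used in the commands below
)
print(path)
```

With the base model in place, prepare the SFT dataset as described next.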
- Structure: Pairs of Instruct and Output
- File format: JSONL
- Example format (see the generation sketch after this list):
  {"text":"User: Gooday! how are you?\n\n\x17Assistant: Gooday! I'm RWKV Assistant how can i help you?\n\n\x17"}
- End tokens:
  - Basic end token: '\n\n'
  - Recommended end token: '\n\n\x17'
  - Rationale: prevents confusion with the double line breaks that commonly appear in markdown
- Recommended: 100k+ instruction-response pairs
- Benefits: helps prevent overfitting
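A minimal generation sketch, assuming the record layout shown above: each line is a JSON object with a single "text" field, and both turns end with the recommended end token. The output file name and the sample pairs are placeholders.

```python
import json

END = "\n\n\x17"  # recommended end token (double newline + 0x17 control character)

# Placeholder instruction-response pairs; a real dataset should contain 100k+ of these.
pairs = [
    ("Gooday! how are you?", "Gooday! I'm RWKV Assistant how can i help you?"),
    ("What is Pre-Instruct Tuning?", "A first SFT stage that teaches basic instruction following."),
]

# Placeholder file name inside the folder used by the conversion command below.
with open("example/SFT/output_jsonl/sample.jsonl", "w", encoding="utf-8") as f:
    for instruct, output in pairs:
        text = f"User: {instruct}{END}Assistant: {output}{END}"
        # json.dumps escapes the 0x17 byte as \u0017; it decodes back to the same character.
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```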
Convert folders containing JSONL datasets using the following command:
python sft_generate_h5.py --input_folder 'example/SFT/output_jsonl' \
--output_parquet 'example/SFT/output_h5/sftdataset.h5' \
--load_model 'myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth'
- `--input_folder`: Directory containing JSONL files
- `--output_parquet`: Output path for the tokenized dataset (single file)
- `--load_model`: Model path (currently required)
Note: HDF5 (via h5py) is used as the dataset container because of its excellent random-access performance and low RAM usage.
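As a quick sanity check, you can open the generated file with h5py and list its contents; rows are read lazily, so this works even for files much larger than RAM. The internal dataset names depend on the converter, so the sketch below does not assume any particular key.

```python
import h5py

# Path produced by sft_generate_h5.py above; adjust to your own output location.
with h5py.File("example/SFT/output_h5/sftdataset.h5", "r") as f:
    def show(name, obj):
        # Print every dataset stored in the container together with its shape and dtype.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)

    f.visititems(show)
```

Once the dataset looks correct, start Pre-Instruct training with the command below.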
python train.py --load_model "myfolder/models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth" \
--wandb "RWKV-LM-RLHF 1B6 SFT Pre-Instruct" --proj_dir "myfolder/Outputs/1b6-pre-instruct" \
--infctx 0 \
--vocab_size 65536 --ctx_len 4096 \
--epoch_steps 200 --epoch_count 200 --epoch_begin 0 --epoch_save 1 \
--micro_bsz 1 --n_layer 24 --n_embd 2048 \
--lr_init 2e-5 --lr_final 1e-6 \
--warmup_steps 100 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision 'bf16' \
--grad_cp 1 --my_testing "x060" \
--strategy deepspeed_stage_2_offload \
--layer_profile 'layerprofile/24_TEST.csv' \
--quant 1 \
--quant_mode 'nf4' \
--gpu_arch 'cuda' \
--limited_lora 0 \
--sft 1 \
--smoothing 0.005 \
--random_mode 1 \
--optim '' \
--train_data_file 'example/SFT/output_h5/sftdataset.h5' \
--infctx_dataset_multiplier 8 \
--accumulate_grad_batches 16
- `--load_model`: Source model path
- `--ctx_len`: Training context length
- `--layer_profile`: Layer configuration file (see the LayerProfile page)
- `--smoothing`: Logits averaging coefficient (recommended range: 0.001-0.05). Higher values improve transfer performance but may mix languages. See the sketch after this list.
- `--random_mode`: Enables random dataset selection
- `--train_data_file`: Dataset path
- `--infctx_dataset_multiplier`: Number of dataset items to load per training step
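For intuition about the smoothing coefficient, here is a minimal sketch of label-smoothed cross-entropy on dummy logits. Whether train.py applies `--smoothing` exactly like PyTorch's built-in label smoothing is an assumption; the shapes and values below are placeholders.

```python
import torch
import torch.nn.functional as F

vocab_size = 65536   # matches --vocab_size
smoothing = 0.005    # matches --smoothing

# Dummy next-token logits and targets standing in for one training step.
logits = torch.randn(8, vocab_size)           # (tokens, vocab)
targets = torch.randint(0, vocab_size, (8,))  # (tokens,)

# Label smoothing mixes the one-hot target with a uniform distribution:
# target_dist = (1 - smoothing) * one_hot + smoothing / vocab_size
loss = F.cross_entropy(logits, targets, label_smoothing=smoothing)
print(loss.item())
```

A few practical notes on this stage: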
- Highly effective for large datasets
- Convergence time: ~3 days for 800k pairs on a single RTX4090
- Enhanced multi-turn performance through `--infctx_dataset_multiplier` (see the conceptual sketch after this list)
- Effectively creates multi-turn learning scenarios
- Significantly improves multi-turn conversation capabilities
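A conceptual sketch of that last point: if several single-turn samples are drawn per step and packed into one context window, the resulting sequence resembles a multi-turn dialogue. Whether the trainer packs items exactly this way is an assumption; the snippet only illustrates the idea.

```python
END = "\n\n\x17"  # recommended end token

# Three single-turn samples drawn from the dataset (placeholder content).
samples = [
    "User: What is RWKV?" + END + "Assistant: An RNN-style language model." + END,
    "User: How much VRAM do I need?" + END + "Assistant: 24GB is recommended." + END,
    "User: Thanks!" + END + "Assistant: You're welcome!" + END,
]

# With --infctx_dataset_multiplier 8, roughly eight such items are drawn per step;
# concatenating them up to --ctx_len yields a pseudo multi-turn conversation.
packed = "".join(samples)
print(packed)
```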