huggingface · regisss · Dec 12, 2024
@@ -926,6 +926,53 @@ python3 ../gaudi_spawn.py --world_size 8 --use_mpi peft_poly_seq2seq_with_genera
     --trust_remote_code
 ```
 
+### Training models with long sequence lengths
+We have added support for [DeepSpeed Ulysses](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ulysses/README.md), which makes it possible to train large transformer models on very long input sequences with limited hardware resources. This feature has been tested by fine-tuning Llama 3.1 8B and Llama 3.1 70B with input sequence lengths of 32k on 8 Gaudi3 cards. A reference command for Llama 3.1 8B fine-tuning is shared below.
+
+`--context_parallel_size` sets the number of cards that each input sequence is split across. For example, setting `--context_parallel_size 4` with `--max_seq_len 32768` results in each card processing chunks of length 8k (8192 tokens), which reduces the memory required for activations. This feature can be combined with ZeRO-3 to scale not only to long sequences but also to large models.
+
+> [!NOTE]  
+> This feature is still in beta and may not work out of the box for all transformer model architectures and configurations.
+
+```bash
+HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM=1   \
+python3 ../gaudi_spawn.py  \
+        --world_size 8  --use_deepspeed run_lora_clm.py \
+        --model_name_or_path meta-llama/Llama-3.1-8B \
+        --dataset_name tatsu-lab/alpaca \
+        --bf16 True \
+        --output_dir /tmp/lora_out \
+        --max_seq_len 32768 \
+        --per_device_train_batch_size 1 \
+        --per_device_eval_batch_size 1 \
+        --save_strategy no \
+        --learning_rate 0.0004 \
+        --warmup_ratio 0.03 \
+        --lr_scheduler_type "constant" \
+        --logging_steps 1 \
+        --dataset_concatenation \
+        --do_train \
+        --use_habana \
+        --throughput_warmup_steps 3 \
+        --lora_rank 8 \
+        --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
+        --attn_softmax_bf16 True \
+        --validation_split_percentage 4 \
+        --flash_attention_causal_mask True \
+        --evaluation_strategy epoch \
+        --pipelining_fwd_bwd \
+        --use_lazy_mode \
+        --use_flash_attention True \
+        --deepspeed llama3_ds_zero1_config.json \
+        --num_train_epochs 3 \
+        --eval_delay 3 \
+        --do_eval \
+        --lora_alpha 16 \
+        --lora_dropout 0.05 \
+        --gradient_accumulation_steps 4 \
+        --flash_attention_recompute True \
+        --context_parallel_size 4
+```
 
 ## Streaming
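
The `--context_parallel_size` paragraph in the hunk above describes a simple split of each sequence across cards. A minimal sketch of that arithmetic follows. The helper names are hypothetical and not part of `run_lora_clm.py`. The divisibility checks and the data-parallel group count (`world_size / context_parallel_size`) are assumptions based on how sequence parallelism is usually set up, not something the hunk states.

```python
def sequence_chunk_length(max_seq_len: int, context_parallel_size: int) -> int:
    """Tokens of one input sequence handled by each card.

    Hypothetical helper for illustration only. Assumes the sequence length
    must divide evenly across the context-parallel cards.
    """
    if context_parallel_size < 1:
        raise ValueError("context_parallel_size must be >= 1")
    if max_seq_len % context_parallel_size != 0:
        raise ValueError("max_seq_len must be divisible by context_parallel_size")
    return max_seq_len // context_parallel_size


def data_parallel_groups(world_size: int, context_parallel_size: int) -> int:
    """Number of groups of cards that each process a different sequence.

    Assumption: the cards are partitioned into equal context-parallel groups,
    so world_size must be a multiple of context_parallel_size.
    """
    if context_parallel_size < 1:
        raise ValueError("context_parallel_size must be >= 1")
    if world_size % context_parallel_size != 0:
        raise ValueError("world_size must be divisible by context_parallel_size")
    return world_size // context_parallel_size


if __name__ == "__main__":
    # Values taken from the reference command in the hunk above.
    print(sequence_chunk_length(32768, 4))  # 8192 tokens per card
    print(data_parallel_groups(8, 4))       # 2 groups under the stated assumption
```

With the reference command's values, each card sees 8192 of the 32768 tokens, which matches the "8k" figure in the README text.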