47 changes: 47 additions & 0 deletions examples/language-modeling/README.md
@@ -926,6 +926,53 @@ python3 ../gaudi_spawn.py --world_size 8 --use_mpi peft_poly_seq2seq_with_genera
--trust_remote_code
```

### Training Models with Long Sequence Lengths
We have added support for [DeepSpeed Ulysses](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ulysses/README.md), which makes it possible to train large transformer models on very long input sequences with limited hardware resources. The feature has been tested by fine-tuning Llama 3.1 8B and Llama 3.1 70B with input sequence lengths of 32k on 8 Gaudi3 cards. A reference command for Llama 3.1 8B fine-tuning is shown below.

`--context_parallel_size` sets the number of cards that each input sequence is split across. For example, setting `context_parallel_size=4` with `max_seq_len=32k` results in each card processing chunks of length 8k, which reduces the memory required for activations. This feature can be combined with ZeRO-3 to scale to large models as well as long sequences.
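
As a quick sanity check of the chunk-size arithmetic above, here is a minimal Python sketch. The helper name `tokens_per_card` is hypothetical and not part of this repository, and the requirement that the sequence length divide evenly is an assumption made for this illustration only.

```python
def tokens_per_card(max_seq_len: int, context_parallel_size: int) -> int:
    """Return the length of the sequence chunk each card processes.

    Hypothetical helper for illustration; not part of run_lora_clm.py.
    """
    if context_parallel_size < 1:
        raise ValueError("context_parallel_size must be at least 1")
    # Assumption for this sketch: the sequence is split into equal chunks,
    # so its length must be divisible by the context-parallel size.
    if max_seq_len % context_parallel_size != 0:
        raise ValueError("max_seq_len must be divisible by context_parallel_size")
    return max_seq_len // context_parallel_size


# Values taken from the example above: 32k tokens split across 4 cards.
print(tokens_per_card(32768, 4))  # 8192
```

With `context_parallel_size=1` the helper returns the full sequence length, which corresponds to no sequence splitting.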

> [!NOTE]
> This feature is still in beta and may not work out of the box for all transformer model architectures and configurations.

```bash
HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM=1 \
python3 ../gaudi_spawn.py \
--world_size 8 --use_deepspeed run_lora_clm.py \
--model_name_or_path meta-llama/Llama-3.1-8B \
--dataset_name tatsu-lab/alpaca \
--bf16 True \
--output_dir /tmp/lora_out \
--max_seq_len 32768 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--save_strategy no \
--learning_rate 0.0004 \
--warmup_ratio 0.03 \
--lr_scheduler_type "constant" \
--logging_steps 1 \
--dataset_concatenation \
--do_train \
--use_habana \
--throughput_warmup_steps 3 \
--lora_rank 8 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--attn_softmax_bf16 True \
--validation_split_percentage 4 \
--flash_attention_causal_mask True \
--evaluation_strategy epoch \
--pipelining_fwd_bwd \
--use_lazy_mode \
--use_flash_attention True \
--deepspeed llama3_ds_zero1_config.json \
--num_train_epochs 3 \
--eval_delay 3 \
--do_eval \
--lora_alpha 16 \
--lora_dropout 0.05 \
--gradient_accumulation_steps 4 \
--flash_attention_recompute True \
--context_parallel_size 4
```

## Streaming
