From 7b8baf8460cd2f2f64fe6a66f7124dbc07fd5250 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Quentin=20Gallou=C3=A9dec?= Date: Mon, 1 Dec 2025 18:03:21 +0000 Subject: [PATCH] Update how-to guide --- docs/source/reducing_memory_usage.md | 15 ++++++- docs/source/speeding_up_training.md | 64 +++++++++------------------- 2 files changed, 33 insertions(+), 46 deletions(-) diff --git a/docs/source/reducing_memory_usage.md b/docs/source/reducing_memory_usage.md index 663e7ef759a..5615c729724 100644 --- a/docs/source/reducing_memory_usage.md +++ b/docs/source/reducing_memory_usage.md @@ -116,10 +116,9 @@ trainer = SFTTrainer( PEFT can be combined with other memory reduction techniques such as quantization (4-bit or 8-bit) for even greater memory savings. See [PEFT Integration](peft_integration) for quantization examples. - ## Liger for reducing peak memory usage -> [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%. +[Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%. For more information, see [Liger Kernel Integration](liger_kernel_integration). @@ -319,3 +318,15 @@ training_args = RLOOConfig(..., vllm_enable_sleep_mode=True) Offloading the vLLM weights and cache helps keep GPU memory usage low, which can be particularly beneficial when training large models or using limited GPU resources. However, waking the vLLM engine from sleep mode introduces some host–device transfer latency, which may slightly impact training speed. + +## Gradient checkpointing + +Gradient checkpointing trades compute for memory by not storing all intermediate activations during the forward pass, recomputing them during the backward pass instead. 
+ +```python +from trl import SFTConfig + +training_args = SFTConfig(..., gradient_checkpointing=True) +``` + +Gradient checkpointing is available and activated by default across all TRL trainers. For more memory optimization techniques, see the [Transformers Performance Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing). diff --git a/docs/source/speeding_up_training.md b/docs/source/speeding_up_training.md index feff36cc7ec..7f101f41db8 100644 --- a/docs/source/speeding_up_training.md +++ b/docs/source/speeding_up_training.md @@ -103,20 +103,9 @@ You can customize the server configuration by passing additional arguments. For ## Optimized attention implementations -TRL supports various optimized attention implementations that can significantly speed up training while reducing memory usage. You can use either locally installed backends (like Flash Attention 2) or pull pre-optimized kernels directly from the [Kernels Hub](kernels_hub). +TRL supports various optimized attention implementations that can significantly speed up training while reducing memory usage. You can use either pre-optimized kernels pulled directly from the [Kernels Hub](kernels_hub) or a manually built attention backend. - - -To enable Flash Attention 2, pass `attn_implementation="flash_attention_2"` in the model initialization arguments: - -```python -from trl import SFTConfig - -training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "flash_attention_2"}) -``` - - You can use pre-optimized attention kernels from the Hub without manual compilation: @@ -129,40 +118,30 @@ training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "kernel Other options include `kernels-community/vllm-flash-attn3` and `kernels-community/paged-attention`. - - +Optimized attention works across all TRL trainers. For more details, see [Kernels Hub Integration](kernels_hub). -Optimized attention works across all TRL trainers. 
For more details, see [Kernels Hub Integration](kernels_hub) and [Reducing Memory Usage](reducing_memory_usage#padding-free). + + -## PEFT for parameter-efficient training +> [!WARNING] +> Manually building optimized attention backends is complex and time-consuming. It is not recommended unless absolutely necessary. Consider using Kernels from the Hub instead, as described in the previous section. -[PEFT](https://huggingface.co/docs/peft/index) (Parameter-Efficient Fine-Tuning) methods like LoRA significantly reduce memory usage and training time by only training a small number of adapter parameters instead of the full model. +If you have manually installed an optimized attention backend like Flash Attention 2, you can specify it in the training arguments: ```python -from peft import LoraConfig -from trl import SFTConfig, SFTTrainer - -peft_config = LoraConfig( - r=16, - lora_alpha=32, - lora_dropout=0.05, - target_modules=["q_proj", "v_proj"], -) - -trainer = SFTTrainer( - model="Qwen/Qwen2.5-0.5B", - peft_config=peft_config, - args=training_args, -) +from trl import SFTConfig + +training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "flash_attention_2"}) ``` -For more details, see [PEFT Integration](peft_integration). + + ## Liger Kernel for memory optimization Liger Kernel is a collection of Triton kernels designed for LLM training that can increase throughput by 20% and reduce memory usage by 60%. - + ```python @@ -199,21 +178,18 @@ training_args = KTOConfig(..., use_liger_kernel=True) ``` - - -For more information, see [Liger Kernel Integration](liger_kernel_integration). - -## Gradient checkpointing for memory savings - -Gradient checkpointing trades compute for memory by not storing all intermediate activations during the forward pass, recomputing them during the backward pass instead. 
+ ```python -from trl import SFTConfig +from trl.experimental.gkd import GKDConfig -training_args = SFTConfig(..., gradient_checkpointing=True) +training_args = GKDConfig(..., use_liger_kernel=True) ``` -Gradient checkpointing is available across all TRL trainers. For more memory optimization techniques, see the [Transformers Performance Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing). + + + +For more information, see [Liger Kernel Integration](liger_kernel_integration). ## Mixed precision training