Merged
15 changes: 13 additions & 2 deletions docs/source/reducing_memory_usage.md
@@ -116,10 +116,9 @@ trainer = SFTTrainer(

PEFT can be combined with other memory reduction techniques such as quantization (4-bit or 8-bit) for even greater memory savings. See [PEFT Integration](peft_integration) for quantization examples.


## Liger for reducing peak memory usage

> [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.
[Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.

For more information, see [Liger Kernel Integration](liger_kernel_integration).

@@ -319,3 +318,15 @@ training_args = RLOOConfig(..., vllm_enable_sleep_mode=True)
</hfoptions>

Offloading the vLLM weights and cache helps keep GPU memory usage low, which can be particularly beneficial when training large models or using limited GPU resources. However, waking the vLLM engine from sleep mode introduces some host–device transfer latency, which may slightly impact training speed.

## Gradient checkpointing

Gradient checkpointing trades compute for memory by not storing all intermediate activations during the forward pass, recomputing them during the backward pass instead.

```python
from trl import SFTConfig

training_args = SFTConfig(..., gradient_checkpointing=True)
```

Gradient checkpointing is available and activated by default across all TRL trainers. For more memory optimization techniques, see the [Transformers Performance Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing).
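
To make the compute-for-memory trade concrete, here is a minimal pure-Python sketch of the idea on a toy chain of scalar layers. It is not how TRL or Transformers implement checkpointing; the names `LAYERS`, `grad_full`, and `grad_checkpointed` are hypothetical and exist only for this illustration. The checkpointed version keeps one input per segment during the forward pass and recomputes the inputs inside each segment during the backward pass.

```python
import math

# Toy model (an assumption for illustration): 16 identical scalar layers,
# each given as (forward function, derivative function).
LAYERS = [(math.tanh, lambda x: 1.0 - math.tanh(x) ** 2)] * 16


def grad_full(x, layers):
    """Backpropagate while storing the input of every layer."""
    inputs = []
    for forward, _ in layers:
        inputs.append(x)
        x = forward(x)
    grad = 1.0
    for (_, derivative), inp in zip(reversed(layers), reversed(inputs)):
        grad *= derivative(inp)
    return grad, len(inputs)  # gradient, number of stored values


def grad_checkpointed(x, layers, segment):
    """Backpropagate while storing only the input of each segment."""
    checkpoints = []
    for i, (forward, _) in enumerate(layers):
        if i % segment == 0:
            checkpoints.append(x)  # keep one value per segment
        x = forward(x)

    grad = 1.0
    peak = 0
    for s in reversed(range(len(checkpoints))):
        seg_layers = layers[s * segment:(s + 1) * segment]
        # Recompute the inputs of this segment from its checkpoint.
        inputs = []
        h = checkpoints[s]
        for forward, _ in seg_layers:
            inputs.append(h)
            h = forward(h)
        # Simple count of values held at once: all checkpoints plus the
        # recomputed inputs of the current segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for (_, derivative), inp in zip(reversed(seg_layers), reversed(inputs)):
            grad *= derivative(inp)
    return grad, peak


g_full, n_full = grad_full(0.5, LAYERS)
g_ckpt, n_ckpt = grad_checkpointed(0.5, LAYERS, segment=4)
print(g_full, n_full)
print(g_ckpt, n_ckpt)
```

In this toy setup both functions return the same gradient, while the checkpointed version holds fewer values at once (8 instead of 16 with `segment=4`) at the cost of running each layer's forward function a second time.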
Comment on lines +321 to +332


this section was moved from "speeding up"

64 changes: 20 additions & 44 deletions docs/source/speeding_up_training.md
@@ -103,20 +103,9 @@ You can customize the server configuration by passing additional arguments. For

## Optimized attention implementations

TRL supports various optimized attention implementations that can significantly speed up training while reducing memory usage. You can use either locally installed backends (like Flash Attention 2) or pull pre-optimized kernels directly from the [Kernels Hub](kernels_hub).
TRL supports various optimized attention implementations that can significantly speed up training while reducing memory usage. You can use either pre-optimized kernels directly from the [Kernels Hub](kernels_hub) or a manually built attention backend.

<hfoptions id="attention examples">
<hfoption id="Flash Attention 2">

To enable Flash Attention 2, pass `attn_implementation="flash_attention_2"` in the model initialization arguments:

```python
from trl import SFTConfig

training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "flash_attention_2"})
```

</hfoption>
<hfoption id="Kernels from Hub">

You can use pre-optimized attention kernels from the Hub without manual compilation:
@@ -129,40 +118,30 @@ training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "kernel

Other options include `kernels-community/vllm-flash-attn3` and `kernels-community/paged-attention`.

</hfoption>
</hfoptions>
Optimized attention works across all TRL trainers. For more details, see [Kernels Hub Integration](kernels_hub).

Optimized attention works across all TRL trainers. For more details, see [Kernels Hub Integration](kernels_hub) and [Reducing Memory Usage](reducing_memory_usage#padding-free).
</hfoption>
<hfoption id="Manual build">

## PEFT for parameter-efficient training
> [!WARNING]
> Manually building optimized attention backends is complex and time-consuming, and it is not recommended unless absolutely necessary. Consider using Kernels from the Hub instead, as described in the previous section.

[PEFT](https://huggingface.co/docs/peft/index) (Parameter-Efficient Fine-Tuning) methods like LoRA significantly reduce memory usage and training time by only training a small number of adapter parameters instead of the full model.
If you have manually installed an optimized attention backend like Flash Attention 2, you can specify it in the training arguments:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj"],
)

trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
peft_config=peft_config,
args=training_args,
)
from trl import SFTConfig

training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "flash_attention_2"})
```

For more details, see [PEFT Integration](peft_integration).
</hfoption>
</hfoptions>

## Liger Kernel for memory optimization

Liger Kernel is a collection of Triton kernels designed for LLM training that can increase throughput by 20% and reduce memory usage by 60%.

<hfoptions id="liger examples">
<hfoptions id="liger">
<hfoption id="SFT">

```python
@@ -199,21 +178,18 @@ training_args = KTOConfig(..., use_liger_kernel=True)
```

</hfoption>
</hfoptions>

For more information, see [Liger Kernel Integration](liger_kernel_integration).

## Gradient checkpointing for memory savings

Gradient checkpointing trades compute for memory by not storing all intermediate activations during the forward pass, recomputing them during the backward pass instead.
<hfoption id="GKD">

```python
from trl import SFTConfig
from trl.experimental.gkd import GKDConfig

training_args = SFTConfig(..., gradient_checkpointing=True)
training_args = GKDConfig(..., use_liger_kernel=True)
```

Gradient checkpointing is available across all TRL trainers. For more memory optimization techniques, see the [Transformers Performance Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing).
</hfoption>
</hfoptions>

For more information, see [Liger Kernel Integration](liger_kernel_integration).

## Mixed precision training
