Merge remote-tracking branch 'origin/main' into flux_ae
calvinpelletier committed Dec 2, 2024
2 parents e7504b6 + 32e265d commit aa8b751
Showing 33 changed files with 521 additions and 84 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/build_linux_wheels.yaml
@@ -36,6 +36,8 @@ jobs:
with:
repository: pytorch/torchtune
ref: ""
test-infra-repository: pytorch/test-infra
test-infra-ref: main
package-name: torchtune
build-matrix: ${{ needs.generate-matrix.outputs.matrix }}
pre-script: .github/scripts/pre_build_script.sh
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -113,6 +113,7 @@ torchtune tutorials.
recipes/recipes_overview
recipes/lora_finetune_single_device
recipes/qat_distributed
recipes/dpo

.. toctree::
:glob:
75 changes: 75 additions & 0 deletions docs/source/recipes/dpo.rst
@@ -0,0 +1,75 @@
.. _dpo_recipe_label:

====================================
Direct Preference Optimization
====================================

This recipe supports several `Direct Preference Optimization <https://arxiv.org/abs/2305.18290>`_ (DPO)-style fine-tuning techniques.
These techniques aim to steer (or `align <https://en.wikipedia.org/wiki/AI_alignment>`_) a model towards some desirable behaviours.
For example, a common goal is to train language models to produce safe and honest outputs,
or to be `helpful and harmless <https://arxiv.org/abs/2204.05862>`_.

To see the best results when using this recipe, it may be helpful to first fine-tune your model using supervised fine-tuning (SFT) to ensure it is
on-distribution for the domain you're interested in. To do this, check out our other fine-tuning recipes in the :ref:`recipe overview <recipes_overview_label>`, which
support a variety of SFT paradigms.

After supervised fine-tuning, here is an example of DPO with Llama 3.1 8B:

.. note::

You may need to be granted access to the Llama model you're interested in. See
:ref:`here <download_llama_label>` for details on accessing gated repositories.


.. code-block:: bash

    tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
    --ignore-patterns "original/consolidated.00.pth" \
    --hf-token <HF_TOKEN>

    # run on a single device
    tune run lora_dpo_single_device --config llama3_1/8B_lora_dpo_single_device

    # run on two GPUs
    tune run --nproc_per_node 2 lora_dpo_distributed --config llama3_1/8B_lora_dpo

It's easy to get started with this recipe using your dataset of choice, including custom local datasets
and datasets from Hugging Face. Check out our primer on :ref:`preference datasets <preference_dataset_usage_label>` to
see how to do this; a short example of the expected data shape follows.
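
For intuition, each preference example pairs a single prompt with a preferred ("chosen") and a dispreferred ("rejected") completion. Here is a minimal sketch of what one record might look like; the field names below are illustrative, not the exact schema used by torchtune's dataset builders:

.. code-block:: python

    # One illustrative preference record: during DPO the model is trained to
    # assign higher likelihood to the "chosen" completion than to the "rejected" one.
    sample = {
        "prompt": "Write a short, polite reply declining a meeting invitation.",
        "chosen": "Thanks for the invite! Unfortunately I can't make it this week -- could we find another time?",
        "rejected": "No.",
    }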

For this recipe we include different DPO-style losses:

* :class:`Direct Preference Optimization <torchtune.rlhf.loss.DPOLoss>` (DPO) loss [#]_. The DPO loss
increases the log-probabilities of preferred responses relative to rejected ones, whilst using log-probabilities
from a frozen reference model to prevent the policy from degrading during training. Alongside RLHF, this is the most commonly used
alignment technique and is used to train a growing number of state-of-the-art LLMs, e.g. Llama 3.1, Gemma 2, and Qwen2.
This is a good starting point for alignment fine-tuning; a sketch of what both losses compute follows this list.
* :class:`Statistical Rejection Sampling Optimization <torchtune.rlhf.loss.RSOLoss>` (RSO) or "hinge" loss [#]_.
RSO builds on concepts from support vector machines and DPO, applying a margin-based approach that penalizes
low-quality responses while ensuring a significant gap between chosen and un-chosen log probabilities.
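
For intuition, here is a hand-rolled sketch of what the two losses compute from per-sequence log-probabilities. This is an illustration only, not the torchtune implementation; the tensor names and the ``gamma`` default below are assumptions:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def dpo_and_rso_losses(
        policy_chosen_logps: torch.Tensor,    # log-prob of chosen responses under the policy being trained
        policy_rejected_logps: torch.Tensor,  # log-prob of rejected responses under the policy being trained
        ref_chosen_logps: torch.Tensor,       # the same quantities under the frozen reference model
        ref_rejected_logps: torch.Tensor,
        beta: float = 0.1,                    # illustrative scaling, matching the config default
        gamma: float = 0.5,                   # assumed hinge scaling for the RSO-style loss
    ):
        # Both losses act on the chosen-vs-rejected margin of the policy/reference log-ratios.
        logits = (policy_chosen_logps - policy_rejected_logps) - (
            ref_chosen_logps - ref_rejected_logps
        )
        # DPO: negative log-sigmoid of the scaled margin (Rafailov et al., 2024).
        dpo_loss = (-F.logsigmoid(beta * logits)).mean()
        # RSO / "hinge": SVM-style margin penalty on the same quantity (Liu et al., 2023).
        rso_loss = torch.relu(1 - gamma * logits).mean()
        return dpo_loss, rso_loss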

To use any of these, set the ``loss`` entry in your config, or override it through the :ref:`cli_label`:

.. code-block:: bash

    tune run lora_dpo_single_device --config llama2/7B_lora_dpo_single_device \
    loss._component_=torchtune.rlhf.loss.RSOLoss \
    loss.gamma=0.5

.. todo (@SalmanMohammadi) point to an example repo for SimPO

For a deeper understanding of the different levers you can pull when using this recipe,
see our documentation for the different PEFT training paradigms we support:

* :ref:`glossary_lora`
* :ref:`glossary_qlora`
* :ref:`glossary_dora`

Many of our other memory optimization features can be used in this recipe; you can learn more about them in our :ref:`memory optimization overview <memory_optimization_overview_label>`.

.. rubric:: References:

.. [#] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S. and Finn, C., 2024.
Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
.. [#] Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M., Liu, P.J. and Liu, J., 2023.
Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657.
2 changes: 1 addition & 1 deletion docs/source/recipes/lora_finetune_single_device.rst
@@ -8,7 +8,7 @@ This recipe supports finetuning on next-token prediction tasks using parameter e
such as :ref:`glossary_lora` and :ref:`glossary_qlora`. These techniques
significantly reduce memory consumption during training whilst still maintaining competitive performance.

We provide configs which you can get up and running quickly. Here is an example with llama 3.1 8B:
We provide configs which you can get up and running quickly. Here is an example with Llama 3.1 8B:

.. note::

2 changes: 1 addition & 1 deletion docs/source/recipes/recipes_overview.rst
@@ -28,7 +28,7 @@ Our recipes include:
* Single-device full fine-tuning
* Distributed full fine-tuning
* Distributed LoRA fine-tuning
* Direct Preference Optimization (DPO)
* :ref:`Direct Preference Optimization (DPO) <dpo_recipe_label>`
* Proximal Policy Optimization (PPO)
* :ref:`Distributed Quantization-Aware Training (QAT)<qat_distributed_recipe_label>`.

3 changes: 3 additions & 0 deletions recipes/configs/llama2/7B_lora_dpo.yaml
@@ -83,4 +83,7 @@ log_peak_memory_stats: True
# Environment
device: cuda
dtype: bf16

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
3 changes: 3 additions & 0 deletions recipes/configs/llama2/7B_lora_dpo_single_device.yaml
@@ -80,4 +80,7 @@ log_peak_memory_stats: True
# Environment
device: cuda
dtype: bf16

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
1 change: 1 addition & 0 deletions recipes/configs/llama3/70B_lora.yaml
@@ -105,6 +105,7 @@ device: cuda
dtype: bf16
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
# custom_sharded_layers: ['tok_embeddings', 'output'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.

# Profiler (disabled)
profiler:
1 change: 1 addition & 0 deletions recipes/configs/llama3_1/70B_lora.yaml
@@ -104,6 +104,7 @@ device: cuda
dtype: bf16
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
# custom_sharded_layers: ['tok_embeddings', 'output'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.

# Profiler (disabled)
profiler:
92 changes: 92 additions & 0 deletions recipes/configs/llama3_1/8B_lora_dpo.yaml
@@ -0,0 +1,92 @@
# Config for multi-device LoRA DPO alignment in lora_dpo_distributed.py
# using a Llama 3.1 8B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
#
# To launch on 2 devices, run the following command from root:
# tune run --nnodes 1 --nproc_per_node 2 lora_dpo_distributed --config llama3_1/8B_lora_dpo
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nnodes 1 --nproc_per_node 2 lora_dpo_distributed --config llama3_1/8B_lora_dpo checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device LoRA DPO alignment please use llama3_1/8B_lora_dpo_single_device

# Model Arguments
model:
_component_: torchtune.models.llama3_1.lora_llama3_1_8b
lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
apply_lora_to_mlp: True
apply_lora_to_output: False
lora_rank: 8 # higher increases accuracy and memory
lora_alpha: 16 # usually alpha=2*rank
lora_dropout: 0.0

# Tokenizer
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
max_seq_len: null

checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
checkpoint_files: [
model-00001-of-00004.safetensors,
model-00002-of-00004.safetensors,
model-00003-of-00004.safetensors,
model-00004-of-00004.safetensors
]
recipe_checkpoint: null
output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
model_type: LLAMA3
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
_component_: torchtune.datasets.stack_exchange_paired_dataset
seed: null
shuffle: True
batch_size: 4

# Optimizer and Scheduler
optimizer:
_component_: torch.optim.AdamW
fused: True
weight_decay: 0.05
lr: 5e-4
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 100

loss:
_component_: torchtune.rlhf.loss.DPOLoss
beta: 0.1
label_smoothing: 0

# Training
epochs: 1
max_steps_per_epoch: 1000
gradient_accumulation_steps: 8 # Use to increase virtual batch size
compile: False # pytorch compile, set to true for better perf/memory

# Logging
output_dir: /tmp/lora_dpo_output/
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ${output_dir}
log_every_n_steps: 1
log_peak_memory_stats: True

# Environment
device: cuda
dtype: bf16

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
89 changes: 89 additions & 0 deletions recipes/configs/llama3_1/8B_lora_dpo_single_device.yaml
@@ -0,0 +1,89 @@
# Config for single device LoRA DPO alignment in lora_dpo_single_device.py
# using a Llama 3.1 8B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
#
# To launch on a single device, run the following command from root:
# tune run lora_dpo_single_device --config llama3_1/8B_lora_dpo_single_device
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run lora_dpo_single_device --config llama3_1/8B_lora_dpo_single_device checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works only for training on a single device.

# Model Arguments
model:
_component_: torchtune.models.llama3_1.lora_llama3_1_8b
lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
apply_lora_to_mlp: True
apply_lora_to_output: False
lora_rank: 8 # higher increases accuracy and memory
lora_alpha: 16 # usually alpha=2*rank
lora_dropout: 0.0

# Tokenizer
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
max_seq_len: null

checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
checkpoint_files: [
model-00001-of-00004.safetensors,
model-00002-of-00004.safetensors,
model-00003-of-00004.safetensors,
model-00004-of-00004.safetensors
]
recipe_checkpoint: null
output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
model_type: LLAMA3
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
_component_: torchtune.datasets.stack_exchange_paired_dataset
seed: null
shuffle: True
batch_size: 4

# Optimizer and Scheduler
optimizer:
_component_: torch.optim.AdamW
fused: True
weight_decay: 0.05
lr: 5e-4
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 100

loss:
_component_: torchtune.rlhf.loss.DPOLoss

# Training
epochs: 1
max_steps_per_epoch: 1000
gradient_accumulation_steps: 8 # Use to increase virtual batch size
compile: False # pytorch compile, set to true for better perf/memory

# Logging
output_dir: /tmp/lora_dpo_output/
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ${output_dir}
log_every_n_steps: 1
log_peak_memory_stats: True

# Environment
device: cuda
dtype: bf16

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
1 change: 1 addition & 0 deletions recipes/configs/qwen2_5/32B_lora.yaml
@@ -97,6 +97,7 @@ device: cuda
dtype: bf16
enable_activation_checkpointing: False # True reduces memory
enable_activation_offloading: False # True reduces memory
# custom_sharded_layers: ['tok_embeddings'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.

# Show case the usage of pytorch profiler
# Set enabled to False as it's only needed for debugging training
1 change: 1 addition & 0 deletions recipes/configs/qwen2_5/72B_lora.yaml
@@ -117,6 +117,7 @@ device: cuda
dtype: bf16
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
# custom_sharded_layers: ['tok_embeddings'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.

# Show case the usage of pytorch profiler
# Set enabled to False as it's only needed for debugging training
9 changes: 0 additions & 9 deletions recipes/full_finetune_distributed.py
@@ -122,15 +122,6 @@ def __init__(self, cfg: DictConfig) -> None:
"full fp16 training is not supported with this recipe. Please use bf16 or fp32 instead."
)

if (
cfg.get("fsdp_cpu_offload", False)
and cfg.optimizer.get("fused", False)
and not utils.torch_version_ge("2.4.0")
):
raise RuntimeError(
"Using fused optimizer on CPU is only supported in PyTorch nightly."
)

# logging attributes
self._output_dir = cfg.output_dir
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)