diff --git a/README.md b/README.md index 044c9cd954..34c674ab4f 100644 --- a/README.md +++ b/README.md @@ -1,81 +1,95 @@ -# Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to >100B Parameters, scaling from 1 GPU to 100s - - -- [Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-reinforcer-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s) - - [Features](#features) - - [Installation](#installation) - - [Quick start](#quick-start) - - [SFT](#sft) - - [Single Node](#single-node) - - [Multi-node](#multi-node) - - [GRPO](#grpo) - - [Single Node](#single-node-1) - - [Multi-node](#multi-node-1) - - [Cluster Start](#cluster-start) - -**Nemo-Reinforcer** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters. +# NeMo RL: Scalable and Efficient Post-Training on NVIDIA GPUs + +**NeMo RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters. What you can expect: -- **Seamless integration with HuggingFace** for ease of use, allowing users to leverage a wide range of pre-trained models and tools. -- **High-performance implementation with Megatron core**, supporting various parallelism techniques for large models (>100B) and large context lengths. +- **Seamless integration with Hugging Face** for ease of use, allowing users to leverage a wide range of pre-trained models and tools. +- **High-performance implementation with Megatron Core**, supporting various parallelism techniques for large models (>100B) and large context lengths. - **Efficient resource management using Ray**, enabling scalable and flexible deployment across different hardware configurations. 
- **Flexibility** with a modular design that allows easy integration and customization. - **Comprehensive documentation** that is both detailed and user-friendly, with practical examples. -## Features - -_✅ Available now | 🔜 Coming in v0.2_ - -- ✅ **Fast Generation** - vLLM backend for optimized inference -- ✅ **HuggingFace Integration** - Works with 1-8B models (Qwen1.5, Llama) -- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure -- ✅ **Environment Support** - Support for multi-environment training. -- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning) -- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state) -- 🔜 **Larger Model Support** - Native PyTorch support for models up to 70B parameters -- 🔜 **Advanced Parallelism** - FSDP2, TP, SP, and sequence packing for efficient training -- 🔜 **Environment Isolation** - Dependency isolation between components -- 🔜 **DPO Algorithm** - Direct Preference Optimization for alignment - -## Installation +## Table of Contents + +- [Key Features](#key-features) +- [Install NeMo RL](#install-nemo-rl) +- [Quickstart](#quickstart) +- [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft) + - [Run Single Node SFT](#run-single-node-sft) + - [Run Multi-node SFT](#run-multi-node-sft) + - [Group Relative Policy Optimization (GRPO)](#group-relative-policy-optimization-grpo) + - [Run Single Node GRPO](#run-single-node-grpo) + - [Run Multi-node GRPO](#run-multi-node-grpo) +- [Set Up Clusters](#set-up-clusters) +- [Contributing](#contributing) +- [Licenses](#licenses) + +## Key Features + +_✅ Available Now | 🔜 Coming Soon (v0.2)_ + +- ✅ **Fast Generation:** Utilizes vLLM backend for optimized inference during evaluation and rollout. +- ✅ **Hugging Face Integration:** Seamlessly integrates with Hugging Face Transformers, supporting a wide range of pre-trained models (e.g., Qwen1.5, Llama models up to 8B parameters). 
+- ✅ **Scalable Distributed Training:** Leverages Fully Sharded Data Parallelism (FSDP) and a Ray-based infrastructure for efficient multi-GPU and multi-node training. +- ✅ **Multi-Environment Support:** Enables training across diverse environments and datasets. +- ✅ **Reinforcement Learning Algorithms:** Implements Group Relative Policy Optimization (GRPO) for effective preference alignment. +- ✅ **Supervised Fine-Tuning (SFT):** Supports standard supervised fine-tuning for instruction following and task adaptation. +- ✅ **Worker Isolation:** Ensures process isolation between RL actors, preventing unintended global state interference. +- 🔜 **Larger Model Support:** Native PyTorch support for models up to 70B parameters. +- 🔜 **Advanced Parallelism Techniques:** Implementation of FSDP2, Tensor Parallelism (TP), Sequence Parallelism (SP), and sequence packing for enhanced training efficiency. +- 🔜 **Environment Isolation:** Provides dependency isolation between different components of the training pipeline. +- 🔜 **Direct Preference Optimization (DPO):** Integration of the Direct Preference Optimization algorithm for more direct preference learning. + +## Install NeMo RL + +Use of the `uv` Python package manager is required for setup. Python 3.12 or a compatible version is also required. ```sh -# For faster setup we use `uv` pip install uv -# Specify a virtual env that uses Python 3.12 -uv venv -p python3.12.9 .venv -# Install NeMo-Reinforcer with vllm +# Install uv pip install uv + +# Create a virtual environment with Python 3.12 +uv venv -p python3.12 .venv + +# Activate the virtual environment (optional, but recommended for consistency) +# source .venv/bin/activate # On Linux/macOS +# .venv\Scripts\activate # On Windows + +# Install NeMo RL with vLLM support uv pip install -e .[vllm] -# Install NeMo-Reinforcer with dev/test dependencies -uv pip install -e '.[dev,test]' -# Use uv run to launch any runs. 
-# Note that it is recommended to not activate the venv and instead use `uv run` since -# it ensures consistent environment usage across different shells and sessions. +# To install with development and testing dependencies: +# uv pip install -e '.[dev,test]' + +# Running scripts with `uv run` ensures a consistent environment. # Example: uv run python examples/run_grpo_math.py ``` -## Quick start +**Important Notes:** -**Reminder**: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. +- Use `uv run` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions. +- Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware. -### SFT +## Quickstart -We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). +Before running any experiments, remember to set your `HF_HOME` environment variable and your `WANDB_API_KEY` if you intend to use Weights & Biases for logging. For accessing Llama models, you might also need to log in using `huggingface-cli login`. -#### Single Node +## Supervised Fine-Tuning (SFT) -The default SFT experiment is configured to run on a single GPU. To launch the experiment, +We provide an example SFT experiment using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). + +#### Run Single Node SFT + +The default SFT configuration is set to run on a single GPU. To start the experiment: ```sh uv run python examples/run_sft.py ``` -This trains `Llama3.2-1B` on one GPU using the SQUAD dataset. +This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using 1 GPU. -If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration. 
We also switch to an 8B Llama base model and increase the batch size: +To use multiple GPUs on a single node, you can modify the cluster configuration. This also lets you switch to a larger base model (such as an 8B Llama) and increase the batch size: ```sh uv run python examples/run_sft.py \ @@ -85,9 +99,9 @@ uv run python examples/run_sft.py \ cluster.gpus_per_node=8 ``` -Refer to [sft.yaml](examples/configs/sft.yaml) for a full list of parameters that can be overridden. +Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden. -#### Multi-node +#### Run Multi-node SFT For distributed training across multiple nodes: @@ -97,7 +111,7 @@ export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache ``` ```sh -# Run from the root of NeMo-Reinforcer repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=2 # Add a timestamp to make each job name unique TIMESTAMP=$(date +%Y%m%d_%H%M%S) @@ -118,11 +132,11 @@ sbatch \ ray.sub ``` -### GRPO +### Group Relative Policy Optimization (GRPO) -We have a reference GRPO experiment config set up trained for math benchmarks using the [OpenInstructMath2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset. +We provide a reference GRPO experiment configuration for training on math benchmarks using the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset. -#### Single Node +#### Run Single Node GRPO To run GRPO on a single GPU for `Llama-3.2-1B-Instruct`: @@ -131,7 +145,7 @@ To run GRPO on a single GPU for `Llama-3.2-1B-Instruct`: uv run python examples/run_grpo_math.py ``` -By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 gpus, +By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. 
For example, to run on 8 GPUs, ```sh # Run the GRPO math example using a 1B parameter model using 8 GPUs @@ -150,10 +164,10 @@ uv run python examples/run_grpo_math.py \ logger.num_val_samples_to_print=10 \ ``` -#### Multi-node +#### Run Multi-node GRPO ```sh -# Run from the root of NeMo-Reinforcer repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=2 # Add a timestamp to make each job name unique TIMESTAMP=$(date +%Y%m%d_%H%M%S) @@ -174,6 +188,16 @@ sbatch \ ray.sub ``` -## Cluster Start +## Set Up Clusters + +For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated [Cluster Start](docs/cluster.md) documentation. + +## Contributing + +We welcome contributions to NeMo RL! Please see our [Contributing Guidelines](https://github.com/NVIDIA/reinforcer/blob/main/CONTRIBUTING.md) for more information on how to get involved. + +## Licenses + +NVIDIA NeMo RL is licensed under the [Apache License 2.0](https://github.com/NVIDIA/reinforcer/blob/main/LICENSE). -Please visit [Cluster Start](docs/cluster.md) for how to get started on Slurm or Kubernetes. +NeMo is licensed under the [NVIDIA AI PRODUCT AGREEMENT](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). By pulling and using the container, you accept the terms and conditions of this license. \ No newline at end of file diff --git a/docs/adding_new_models.md b/docs/adding_new_models.md index c39642ea69..764b204599 100644 --- a/docs/adding_new_models.md +++ b/docs/adding_new_models.md @@ -1,10 +1,10 @@ -# Adding New Models +# Add New Models -This guide outlines how to integrate and validate a new model within **NeMo-Reinforcer**. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines. +This guide outlines how to integrate and validate a new model within NeMo RL. 
Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines. ## Importance of Log Probability Consistency in Training and Inference -In on-policy RL, we sample tokens (actions) from the latest version of the policy, meaning the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation. +In on-policy RL, we sample tokens (actions) from the latest version of the policy. This means the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation. As an example, we would see errors in naive KL estimation: @@ -24,33 +24,33 @@ where samples are drawn as $x \sim \pi_{\text{sampling-framework}}$ as a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the sampling framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{sampling-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient. -## Understanding Discrepancies Between Backends +## Understand Discrepancies Between Backends When validating models across different backends, you may encounter discrepancies in log probabilities. 
These differences can stem from various sources with effects ranging from negligible to significant: - **Numerical precision differences**: Training and inference backends may differ in precision formats (FP32, FP16, BF16, FP8). - - Training may use mixed precision while the inference backend may not - - High-precision training with FP8 inference may not be numerically stable for certain models - - Differences can occur at the layer level, with some layers in FP32 while others use lower precision + - Training may use mixed precision, while the inference backend may not. + - High-precision training with FP8 inference may not be numerically stable for certain models. + - Differences can occur at the layer level, with some layers in FP32, while others use lower precision. - **Implementation variations**: Subtle differences in how layer implementations like softmax, layer normalization, or attention mechanisms are implemented. - - Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends - - Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences - - Softmax in training frameworks may be calculated differently than in inference backends for numerical stability + - Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends. + - Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences. + - Softmax in training frameworks may be calculated differently than in inference backends for numerical stability. - **KV/Prefill cache handling**: Differences in how key-value/prefill caches are managed during autoregressive generation. - - In some cases, disabling the inference backend cache can resolve discrepancies + - In some cases, disabling the inference backend cache can resolve discrepancies. 
-- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations +- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations. -- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`) +- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`). - **Prefill/Decoding kernel mismatch**: Different kernels for prefill and decoding phases may produce different log probabilities. - - Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels + - Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels. -- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect - - If weights are reshaped or reordered incorrectly, generations tend to be very wrong - - In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses +- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect. + - If weights are reshaped or reordered incorrectly, generations tend to be very wrong. + - In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses. 
- **Batch size**: In some cases, `batch_size>1` may produce larger errors than `batch_size=1` @@ -69,7 +69,7 @@ When validating Hugging Face-based models, perform the following checks: Ensure the generation log probabilities from inference backends like **vLLM** match those computed by HuggingFace. This comparison helps diagnose potential mismatches. - **Test parallelism** - Verify consistency with other parallelism settings. + Verify consistency with other parallelism settings. - **Variance** Repeat tests multiple times (e.g., 10 runs) to confirm that behavior is deterministic or within acceptable variance. @@ -96,7 +96,7 @@ When validating Hugging Face-based models, perform the following checks: ### Additional Validation - **Compare Megatron outputs** - Ensure the Megatron forward pass aligns with HuggingFace and the generation log probabilities from inference backends like **vLLM**. + Ensure the Megatron forward pass aligns with Hugging Face and the generation log probabilities from inference backends like **vLLM**. - **Parallel settings** Match the same parallelism configurations used for the HuggingFace-based tests. @@ -120,4 +120,4 @@ When validating your model, you should analyze the results across different conf --- -By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-Reinforcer**'s requirements. \ No newline at end of file +By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets the requirements of NeMo RL. 
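The log-probability comparisons described above can be sketched numerically. Below is a minimal, hypothetical sketch (the function name and sample values are illustrative, not part of NeMo RL's API): given per-token log probabilities for the same sampled tokens from the training framework and the inference backend, it computes the mean absolute log-probability difference and the naive KL estimate discussed earlier, both of which should be near zero for well-matched backends.

```python
def logprob_consistency(train_logprobs, infer_logprobs):
    """Compare per-token log probabilities for the same sampled tokens.

    Returns (mean absolute log-prob difference, naive KL estimate), where
    the naive KL is the mean of log p_infer(x) - log p_train(x) over
    tokens x sampled from the inference backend.
    """
    assert len(train_logprobs) == len(infer_logprobs) > 0
    n = len(train_logprobs)
    mean_abs = sum(abs(t - i) for t, i in zip(train_logprobs, infer_logprobs)) / n
    naive_kl = sum(i - t for t, i in zip(train_logprobs, infer_logprobs)) / n
    return mean_abs, naive_kl

# Illustrative values: near-identical backends give metrics close to zero.
train = [-0.10, -2.30, -0.51]
infer = [-0.11, -2.28, -0.52]
mean_abs, kl = logprob_consistency(train, infer)
print(f"mean |dlogp| = {mean_abs:.4f}, naive KL = {kl:.4f}")
```

In practice you would aggregate these metrics over many sequences and repeated runs, per the variance check above.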
\ No newline at end of file diff --git a/docs/cluster.md b/docs/cluster.md index d683de9ac2..70a88a4eb7 100644 --- a/docs/cluster.md +++ b/docs/cluster.md @@ -1,12 +1,10 @@ -# Cluster start +# Set Up Clusters -- [Cluster start](#cluster-start) - - [Slurm](#slurm) - - [Batched Job Submission](#batched-job-submission) - - [Interactive Launching](#interactive-launching) - - [Kubernetes](#kubernetes) +This guide explains how to initialize NeMo RL clusters. -## Slurm +## Slurm (Batched and Interactive) + +The following sections explain how to use Slurm for batched job submission and interactive job launching. ### Batched Job Submission @@ -35,17 +33,16 @@ Which will print the `SLURM_JOB_ID`: ```text Submitted batch job 1980204 ``` -Make note of the the job submission number. Once the job begins you can track it's process in the driver logs which you can `tail`: +Make note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can `tail`: ```sh tail -f 1980204-logs/ray-driver.log ``` :::{note} `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` and is mounted to head and worker nodes -to ensure fast `venv` creation. +to ensure fast `venv` creation. -If you would like to override it to somewhere else all head/worker nodes can access, you may set it -via: +You can override the default location by setting it to a path accessible by all head and worker nodes via: ```sh ... @@ -58,10 +55,10 @@ sbatch ... \ ### Interactive Launching :::{tip} -A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the SLURM job queue. This means during debugging sessions, you can avoid submitting a new `sbatch` command each time and instead debug and re-submit your Reinforcer job directly from the interactive session. 
+A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the Slurm job queue. This means that during debugging sessions, you can avoid submitting a new `sbatch` command each time. Instead, you can debug and re-submit your job directly from the interactive session. ::: -To run interactively, launch the same command as the [Batched Job Submission](#batched-job-submission) except omit the `COMMAND` line: +To run interactively, launch the same command as [Batched Job Submission](#batched-job-submission), but omit the `COMMAND` line: ```sh # Run from the root of NeMo-Reinforcer repo NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0) @@ -82,12 +79,12 @@ Which will print the `SLURM_JOB_ID`: ```text Submitted batch job 1980204 ``` -Once the ray cluster is up, a script should be created to attach to the ray head node, -which you can use launch experiments. +Once the Ray cluster is up, an attach script is created for the Ray head node, +which you can use to launch experiments. ```sh bash 1980204-attach.sh ``` -Now that you are on the head node, you can launch the command like so: +Now that you are on the head node, you can launch the command as follows: ```sh uv venv .venv uv pip install -e . @@ -97,3 +94,5 @@ uv run ./examples/run_grpo_math.py ## Kubernetes TBD + +Instructions on how to use Kubernetes to run your jobs will be added in a future release. \ No newline at end of file diff --git a/docs/design_docs/chat_datasets.md b/docs/design_docs/chat_datasets.md index 43e2801fdc..7fe570b99a 100644 --- a/docs/design_docs/chat_datasets.md +++ b/docs/design_docs/chat_datasets.md @@ -1,8 +1,10 @@ # Data Format -## HuggingFace Chat Datasets +This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information. 
-HuggingFace chat datasets are expected to have the following structure: Each example in the dataset should be a dictionary with a `messages` key. `messages` should be a list of dictionaries, each with a `role` and `content` key. `role` is typically one of `system`, `user`, and `assistant`. For example: +## Hugging Face Chat Datasets + +Hugging Face chat datasets are expected to have the following structure: Each example in the dataset should be a dictionary with a `messages` key. The `messages` value should be a list of dictionaries, each with a `role` and `content` key. The `role` is typically one of the following values: `system`, `user`, or `assistant`. For example: ```json { @@ -23,9 +25,9 @@ HuggingFace chat datasets are expected to have the following structu } ``` -### Chat Templates +## Chat Templates -Formatting the data in this way allows us to take advantage of HuggingFace tokenizers' `apply_chat_template` functionality to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. +Formatting the data in this way allows us to take advantage of the Hugging Face tokenizers' `apply_chat_template` functionality to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [Hugging Face apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. By default, `apply_chat_template` attempts to apply the `chat_template` associated with the tokenizer. However, in some cases, users might want to specify their own chat template. Also, note that many tokenizers do not have associated `chat_template`s, in which case an explicit chat template is required. 
Users can specify an explicit chat template string using Jinja format and can pass that string to `apply_chat_template`. The following is an example using a simple template which prepends a role header to each turn: @@ -58,4 +60,4 @@ assert output == expected_output :hide: ``` -For more details on creating chat templates, refer to the [HuggingFace documentation](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template). \ No newline at end of file +For more details on creating chat templates, refer to the [Hugging Face documentation](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template). \ No newline at end of file diff --git a/docs/design_docs/checkpointing.md b/docs/design_docs/checkpointing.md index 9b9a6f6826..12eb4e35b4 100644 --- a/docs/design_docs/checkpointing.md +++ b/docs/design_docs/checkpointing.md @@ -1,11 +1,14 @@ -# Checkpointing with HuggingFace Models +# Checkpointing with Hugging Face Models ## Checkpoint Format -Reinforcer provides two checkpoint formats for HuggingFace models: Torch distributed and HuggingFace format. Torch distributed is used by default for efficiency, and HuggingFace format is provided for compatibility with HuggingFace's `AutoModel.from_pretrained` API. Note that HuggingFace format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a HuggingFace checkpoint only at the end of training. -There are two ways to get a Reinforcer checkpoint in HuggingFace format. +NeMo RL provides two checkpoint formats for Hugging Face models: Torch distributed and Hugging Face format. Torch distributed is used by default for efficiency, while Hugging Face format is provided for compatibility with Hugging Face's `AutoModel.from_pretrained` API. Note that Hugging Face format checkpoints save only the model weights, excluding the optimizer states. 
It is recommended to use Torch distributed format to save intermediate checkpoints and to save a Hugging Face checkpoint only at the end of training. -1. (Recommended) Save the HuggingFace checkpoint directly by passing `save_hf=True` to `HFPolicy`'s `save_checkpoint`: +## Generate a NeMo RL Checkpoint in Hugging Face Format + +There are two ways to get a NeMo RL checkpoint in Hugging Face format. + +1. (Recommended) Save the Hugging Face checkpoint directly by passing `save_hf=True` to `HFPolicy`'s `save_checkpoint`: ```python policy.save_checkpoint( @@ -15,7 +18,7 @@ There are two ways to get a Reinforcer checkpoint in HuggingFace format. save_hf=True, ) ``` -2. Convert a Torch distributed checkpoint checkpoint to HuggingFace format after training. We provide a conversion script for this purpose. +2. Convert a Torch-distributed checkpoint to Hugging Face format after training. We provide a conversion script for this purpose. ```python uv run examples/convert_dcp_to_hf.py --config= --dcp-ckpt-path= --hf-ckpt-path= diff --git a/docs/design_docs/design_and_philosophy.md b/docs/design_docs/design_and_philosophy.md index e9fead87e8..16398dcd34 100644 --- a/docs/design_docs/design_and_philosophy.md +++ b/docs/design_docs/design_and_philosophy.md @@ -1,54 +1,54 @@ # Design and Philosophy -In this section, we will describe the problems this library aims to solve and motivate/dicuss the Reinforcer APIs. + +This section introduces the NeMo RL APIs and addresses the challenges of online Reinforcement Learning (RL). Coordinating various software components, known as RL Actors, requires effective resource allocation, isolation, coordination, and communication. Our design philosophy focuses on creating modular abstractions for these tasks, ensuring scalability from one GPU to thousands, regardless of the RL Actor's implementation. 
 ## Motivation -Online RL requires coordinating a lot of different pieces of software/models + +Online RL demands the coordination of a wide range of software components and models, for example: - Policy Model/Training Framework -- Fast inference Framework (vLLM, SGLANG, TRT-LLM) +- Fast Inference Framework (vLLM, SGLang, TRT-LLM) - Reward Environments, Critics, etc. We refer to each of these pieces of software as an **RL Actor**. -Fundamentally, we need to be able to do 4 things between these RL Actors: -- Resource them (provide GPUs/CPUs) -- Isolate them - - RL Actors may each set global variables or have conflicting dependencies, so they each need to live in an isolated process environment with configurable dependencies -- Coordinate them (control) -- Communicate between them (data) +Fundamentally, managing these RL Actors requires four key capabilities: +- Resource them (provide GPUs/CPUs). +- Isolate them: RL Actors need isolated process environments with configurable dependencies to avoid global variable or dependency conflicts. +- Coordinate them (control). +- Communicate between them (data). 
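The coordination and communication capabilities above can be sketched conceptually. The following is an illustrative mock only, with hypothetical class and method names: in the real system each RL Actor would run as an isolated Ray worker on resources carved out of a virtual cluster, whereas here they are plain in-process Python objects so the control and data flow are easy to see.

```python
# Conceptual sketch only; all names are illustrative, not NeMo RL's API.

class MockPolicy:
    """Stand-in for a policy served by a fast inference framework."""
    def generate(self, prompts):
        return [p + " <answer>4</answer>" for p in prompts]

class MockRewardEnv:
    """Stand-in for a reward environment."""
    def score(self, generations):
        return [1.0 if "<answer>" in g else 0.0 for g in generations]

def controller_step(policy, env, prompts):
    # Coordination: a single controller calls each RL Actor in turn.
    generations = policy.generate(prompts)
    # Communication: data flows between actors through the controller.
    rewards = env.score(generations)
    return generations, rewards

generations, rewards = controller_step(MockPolicy(), MockRewardEnv(), ["2+2="])
print(rewards)
```

Resourcing and isolation are what the mock omits: in practice each actor lives in its own process with its own dependencies, and only the data crossing the controller boundary is shared.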
+- Communication: Data flows through either a single controller or controller-managed mechanisms like: - NCCL Collectives - Multiprocess Queues -By creating a common interface for these 4 tasks, **RL algorithm code looks the same from 1 GPU to 1000 GPUs and does not care about the implementation of each RL Actor (Megatron, HF, Grad student with pen and paper)** +By creating a common interface for these four tasks, the RL algorithm code can scale seamlessly from 1 to 1000 GPUs and remain independent of the specific RL Actor (such as Megatron, Hugging Face, or abstract components like a grad student with pen and paper). ![actor-wg-worker-vc](../assets/actor-wg-worker-vc.png) ### {py:class}`RayVirtualCluster ` -VirtualCluster provides a basic abstraction on top of Ray Placement Groups that allow you to section off a part of your compute resources for WorkerGroups to run on as though they had their own cluster. They support running just one WorkerGroup on each VirtualCluster, or *colocation*, where multiple WorkerGroups share resources (i.e running policy training(hf) and generation(vllm) on the same GPUs in-turn). +The VirtualCluster abstraction builds upon Ray Placement Groups, allowing you to divide your compute resources so that WorkerGroup instances can run as if they were on their own cluster. This supports two modes: running just one WorkerGroup per VirtualCluster, or *colocation*, where multiple WorkerGroups share resources (for example, running policy training using Hugging Face and generation using vLLM on the same GPUs sequentially). + +Minimally, it has the following core API: -Minimally, it has has the following core API: ```python class RayVirtualCluster: """ Creates a virtual distributed cluster using Ray placement groups. 
This class simplifies distributed training setup by: - - Creating placement groups that represent logical compute nodes - - Allocating GPU and CPU resources for distributed workers - - Managing communication between distributed processes + - Creating placement groups that represent logical compute nodes. + - Allocating GPU and CPU resources for distributed workers. + - Managing communication between distributed processes. - - Bundle: A resource allocation unit (ex: 4 GPUs on a single node) - - Worker: A process that performs computation (model training/inference) - - Node: A physical or virtual machine containing multiple bundles + - Bundle: A resource allocation unit (ex: 4 GPUs on a single node). + - Worker: A process that performs computation (model training/inference). + - Node: A physical or virtual machine containing multiple bundles. """ def __init__(self, bundle_ct_per_node_list: List[int], {other args}): """ @@ -64,12 +64,13 @@ class RayVirtualCluster: This represents the "virtual cluster" - only nodes that are actually being used. Returns: - List of placement groups that have at least one bundle + List of placement groups that have at least one bundle. """ ``` ### {py:class}`RayWorkerGroup ` -All work is done by "Worker Processes"(Ray Actors) that run on a small unit of resources (usually 1 CPU or 1 CPU+GPU). These workers are managed by *RayWorkerGroup* +All work is done by "Worker Processes" (Ray Actors) that run on a small unit of resources (usually 1 CPU or 1 CPU+GPU). These workers are managed by the *RayWorkerGroup*. + ```python class RayWorkerGroup: """ @@ -77,18 +78,20 @@ class RayWorkerGroup: This class creates and manages Ray actor instances that run on resources allocated by a RayVirtualCluster. It handles: - - Worker creation and placement on specific GPU resources - - Setting up distributed training environment variables (rank, world size, etc.) 
- - Executing methods across all workers in parallel - - Collecting and aggregating results - - Support for tied worker groups where multiple workers process the same data + - Worker creation and placement on specific GPU resources. + - Setting up distributed training environment variables (rank, world size, etc.). + - Executing methods across all workers in parallel. + - Collecting and aggregating results. + - Support for tied worker groups where multiple workers process the same data. """ ``` `RayWorkerGroup` provides functions like `run_all_workers_single_data` and `run_all_workers_multiple_data` to control and communicate to individual worker processes. -### Single-Controller & Execution Diagram -We control the RL Actors using a single-process head controller. Using the aforementioned abstractions, this allows us to represent the main loop of GRPO as though we were working on 1 GPU +### Single-Controller and Execution Diagram + +We control the RL Actors using a single-process head controller. Using the aforementioned abstractions, this allows us to represent the main loop of Group Relative Policy Optimization (GRPO) as though we were working on 1 GPU. 
+ ```python # data processing/transformations between each step omitted def grpo_train( @@ -106,7 +109,7 @@ def grpo_train( logprobs = policy.get_logprobs(generations) reference_logprobs = policy.get_reference_logprobs(generations) - training_data = calculate_grpo_trainnig_data(generations, logprobs, reference_logprobs, rewards) + training_data = calculate_grpo_training_data(generations, logprobs, reference_logprobs, rewards) policy.train(generations, logprobs, reference_logprobs, GRPOLossFn) ``` -For a real implementation of grpo (with valiation, checkpointing, memory movement, and the omitted data processing steps), see [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) +For a complete implementation of GRPO, including validation, checkpointing, memory movement, and the data processing steps not detailed here, see [grpo_train](../../nemo_reinforcer/algorithms/grpo.py). diff --git a/docs/design_docs/generation.md b/docs/design_docs/generation.md index 84f450c7cc..68bb3b63cc 100644 --- a/docs/design_docs/generation.md +++ b/docs/design_docs/generation.md @@ -1,6 +1,6 @@ -# Generation Module +# Token Generation -This doc explains the token generation interface and various backends for the NeMo Reinforcer framework. The generation system is designed with a unified interface that allows different backends (like VLLM, HuggingFace, SGLang, TRT-LLM) to provide token generation capabilities while adhering to the same API. +This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API. 
## Generation Interface @@ -58,7 +58,7 @@ The core of the generation system is defined in `interfaces.py`, which establish pass ``` -A key thing to note about generation backends is that the generation backend takes in tokens and gives out tokens without dealing with the tokenizer. By ensuring that only tokens are communicated we eliminate the possibility of having different tokenizers (different versions/specs etc) for training and generation framework. +A key design principle for generation backends is that they process tokens directly, without involving the tokenizer. By ensuring that only tokens are exchanged, we eliminate the risk of inconsistencies arising from different tokenizer versions or specifications between the training and generation frameworks. ## VLLM Backend @@ -66,29 +66,29 @@ The VLLM backend (`models/generation/vllm.py`) implements the {py:class}`Generat ### VllmGeneration Class -The {py:class}`VllmGeneration ` class is the main implementation of the {py:class}`GenerationInterface ` for VLLM. It: +The {py:class}`VllmGeneration ` class is the main implementation of the {py:class}`GenerationInterface ` for VLLM. It performs the following functions: -1. Sets up VLLM workers in a distributed environment using Ray -2. Manages the lifecycle of these workers (initialization, generation, shutdown) -3. Distributes inputs to workers and collects outputs -4. Handles weight updates and synchronization +1. Sets up VLLM workers in a distributed environment using Ray. +2. Manages the lifecycle of these workers (initialization, generation, shutdown). +3. Distributes inputs to workers and collects outputs. +4. Handles weight updates and synchronization. ### VllmGenerationWorker The {py:class}`VllmGenerationWorker ` is a Ray actor that: -1. Initializes and manages a VLLM model instance -2. Performs the actual generation on a GPU -3. Supports dynamic weight updates through IPC handles -4. Implements sleep/wake mechanisms for efficient resource utilization +1. 
Initializes and manages a VLLM model instance.
+2. Performs the actual generation on a GPU.
+3. Supports dynamic weight updates through IPC handles.
+4. Implements sleep/wake mechanisms for efficient resource utilization.

### Custom VLLM Extensions

The {py:class}`UpdatableVllmInternalWorker ` class in `vllm_backend.py` extends the VLLM worker with additional capabilities:

-1. Reporting device IDs to allow mapping of workers to specific GPUs
-2. Updating weights from IPC handles for efficient weight sharing
-3. Checking if weights have been updated correctly
+1. Reporting device IDs to allow mapping of workers to specific GPUs.
+2. Updating weights from IPC handles for efficient weight sharing.
+3. Checking if weights have been updated correctly.

## Usage Example

@@ -133,13 +133,13 @@
output = generator.generate(input_data, greedy=False)
generator.finish_generation()
```

-## Extending with New Backends
+## Extend with New Backends

To add a new generation backend:

-1. Create a new class that implements {py:class}`GenerationInterface `
-2. Implement the required methods: {py:method}`generate `, {py:method}`prepare_for_generation `, and {py:method}`finish_generation `
-3. Ensure your implementation works with the standard {py:class}`GenerationConfig ` and {py:class}`GenerationDatumSpec ` structures
-4. Register your backend with the system (if needed) to make it accessible
+1. Create a new class that implements {py:class}`GenerationInterface `.
+2. Implement the required methods: {py:meth}`generate `, {py:meth}`prepare_for_generation `, and {py:meth}`finish_generation `.
+3. Ensure your implementation works with the standard {py:class}`GenerationConfig ` and {py:class}`GenerationDatumSpec ` structures.
+4. Register your backend with the system (if needed) to make it accessible.

This modular design allows for easy extension with new backends while maintaining a consistent interface for the rest of the system.
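The four extension steps above can be sketched in a minimal, self-contained form. This is an illustration only: the real {py:class}`GenerationInterface `, `GenerationConfig`, and `GenerationDatumSpec` live in `interfaces.py` and carry more fields than shown, and the `EchoBackend` class here is a hypothetical stand-in for an actual backend.

```python
from abc import ABC, abstractmethod
from typing import Any

# Simplified stand-ins for GenerationConfig / GenerationDatumSpec; the real
# structures carry more fields (padding info, sequence lengths, etc.).
GenerationConfig = dict[str, Any]
GenerationDatumSpec = dict[str, Any]

class GenerationInterface(ABC):
    """Illustrative sketch of the backend contract described above."""

    @abstractmethod
    def generate(self, data: GenerationDatumSpec, greedy: bool = False) -> dict[str, Any]: ...

    @abstractmethod
    def prepare_for_generation(self) -> None: ...

    @abstractmethod
    def finish_generation(self) -> None: ...

class EchoBackend(GenerationInterface):
    """Hypothetical backend that appends a single token id to each input."""

    def __init__(self, config: GenerationConfig):
        self.config = config
        self.ready = False

    def prepare_for_generation(self) -> None:
        self.ready = True  # a real backend would load weights / wake workers here

    def generate(self, data, greedy=False):
        assert self.ready, "call prepare_for_generation() first"
        # Note: no tokenizer anywhere - the interface exchanges token ids only.
        return {"output_ids": [ids + [0] for ids in data["input_ids"]]}

    def finish_generation(self) -> None:
        self.ready = False  # a real backend would free / sleep GPU resources here

backend = EchoBackend({"max_new_tokens": 1})
backend.prepare_for_generation()
out = backend.generate({"input_ids": [[101, 2054]]})
backend.finish_generation()
print(out["output_ids"])  # [[101, 2054, 0]]
```

Because only token ids cross the interface boundary, a backend like this can be swapped in without any tokenizer coordination with the training side.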
diff --git a/docs/design_docs/logger.md b/docs/design_docs/logger.md
index fa81c7c291..6b748a73fa 100644
--- a/docs/design_docs/logger.md
+++ b/docs/design_docs/logger.md
@@ -1,8 +1,10 @@
# Logger

-## Requirements:
+The logger is designed to track key training metrics (including distributed metrics with reductions and timing) and to provide integration with logging backends like WandB and TensorBoard.

-* Tracking distributed metrics with specified reductions (mean, max, etc)
+## Requirements
+
+* Tracking distributed metrics with specified reductions (mean, max, etc.)
* Tracking distributed timing with (usually) 'max' reduction across ranks
* Logging:
  * WandB
@@ -12,7 +14,7 @@
Since there is a single controller, the single process running the main training loop will gather the metrics and do the logging.

-To handle multiple logger backends, we will have a {py:class}`LoggerInterface ` interface that the {py:class}`TensorboardLogger ` and {py:class}`WandbLogger ` will implement:
+To handle multiple logger backends, we will have a {py:class}`LoggerInterface ` that the {py:class}`TensorboardLogger ` and {py:class}`WandbLogger ` will implement.

```python
class LoggerInterface(ABC):
@@ -29,7 +31,7 @@ class LoggerInterface(ABC):
        pass
```

-A {py:class}`Logger ` wrapper class will also implement {py:class}`LoggerInterface ` and will contain a list of loggers it delegates to when writing logs. This will be the main class the user uses in the training loop. Usage example:
+A {py:class}`Logger ` wrapper class will implement {py:class}`LoggerInterface ` and maintain a list of loggers to which it delegates logging tasks. This class will serve as the primary logging interface for users within the training loop.
For example: ```python # Initialize logger with both wandb and tensorboard enabled @@ -57,7 +59,7 @@ logger.log_metrics({ ## Validation Pretty Logging -The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter: +The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter. ```python logger: @@ -68,9 +70,9 @@ logger: When `num_val_samples_to_print` is set to a value greater than 0, the logger will generate well-formatted text outputs for the specified number of validation samples. This is particularly useful for: -1. Quickly inspecting model generation quality during training -2. Comparing inputs and outputs side-by-side -3. Tracking validation sample performance over time +1. Quickly inspecting model generation quality during training. +2. Comparing inputs and outputs side-by-side. +3. Tracking validation sample performance over time. ### Example Output @@ -80,11 +82,11 @@ When enabled, the pretty logging will generate formatted text similar to: ## GPU Metric Logging -Reinforcer monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, Reinforcer directly polls GPU memory and utilization data and logs them to TensorBoard and/or Weights & Biases. +NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard and/or WandB. 
-This approach allows us to offer the same GPU metric tracking on all loggers (not just wandb) and simplifies the implementation greatly.
+This approach allows us to offer the same GPU metric tracking on all loggers (not just WandB) and simplifies the implementation greatly.

-This feature is enabled with the `monitor_gpus` configuration parameter and the frequency of collection and flushing to the loggers is controlled by `gpu_collection_interval` and `gpu_flush_interval` (both in seconds), respectively:
+This feature is enabled with the `monitor_gpus` configuration parameter. The frequency of data collection and flushing to the loggers is controlled by the `gpu_collection_interval` and `gpu_flush_interval` parameters, both specified in seconds.

```python
logger:
@@ -97,12 +99,12 @@
```

:::{note}
-While monitoring through the remote workers is possible, it requires some delicate implementation details to make sure:
-* sending logs back to driver does not incur a large overhead
-* metrics are easily interpretable since we may be double counting due to colocated workers
-* workers gracefully flush their logs in the event of failure
-* the logging is the same for tensorboard and wandb
-* some workers which spawn other workers correctly report the total usage of the grandchild worker
-
-These reasons lead us to the simple implementation of collecting on the driver
-:::
+While it is feasible to monitor using remote workers, the implementation requires careful attention to details to ensure:
+* Logs sent back to the driver do not introduce significant overhead.
+* Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
+* Workers can gracefully flush their logs in case of failure.
+* Logging behaves consistently across TensorBoard and WandB.
+* Workers that spawn other workers accurately report the total resource usage of any grandchild workers.
+ +Due to these complexities, we opted for a simpler approach: collecting metrics directly on the driver. +::: \ No newline at end of file diff --git a/docs/design_docs/padding.md b/docs/design_docs/padding.md index d5949cf3b5..c6d50715ff 100644 --- a/docs/design_docs/padding.md +++ b/docs/design_docs/padding.md @@ -1,12 +1,12 @@ -# Padding in NeMo Reinforcer +# Padding in NeMo RL ## Overview -This document explains padding in NeMo Reinforcer and why consistent padding is critical for the framework. +This document explains padding in NeMo RL and why consistent padding is critical for the framework. ## Padding Approach -NeMo Reinforcer uses **right padding** for all tensor operations, where padding tokens are added to the right/end of sequences: +NeMo RL uses **right padding** for all tensor operations, where padding tokens are added to the right/end of sequences: ``` [101, 2054, 2003, 0, 0] # Length 3 @@ -15,9 +15,9 @@ NeMo Reinforcer uses **right padding** for all tensor operations, where padding ``` This approach: -1. **Naturally aligns with LLM processing**: Tokens are processed from left to right -2. **Keeps meaningful tokens contiguous**: All valid tokens appear at the beginning of tensors -3. **Simplifies indexing and operations**: Valid token boundaries are easily defined with a single length value +1. **Naturally aligns with LLM processing**: Tokens are processed from left to right. +2. **Keeps meaningful tokens contiguous**: All valid tokens appear at the beginning of tensors. +3. **Simplifies indexing and operations**: Valid token boundaries are easily defined with a single length value. ## Right-Padded Generation Example @@ -35,9 +35,9 @@ Corresponding logprobs: |-- zeros for input --| |- gen logprobs -| |pad| ``` -## Verifying Right Padding +## Verify Right Padding -NeMo Reinforcer provides utilities to verify correct padding: +NeMo RL provides utilities to verify correct padding. 
For example: ```{testcode} import torch @@ -79,20 +79,20 @@ if not is_right_padded: ``` The {py:class}`verify_right_padding() ` function checks that: -1. All padding (zeros or padding token provided by the user) appears after valid tokens -2. The padding starts at the position specified by the length tensor +1. All padding (zeros or padding token provided by the user) appears after valid tokens. +2. The padding starts at the position specified by the length tensor. The function automatically detects whether you're passing input or output data: -- For input data: Requires `input_ids` and `input_lengths` fields -- For output data: Requires `output_ids` and either `generation_lengths` or `unpadded_sequence_lengths` +- For input data: Requires `input_ids` and `input_lengths` fields. +- For output data: Requires `output_ids` and either `generation_lengths` or `unpadded_sequence_lengths`. ## Best Practices -1. **Always Use Right Padding**: All components expect this format +1. **Always Use Right Padding**: All components expect this format. -2. **Track Length Tensors**: Include appropriate length tensors with your data +2. **Track Length Tensors**: Include appropriate length tensors with your data. -3. **Verify Padding**: Use {py:class}`verify_right_padding() ` when in doubt +3. **Verify Padding**: Use {py:class}`verify_right_padding() ` when in doubt. -4. **Mask Padding in Operations**: Use lengths to exclude padding tokens from loss calculations +4. **Mask Padding in Operations**: Use lengths to exclude padding tokens from loss calculations. diff --git a/docs/design_docs/uv.md b/docs/design_docs/uv.md index 64976c0b8c..cf929c396a 100644 --- a/docs/design_docs/uv.md +++ b/docs/design_docs/uv.md @@ -1,38 +1,38 @@ -# `uv` in NeMo-Reinforcer +# uv in NeMo RL -Using `uv` for Dependency Management in NeMo-Reinforcer +We recommend using the `uv` Python package installer for managing dependencies in NeMo RL. 
## Overview -`uv` is an incredible tool that simplifies our workflow and is blazingly fast because it's written in Rust. This document outlines why we've adopted `uv` for package management in our repository, particularly for NeMo Reinforcer, and how it helps us manage dependencies across Ray clusters. +`uv` is an incredible tool that simplifies our workflow and is blazingly fast because it's written in Rust. This document explains why we've adopted `uv` for package management in our repository, particularly for NeMo RL, and how it helps us manage dependencies across Ray clusters. ## Why `uv`? ### Speed and Efficiency -- Written in Rust, making it significantly faster than traditional Python package managers -- Optimized caching mechanisms that reduce redundant downloads and installations -- Quick environment creation and switching, enabling rapid development cycles +- Written in Rust, making it significantly faster than traditional Python package managers. +- Optimized caching mechanisms that reduce redundant downloads and installations. +- Quick environment creation and switching, enabling rapid development cycles. ### Isolated Environments -- Creates fully isolated Python environments, preventing dependency conflicts between system packages and project-specific packages -- Avoids nuanced dependency situations where a Python script might accidentally use both virtualenv dependencies and system dependencies -- Ensures consistent behavior across different machines and deployment environments +- Creates fully isolated Python environments, preventing dependency conflicts between system packages and project-specific packages. +- Avoids nuanced dependency situations where a Python script might accidentally use both virtualenv dependencies and system dependencies. +- Ensures consistent behavior across different machines and deployment environments. 
### Dependency Management in Ray Clusters

-- Enables management of heterogeneous Python environments across a Ray cluster
-- Provides flexibility for each actor (worker) to use the specific Python dependencies it requires
-- Simplifies propagation of environments to worker nodes without manual setup on each node
+- Enables management of heterogeneous Python environments across a Ray cluster.
+- Provides flexibility for each actor (worker) to use the specific Python dependencies it requires.
+- Simplifies propagation of environments to worker nodes without manual setup on each node.

### Container-Free Flexibility

-- Frees us from having to publish many containers for different dependency combinations
-- Allows us to define different [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) and [extras](https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies) and select which ones we need dynamically
-- Reduces infrastructure complexity and maintenance overhead
+- Frees us from having to publish many containers for different dependency combinations.
+- Allows us to define different [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) and [extras](https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies) and select which ones we need dynamically.
+- Reduces infrastructure complexity and maintenance overhead.

-## Implementation in NeMo Reinforcer
+## Implementation in NeMo RL

### Worker Configuration

@@ -61,14 +61,14 @@
If you need a different Python executable configuration, you can override the de

## How It Works

-When a Reinforcer job is started:
+When a NeMo RL job is started:

-1. The driver script creates several {py:class}`RayWorkerGroup `s.
-2. Each worker group will create their workers which are wrapped in a {py:class}`RayWorkerBuilder `
+1. The driver script creates several {py:class}`RayWorkerGroup ` instances.
+2. Each worker group creates its workers, which are wrapped in a {py:class}`RayWorkerBuilder `.
3. Before the worker class is instantiated by the `RayWorkerBuilder`, if (1) `DEFAULT_PY_EXECUTABLE` is defined on the worker class (decorated with `@ray.remote`) and (2) it starts with `uv`; a `venv` is created with all the dependencies it needs and the `runtime_env["py_executable"]` is replaced with the `venv`'s python interpreter. This approach allows a fast start-up and maintains dependency isolation. It also has the added benefit of having all the virtual environments local under `./venvs`.

## Conclusion

-Using `uv` for dependency management in NeMo Reinforcer provides us with a fast, flexible, and reliable way to handle Python dependencies across distributed Ray clusters. It eliminates many of the traditional pain points of dependency management in distributed systems while enabling heterogeneous environments that can be tailored to specific workloads.
+Using `uv` for dependency management in NeMo RL provides us with a fast, flexible, and reliable way to handle Python dependencies across distributed Ray clusters. It eliminates many of the traditional pain points of dependency management in distributed systems, while enabling heterogeneous environments that can be tailored to specific workloads.

diff --git a/docs/docker.md b/docs/docker.md
index 37548ff282..11f1b0650a 100644
--- a/docs/docker.md
+++ b/docs/docker.md
@@ -1,22 +1,28 @@
-# Building Docker Images
+# Build Docker Images
+
+This guide provides two methods for building Docker images: the base image, ideal for specifying Python dependencies at runtime, and the hermetic image, which includes default dependencies for offline use.
+
+## Base Image

-### Base Image
If you only need the base image with ray + uv, you can build it like so:
+
```sh
cd docker/
docker buildx build --target base -t reinforcer -f Dockerfile ..
``` -This is **our recommendation** as it is a small image and allows you to specify your python dependencies at runtime. +This is **our recommendation** as it is a small image and allows you to specify your Python dependencies at runtime. + +## Hermetic Image + +The Docker image build without a target stage will include all of the default dependencies to get started. -### Hermetic Image -The docker image build without a target stage will include all of the default dependencies to get started. ```sh cd docker/ docker buildx build -t reinforcer -f Dockerfile .. ``` -This image sets up the python environment for you, so you do not have to use `uv` if you don't need +This image sets up the Python environment for you, so you do not have to use `uv` if you don't need any other packages. This image is useful in situations where you may not have network connectivity to re-download packages. diff --git a/docs/documentation.md b/docs/documentation.md index c94239f213..5a434d27d5 100644 --- a/docs/documentation.md +++ b/docs/documentation.md @@ -7,9 +7,9 @@ - [Writing Tests in Python Docstrings](#writing-tests-in-python-docstrings) -## Building +## Build the Documentation -The following sections describe how to set up and build the NeMo-Reinforcer documentation. +The following sections describe how to set up and build the NeMo RL documentation. Switch to the documentation source folder and generate HTML output. @@ -23,9 +23,9 @@ uv run --group docs sphinx-build . _build/html ## Live Building -When writing documentation it can be helpful to serve the documentation and have it update live while you edit. +When writing documentation, it can be helpful to serve the documentation and have it update live while you edit. -To do so run: +To do so, run: ```sh cd docs/ @@ -35,16 +35,16 @@ uv run --group docs sphinx-autobuild . _build/html --port 12345 --host 0.0.0.0 Open a web browser and go to `http://${HOST_WHERE_SPHINX_COMMAND_RUN}:12345` to view the output. 
-## Running Tests in Python Docstrings +## Run Tests in Python Docstrings -We also run tests in our python docstrings. You can run them with: +We also run tests in our Python docstrings. You can run them with: ```sh cd docs/ uv run --group docs sphinx-build -b doctest . _build/doctest ``` -## Writing Tests in Python Docstrings +## Write Tests in Python Docstrings Any code in triple backtick blocks with the `{doctest}` directive will be tested. The format follows Python's doctest module syntax, where `>>>` indicates Python input and the following line shows the expected output. Here's an example: diff --git a/docs/guides/eval.md b/docs/guides/eval.md index 8ac5ab5675..c6750340f3 100644 --- a/docs/guides/eval.md +++ b/docs/guides/eval.md @@ -1,8 +1,13 @@ # Evaluation +This document explains how to use an evaluation script for assessing model capabilities. + ## Start Evaluation +To run the evaluation, you can use the default configuration file or specify a custom one. + ### Start Script + ```sh # To run the evaluation with default config (examples/configs/eval.yaml) uv run python examples/run_eval.py @@ -23,11 +28,12 @@ score=0.10 (3.0/30) ============================================================ ``` -## Configuration +## Example Configuration File -An example Evaluation configuration file can be found [here](../../examples/configs/eval.yaml). +You can find an example evaluation configuration file [here](../../examples/configs/eval.yaml). ### Prompt Template Configuration + Always remember to use the same `prompt_file` and `system_prompt_file` that were used during training. For open-source models, we recommend setting `prompt_file=null` and `system_prompt_file=null` to allow them to use their native chat templates. 
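For instance, a native-chat-template setup would set both keys to `null`. The snippet below is a hypothetical excerpt — consult [examples/configs/eval.yaml](../../examples/configs/eval.yaml) for the actual key names and nesting used by the config:

```yaml
# Hypothetical excerpt; key placement may differ in the real eval.yaml.
prompt_file: null         # fall back to the model's native chat template
system_prompt_file: null  # no custom system prompt
```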
diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md
index 6ace84876d..12a7b979f1 100644
--- a/docs/guides/grpo.md
+++ b/docs/guides/grpo.md
@@ -1,37 +1,37 @@
-# An in-depth walkthrough of GRPO in Reinforcer
+# Use GRPO
+
+This document explains how to use Group Relative Policy Optimization (GRPO) within the NeMo RL framework. It includes a quickstart section for launching a GRPO run and detailed instructions on handling data, model training, fast generation, and overall resource flow.

## Quickstart: Launch a GRPO Run

-If you want to get running quickly, the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py) has an example implementation of using GRPO to train a model on math problems. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md).
+To get started quickly, use the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py), which demonstrates how to train a model on math problems using GRPO. You can launch this script locally or via Slurm. For detailed instructions on setting up Ray and launching a job with Slurm, refer to the [cluster documentation](../cluster.md).

We recommend launching the job using `uv`:
+
```bash
uv run examples/run_grpo_math.py --config {overrides}
```
-If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo.yaml)
-**Reminder**: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
+If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo.yaml).
+
+**Reminder**: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). Additionally, perform a `huggingface-cli login` for Llama models.
+
+## Prepare the Data

-## Now, for the details:
+We support training with multiple RL "Environments" simultaneously.
-In this guide, we'll walk through we handle -* Data -* Model training -* Fast generation -* Overall Resource Flow +An [Environment](../../nemo_reinforcer/environments/interfaces.py) is an object that processes a state/action history and returns an updated state and rewards for each step. These environments run as Ray Remote Actors, such as the [MathEnvironment](../../nemo_reinforcer/environments/math_environment.py). -### Data -We support training with multiple RL "Environments" at the same time. +To enable multi-environment training, the system requires the following: -An [Environment](../../nemo_reinforcer/environments/interfaces.py) is an object that accepts a state/action history and returns an update state and rewards for the step. They run as Ray Remote Actors. Example [MathEnvironment](../../nemo_reinforcer/environments/math_environment.py). +* The available reinforcement learning environments. +* The routing of data to the appropriate environments. +* Information on how to format your dataset for processing. -To support this, we need to know: -* What environments you have -* Which data should go to which environments -* How to prepare the data from your dataset into a form we can use +### Common Data Format -#### Common Data Format We define a [DatumSpec](../../nemo_reinforcer/data/interfaces.py) that holds all relevant information for each training example: + ```python class DatumSpec(TypedDict): message_log: LLMMessageLogType @@ -43,9 +43,11 @@ class DatumSpec(TypedDict): __extra__: Any # This allows additional fields of any type ``` -#### Data Processors -We name all distinct "environments your model wants to optimize against" "tasks". So you might define a "math" task or a "code" task. -For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_reinforcer/data/interfaces.py) +### Data Processors + +We refer to each distinct environment your model aims to optimize against as a "task." 
For example, you might define tasks like "math" or "code." + +For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_reinforcer/data/interfaces.py). ```python def my_data_processor( @@ -56,14 +58,19 @@ def my_data_processor( idx: int, ) -> DatumSpec: ``` -We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py) -#### Putting it all together: +We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py). + +## Put It All Together + GRPO expects datasets to have the following form: + ```json {"task_name": "math", } ``` -Then, you can set data up as such: + +Then, you can set the data up as follows: + ```python base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"] tokenizer = AutoTokenizer.from_pretrained(policy_config["model_name"]) @@ -81,15 +88,17 @@ dataset = AllTaskProcessedDataset( max_seq_length=data_config["max_input_seq_length"], ) ``` -Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples. +Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples. + +## Policy Model -### Policy Model We define a [PolicyInterface]() that contains everything you need to train a Policy model. This Policy object holds a [RayWorkerGroup](../../nemo_reinforcer/distributed/worker_groups.py) of SPMD (1 proc/gpu) processes that run HF/MCore, all coordinated by this object so it appears to you like 1 GPU! -### Fast Generation +## Fast Generation + We support vLLM through the [VllmGeneration](../../nemo_reinforcer/models/generation/vllm.py) class right now. The function [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) contains the core GRPO training loop. 
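The environment contract described in the data section above can be sketched as follows. This is a simplified, Ray-free stand-in — the real interface in [environments/interfaces.py](../../nemo_reinforcer/environments/interfaces.py) runs as a Ray remote actor and its exact signatures differ; `ToyMathEnvironment`, its reward scheme, and the `EnvironmentReturn` fields shown here are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical message type: a list of {"role": ..., "content": ...} dicts,
# mirroring the chat-style message logs used throughout these docs.
MessageLog = list[dict[str, str]]

@dataclass
class EnvironmentReturn:
    """Simplified stand-in for the per-step environment result."""
    observations: list[MessageLog]
    rewards: list[float]
    done: list[bool]

class ToyMathEnvironment:
    """Hypothetical environment: reward 1.0 if the last assistant message
    contains the expected answer string, else 0.0."""

    def step(self, message_logs: list[MessageLog], answers: list[str]) -> EnvironmentReturn:
        rewards = []
        for log, answer in zip(message_logs, answers):
            last = log[-1]["content"] if log else ""
            rewards.append(1.0 if answer in last else 0.0)
        return EnvironmentReturn(
            observations=[[{"role": "environment", "content": "done"}] for _ in message_logs],
            rewards=rewards,
            done=[True] * len(message_logs),
        )

env = ToyMathEnvironment()
result = env.step(
    [[{"role": "assistant", "content": "The answer is 4."}],
     [{"role": "assistant", "content": "The answer is 5."}]],
    answers=["4", "7"],
)
print(result.rewards)  # [1.0, 0.0]
```

A real environment would additionally verify answers robustly (e.g. symbolic math checking) and return updated message logs for multi-turn rollouts.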
\ No newline at end of file diff --git a/docs/guides/sft.md b/docs/guides/sft.md index 4d452b109d..23fe351ea6 100644 --- a/docs/guides/sft.md +++ b/docs/guides/sft.md @@ -1,16 +1,20 @@ -# Supervised Fine-tuning in Reinforcer +# Supervised Fine-Tuning in NeMo RL + +This document explains how to perform SFT within NeMo RL. It outlines key operations, including initiating SFT runs, managing experiment configurations using YAML, and integrating custom datasets that conform to the required structure and attributes. ## Launch an SFT Run -The script [examples/run_sft.py](../../examples/run_sft.py) can be used to launch an experiment. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). +The script, [examples/run_sft.py](../../examples/run_sft.py), can be used to launch an experiment. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). Be sure to launch the job using `uv`. The command to launch an SFT job is as follows: + ```bash uv run examples/run_sft.py --config ``` + If not specified, `config` will default to [examples/configs/sft.yaml](../../examples/configs/sft.yaml). -## Configuration +## Example Configuration File Reinforcer allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](../../examples/configs/sft.yaml). @@ -21,15 +25,16 @@ uv run examples/run_sft.py \ cluster.gpus_per_node=1 \ logger.wandb.name="sft-dev-1-gpu" ``` -**Reminder**: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. + +**Reminder**: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). Additionally, perform a `huggingface-cli login` for Llama models. 
## Datasets -SFT datasets in Reinforcer are encapsulated using classes. Each SFT data class is expected to have the following attributes: +SFT datasets in NeMo RL are encapsulated using classes. Each SFT data class is expected to have the following attributes: 1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below. 2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset as well as the `custom_template` for this dataset. More on custom templates below. -SFT datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](../design_docs/chat_datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [data/hf_datasets/squad.py](../../nemo_reinforcer/data/hf_datasets/squad.py) has an example: +SFT datasets are expected to follow the Hugging Face chat format. Refer to the [chat dataset document](../design_docs/chat_datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [data/hf_datasets/squad.py](../../nemo_reinforcer/data/hf_datasets/squad.py) has an example: ```python def format_squad(data): @@ -51,7 +56,7 @@ def format_squad(data): } ``` -Reinforcer SFT uses HuggingFace chat templates to format the individual examples. If you would like to use a custom template, create a string template in [jinja format](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template) and pass it to the dataset's `TaskDataSpec`. For example, +To run SFT with NeMo RL, you must use Hugging Face chat templates to format the individual examples. 
If you would like to use a custom template, create a string template in [jinja format](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template) and pass it to the dataset's `TaskDataSpec`. For example: ```python custom_template = ( @@ -63,7 +68,7 @@ task_spec = TaskDataSpec( ) ``` -By default, NeMo-Reinforcer has support for `Squad` and `OpenAssistant` datasets. Both of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk. +By default, NeMo RL has support for `Squad` and `OpenAssistant` datasets. Both of these datasets are downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk. Adding a new dataset is a straightforward process. As long as your custom dataset has the `formatted_ds` and `task_spec` attributes described above, it can serve as a drop-in replacement for Squad and OpenAssistant. \ No newline at end of file diff --git a/docs/local_workstation.md b/docs/local_workstation.md index 3e252694a0..b59afaa721 100644 --- a/docs/local_workstation.md +++ b/docs/local_workstation.md @@ -1,6 +1,4 @@ -# Local Workstation - -## Launching Locally +# Run on Your Local Workstation When launching examples locally with `uv`, {py:class}`init_ray() ` will first attempt to connect to an existing cluster. If none is found, it will start a local one and connect to it using all available GPU and CPU resources on your node. 
@@ -17,7 +15,7 @@ In the logs, you will see that Ray has started a local cluster instance, along w INFO:nemo_reinforcer.distributed.virtual_cluster:Started local cluster with: {'node:__internal_head__': 1.0, 'CPU': 24.0, 'object_store_memory': 80448493977.0, 'accelerator_type:RTX': 1.0, 'memory': 177713152615.0, 'GPU': 1.0, 'node:10.0.0.1': 1.0} ``` -To control the GPUs ray uses locally more granularly, please use `CUDA_VISIBLE_DEVICES`: +To have more precise control over the GPUs Ray uses locally, please use `CUDA_VISIBLE_DEVICES`: ```sh # Use the 0th and 3rd indexed GPU (for a total of 2 GPUs) diff --git a/docs/testing.md b/docs/testing.md index 570c9c8696..46cd3d0b9b 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -1,4 +1,6 @@ -# Testing Reinforcer +# Test NeMo RL + +This guide outlines how to test NeMo RL using unit and functional tests, detailing steps for local or Docker-based execution, dependency setup, and metric tracking to ensure effective and reliable testing. ## Unit Tests @@ -15,16 +17,16 @@ uv run bash tests/run_unit.sh ``` :::{note} -Tests can also be run on SLURM with `ray.sub`, but note that some tests will be skipped +Tests can also be run on Slurm with `ray.sub`, but note that some tests will be skipped due to no GPUs being located on the head node. To run the full suite of tests, please launch on a regular GPU allocation. ::: -### Running Unit Tests in a Hermetic Environment +### Run Unit Tests in a Hermetic Environment For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run -in docker with this script: +in Docker with this script: ```sh CONTAINER=... bash tests/run_unit_in_docker.sh @@ -32,9 +34,10 @@ CONTAINER=... bash tests/run_unit_in_docker.sh The required `CONTAINER` can be built by following the instructions in the [docker documentation](docker.md). 
-### Tracking metrics in unit tests +### Track Metrics in Unit Tests Unit tests may also log metrics to a fixture. The fixture is called `tracker` and has the following API: + ```python # Track an arbitrary metric (must be json serializable) tracker.track(metric_name, metric_value) @@ -47,6 +50,7 @@ tracker.get_max_mem() Including the `tracker` fixture also tracks the elapsed time for the test implicitly. Here is an example test: + ```python def test_exponentiate(tracker): starting_mem = tracker.get_max_mem() @@ -61,6 +65,7 @@ def test_exponentiate(tracker): ``` Which would produce this file in `tests/unit/unit_results.json`: + ```json { "exit_status": 0, @@ -97,7 +102,7 @@ jq -r '[.start_time, .git_commit, .metrics["test_hf_ray_policy::test_hf_policy_g ``` ::: -## Functional tests +## Functional Tests :::{important} Functional tests may require multiple GPUs to run. See each script to understand the requirements. @@ -124,11 +129,11 @@ whether they pass or fail. Here is an example: └────────┴────────────────────────────────┴───────────────────┴─────────┘ ``` -### Running Functional Tests in a Hermetic Environment +### Run Functional Tests in a Hermetic Environment For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run -in docker with this script: +in Docker with this script: ```sh CONTAINER=... bash run_functional_in_docker.sh functional/sft.sh
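Returning to the `unit_results.json` sample shown earlier: for consumers without `jq`, the same projection can be done with a few lines of standard-library Python. The record structure below is assumed from that sample (`exit_status`, `start_time`, `git_commit`, and a `metrics` mapping keyed by test name); the metric key and values here are placeholders.

```python
import json

# A record shaped like tests/unit/unit_results.json (fields assumed from
# the sample above; the metric name and values are placeholders).
record_json = """
{
  "exit_status": 0,
  "start_time": "2024-01-01T00:00:00",
  "git_commit": "abc1234",
  "metrics": {"test_example::test_exponentiate": {"time_elapsed": 1.5}}
}
"""

record = json.loads(record_json)

# Equivalent of the jq projection: [.start_time, .git_commit, .metrics[...]]
row = [
    record["start_time"],
    record["git_commit"],
    record["metrics"]["test_example::test_exponentiate"],
]
print(row)
```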