<!-- README.md -->
# NeMo RL: Scalable and Efficient Post-Training on NVIDIA GPUs

**NeMo RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.

What you can expect:

- **Seamless integration with Hugging Face** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
- **High-performance implementation with Megatron Core**, supporting various parallelism techniques for large models (>100B) and large context lengths.
- **Efficient resource management using Ray**, enabling scalable and flexible deployment across different hardware configurations.
- **Flexibility** with a modular design that allows easy integration and customization.
- **Comprehensive documentation** that is both detailed and user-friendly, with practical examples.

## Table of Contents

- [Key Features](#key-features)
- [Install NeMo RL](#install-nemo-rl)
- [Quickstart](#quickstart)
- [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
- [Run Single Node SFT](#run-single-node-sft)
- [Run Multi-node SFT](#run-multi-node-sft)
- [Group Relative Policy Optimization (GRPO)](#group-relative-policy-optimization-grpo)
- [Run Single Node GRPO](#run-single-node-grpo)
- [Run Multi-node GRPO](#run-multi-node-grpo)
- [Set Up Clusters](#set-up-clusters)
- [Contributing](#contributing)
- [Licenses](#licenses)

## Key Features

_✅ Available Now | 🔜 Coming Soon (v0.2)_

- ✅ **Fast Generation:** Utilizes vLLM backend for optimized inference during evaluation and rollout.
- ✅ **Hugging Face Integration:** Seamlessly integrates with Hugging Face Transformers, supporting a wide range of pre-trained models (e.g., Qwen1.5, Llama models up to 8B parameters).
- ✅ **Scalable Distributed Training:** Leverages Fully Sharded Data Parallelism (FSDP) and a Ray-based infrastructure for efficient multi-GPU and multi-node training.
- ✅ **Multi-Environment Support:** Enables training across diverse environments and datasets.
- ✅ **Reinforcement Learning Algorithms:** Implements Group Relative Policy Optimization (GRPO) for effective preference alignment.
- ✅ **Supervised Fine-Tuning (SFT):** Supports standard supervised fine-tuning for instruction following and task adaptation.
- ✅ **Worker Isolation:** Ensures process isolation between RL actors, preventing unintended global state interference.
- 🔜 **Larger Model Support:** Native PyTorch support for models up to 70B parameters.
- 🔜 **Advanced Parallelism Techniques:** Implementation of FSDP2, Tensor Parallelism (TP), Pipeline Parallelism (PP), and sequence packing for enhanced training efficiency.
- 🔜 **Environment Isolation:** Provides dependency isolation between different components of the training pipeline.
- 🔜 **Direct Preference Optimization (DPO):** Integration of the Direct Preference Optimization algorithm for more direct preference learning.

## Install NeMo RL

Setup requires the `uv` Python package manager and Python 3.12 (or a compatible version).

```sh
# Install uv
pip install uv

# Create a virtual environment with Python 3.12
uv venv -p python3.12 .venv

# Activate the virtual environment (optional, but recommended for consistency)
# source .venv/bin/activate   # On Linux/macOS
# .venv\Scripts\activate      # On Windows

# Install NeMo RL with vLLM support
uv pip install -e '.[vllm]'

# To install with development and testing dependencies:
# uv pip install -e '.[dev,test]'

# Running scripts with `uv run` ensures a consistent environment.
# Example: uv run python examples/run_grpo_math.py
```

**Important Notes:**

- Use `uv run <command>` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
- Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware.
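
A quick way to confirm that a compatible PyTorch build can see your GPUs is a one-line sanity check (this snippet is generic and not part of NeMo RL):

```python
def check_torch() -> str:
    """Report the installed PyTorch version and whether CUDA is available."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed; install a build that matches your CUDA drivers"
    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"

print(check_torch())
```

If CUDA is reported as unavailable, check that your driver version matches the CUDA version your PyTorch build was compiled against.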

## Quickstart

Before running any experiments, remember to set your `HF_HOME` environment variable and your `WANDB_API_KEY` if you intend to use Weights & Biases for logging. For accessing Llama models, you might also need to log in using `huggingface-cli login`.

### Supervised Fine-Tuning (SFT)

We provide an example SFT experiment using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

#### Run Single Node SFT

The default SFT configuration is set to run on a single GPU. To start the experiment:

```sh
uv run python examples/run_sft.py
```

This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using a single GPU.

To use multiple GPUs on a single node, update the cluster configuration. This also lets you increase the model size and batch size:

```sh
uv run python examples/run_sft.py \
cluster.gpus_per_node=8
```

Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

#### Run Multi-node SFT

For distributed training across multiple nodes:

```sh
export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
```

```sh
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
sbatch \
ray.sub
```

### Group Relative Policy Optimization (GRPO)

We provide a reference GRPO experiment configuration for training on math benchmarks using the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.

#### Run Single Node GRPO

To run GRPO on a single GPU for `Llama-3.2-1B-Instruct`:

```sh
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

```sh
# Run the GRPO math example using a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
  cluster.gpus_per_node=8 \
  logger.num_val_samples_to_print=10
```

#### Run Multi-node GRPO

```sh
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
sbatch \
ray.sub
```

## Set Up Clusters

For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated [Cluster Start](docs/cluster.md) documentation.

## Contributing

We welcome contributions to NeMo RL! Please see our [Contributing Guidelines](https://github.com/NVIDIA/reinforcer/blob/main/CONTRIBUTING.md) for more information on how to get involved.

## Licenses

NVIDIA NeMo RL is licensed under the [Apache License 2.0](https://github.com/NVIDIA/reinforcer/blob/main/LICENSE).

---

<!-- docs/adding_new_models.md -->
# Add New Models

This guide outlines how to integrate and validate a new model within NeMo RL. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines.

## Importance of Log Probability Consistency in Training and Inference

In on-policy RL, we sample tokens (actions) from the latest version of the policy. This means the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation.

As an example, we would see errors in the naive KL estimate

$$\mathrm{KL}\left(\pi_{\text{sampling-framework}} \,\|\, \pi_{\text{training-framework}}\right) \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{\pi_{\text{sampling-framework}}(x_i)}{\pi_{\text{training-framework}}(x_i)},$$

where samples are drawn as $x \sim \pi_{\text{sampling-framework}}$.

In our tests, we use the exponentiated mean absolute difference in token log probabilities,

$$\exp\left(\frac{1}{N} \sum_{i=1}^{N} \left|\log \pi_{\text{training-framework}}(x_i) - \log \pi_{\text{sampling-framework}}(x_i)\right|\right),$$

as a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the sampling framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{sampling-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.
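
As an illustrative sketch (not NeMo RL's actual test code), both the naive KL estimate and the multiplicative probability error can be computed from per-token log probabilities collected for the same sampled tokens in each framework:

```python
import math

def naive_kl(sampling_logprobs, training_logprobs):
    """Monte Carlo estimate of KL(pi_sampling || pi_training) using tokens
    sampled from the sampling framework: mean of (log pi_s - log pi_t)."""
    n = len(sampling_logprobs)
    return sum(s - t for s, t in zip(sampling_logprobs, training_logprobs)) / n

def multiplicative_prob_error(sampling_logprobs, training_logprobs):
    """Exponentiated mean absolute log-probability difference; 1.0 means
    the two frameworks agree exactly on the sampled tokens."""
    n = len(sampling_logprobs)
    return math.exp(
        sum(abs(s - t) for s, t in zip(sampling_logprobs, training_logprobs)) / n
    )

# Per-token log probabilities for the same sampled tokens (toy values).
logp_sampling = [-0.10, -1.95, -0.42]
logp_training = [-0.12, -2.00, -0.40]

kl = naive_kl(logp_sampling, logp_training)
err = multiplicative_prob_error(logp_sampling, logp_training)
```

A value of `err` close to 1.0 indicates that the sampled tokens receive nearly identical probabilities in both frameworks.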

## Understand Discrepancies Between Backends

When validating models across different backends, you may encounter discrepancies in log probabilities. These differences can stem from various sources with effects ranging from negligible to significant:

- **Numerical precision differences**: Training and inference backends may differ in precision formats (FP32, FP16, BF16, FP8).
- Training may use mixed precision, while the inference backend may not.
- High-precision training with FP8 inference may not be numerically stable for certain models.
- Differences can occur at the layer level, with some layers in FP32, while others use lower precision.

- **Implementation variations**: Subtle differences in how layer implementations like softmax, layer normalization, or attention mechanisms are implemented.
- Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends.
- Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences.
- Softmax in training frameworks may be calculated differently than in inference backends for numerical stability.

- **KV/Prefill cache handling**: Differences in how key-value/prefill caches are managed during autoregressive generation.
- In some cases, disabling the inference backend cache can resolve discrepancies.

- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations.

- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`).

- **Prefill/Decoding kernel mismatch**: Different kernels for prefill and decoding phases may produce different log probabilities.
- Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels.

- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect.
- If weights are reshaped or reordered incorrectly, generations tend to be very wrong.
- In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses.

- **Batch size**: In some cases, `batch_size>1` may produce larger errors than `batch_size=1`.
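
When debugging these sources, a per-token comparison often localizes the problem faster than an aggregate metric. A hypothetical sketch (the tolerance and data are illustrative, not official values):

```python
import math

def flag_divergent_tokens(train_logprobs, infer_logprobs, tol=1.05):
    """Return (index, multiplicative_error) pairs for tokens whose per-token
    multiplicative probability error exp(|log pi_train - log pi_infer|)
    exceeds the tolerance."""
    flagged = []
    for i, (t, s) in enumerate(zip(train_logprobs, infer_logprobs)):
        ratio = math.exp(abs(t - s))
        if ratio > tol:
            flagged.append((i, ratio))
    return flagged

train_lp = [-0.11, -2.31, -0.05, -4.00]
infer_lp = [-0.10, -2.30, -0.06, -4.90]  # last token disagrees noticeably

flagged = flag_divergent_tokens(train_lp, infer_lp)
```

Inspecting which positions are flagged (e.g., only late tokens, or only rare tokens) can hint at cache handling, refit, or precision issues from the list above.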

When validating Hugging Face-based models, perform the following checks:

Ensure the generation log probabilities from inference backends like **vLLM** match those computed by Hugging Face. This comparison helps diagnose potential mismatches.

- **Test parallelism**
Verify consistency with other parallelism settings.

- **Variance**
Repeat tests multiple times (e.g., 10 runs) to confirm that behavior is deterministic or within acceptable variance.
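
The repetition check can be automated. Below is a minimal sketch in which `run_generation` is a hypothetical stand-in for a real greedy-decoding call:

```python
def run_generation(seed: int):
    """Hypothetical stub returning per-token log probs for a fixed prompt.
    Replace with a real backend call; the seed is ignored by this stub."""
    return [-0.11, -2.30, -0.05]

# Repeat the run 10 times and measure the per-token spread of log probs.
runs = [run_generation(seed) for seed in range(10)]
per_token_spread = [max(col) - min(col) for col in zip(*runs)]
max_spread = max(per_token_spread)
deterministic = max_spread == 0.0
```

With a real backend, a nonzero but tiny `max_spread` may be acceptable variance, while a large spread indicates non-deterministic kernels or caching effects worth investigating.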
### Additional Validation

- **Compare Megatron outputs**
Ensure the Megatron forward pass aligns with Hugging Face and the generation log probabilities from inference backends like **vLLM**.

- **Parallel settings**
Match the same parallelism configurations used for the Hugging Face-based tests.
When validating your model, you should analyze the results across different configurations.

---

By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets the requirements of NeMo RL.