
**Reminder**: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
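
As a sketch, the environment setup might look like the following. The paths and key below are placeholders, not values from this repo; replace them with your own:

```sh
# Placeholder values -- substitute your own paths and keys.
export HF_HOME=/path/to/hf_cache          # where Hugging Face models and datasets are cached
export WANDB_API_KEY=your_wandb_api_key   # only needed if logging to Weights & Biases

# Llama checkpoints are gated, so authenticate once per machine:
# huggingface-cli login
```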

### SFT

We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

#### Single Node

The experiment is set up to run on 8 GPUs. If using a machine that has access to 8 GPUs, you can launch the experiment as follows:

```sh
uv run python examples/run_sft.py
```

This trains `Llama-3.1-8B` on 8 GPUs. To run on a single GPU, we need to override a few of the experiment settings: replace the 8B model with a smaller 1B model, decrease the batch size, and update the cluster configuration to use a single GPU:

```sh
uv run python examples/run_sft.py \
policy.model_name="meta-llama/Llama-3.2-1B" \
policy.train_global_batch_size=16 \
sft.val_global_batch_size=16 \
cluster.gpus_per_node=1
```

Refer to [sft.yaml](examples/configs/sft.yaml) for a full list of parameters that can be overridden.

#### Multi-node

For distributed training across multiple nodes:

Before running any `uv run` command, set `UV_CACHE_DIR` to a directory that all workers can read:
```sh
export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
```

```sh
# Run from the root of NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# SFT experiment uses Llama-3.1-8B model
COMMAND="uv pip install -e .; uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
RAY_DEDUP_LOGS=0 \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
```
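
Note that `TIMESTAMP` is defined above but not referenced in the sample `sbatch` invocation; presumably it is meant to keep each job name (or results directory) unique across submissions, along these lines (a sketch, not part of `ray.sub`):

```sh
TIMESTAMP=$(date +%Y%m%d_%H%M%S)      # e.g. 20240101_120000
JOB_NAME="sft-llama8b-${TIMESTAMP}"   # unique per submission
echo "${JOB_NAME}"
# then pass it along: sbatch --job-name="${JOB_NAME}" ... ray.sub
```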

### GRPO

We provide a reference GRPO experiment config for math benchmarks that trains on the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.

#### Single Node

To run GRPO on a single GPU for `Llama-3.2-1B-Instruct`:

```sh
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

```sh
# Run the GRPO math example with a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
cluster.gpus_per_node=8
```

You can override any of the parameters listed in the YAML configuration file. For example:

```sh
uv run python examples/run_grpo_math.py \
policy.model_name="Qwen/Qwen2-1.5B" \
checkpointing.checkpoint_dir="results/qwen1_5b_math" \
logger.wandb_enabled=True \
logger.wandb.name="grpo-qwen1_5b_math" \
  logger.num_val_samples_to_print=10
```

#### Multi-node

For distributed training across multiple nodes:

For the general multi-node setup, refer to the [SFT multi-node](#multi-node) documentation. The only thing that differs from SFT is the `COMMAND`:

```sh
# Run from the root of NeMo-Reinforcer repo
```