diff --git a/README.md b/README.md
index 1dacff454c..3381fef9f7 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,7 @@
   - [GRPO](#grpo)
     - [Single Node](#single-node)
     - [Multi-node](#multi-node)
+    - [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
   - [SFT](#sft)
     - [Single Node](#single-node-1)
     - [Multi-node](#multi-node-1)
@@ -28,7 +29,7 @@ What you can expect:
 
 ## Features
 
-✅ _Available now_ | 🔜 _Coming in v0.2_
+✅ _Available now_ | 🔜 _Coming in v0.3_
 
 - ✅ **Fast Generation** - vLLM backend for optimized inference
 - ✅ **HuggingFace Integration** - Works with 1-8B models (Qwen1.5, Llama)
@@ -36,10 +37,16 @@ What you can expect:
 - ✅ **Environment Support** - Support for multi-environment training.
 - ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning)
 - ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state)
+- ✅ **DPO Algorithm** - Direct Preference Optimization for alignment
 
-- 🔜 **Larger Model Support** - Native PyTorch support for models up to 70B parameters
-- 🔜 **Advanced Parallelism** - FSDP2, TP, SP, and sequence packing for efficient training
-- 🔜 **Environment Isolation** - Dependency isolation between components
+- ✅ **Larger Model Support** - Native PyTorch support for models up to 32B parameters
+- ✅ **Advanced Parallelism** - FSDP2, TP, SP, and sequence packing for efficient training
+- ✅ **Environment Isolation** - Dependency isolation between components
+
+- 🔜 **Improved Native Performance** - Improved training time for native PyTorch models
+- 🔜 **Megatron Policy** - Advanced parallelism in training with Megatron Core
+- 🔜 **Megatron Inference** - Megatron inference for day-0 support of new Megatron models
+- 🔜 **MoE Models** - Support for DeepseekV3 and Llama4
 
 ## Prerequisuites
 
@@ -99,12 +106,33 @@ uv run python examples/run_grpo_math.py \
 ```sh
 # Run from the root of NeMo-Reinforcer repo
 NUM_ACTOR_NODES=2
-# Add a timestamp to make each job name unique
-TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 
 # grpo_math_8b uses Llama-3.1-8B-Instruct model
 COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
-UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
+CONTAINER=YOUR_CONTAINER \
+MOUNTS="$PWD:$PWD" \
+sbatch \
+    --nodes=${NUM_ACTOR_NODES} \
+    --account=YOUR_ACCOUNT \
+    --job-name=YOUR_JOBNAME \
+    --partition=YOUR_PARTITION \
+    --time=4:0:0 \
+    --gres=gpu:8 \
+    ray.sub
+```
+
+##### GRPO Qwen2.5-32B
+
+```sh
+# Run from the root of NeMo-Reinforcer repo
+NUM_ACTOR_NODES=16
+
+# Download Qwen before the job starts to avoid spending time downloading during the training loop
+HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B
+
+# Ensure HF_HOME is included in your MOUNTS
+HF_HOME=/path/to/hf_home \
+COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
 CONTAINER=YOUR_CONTAINER \
 MOUNTS="$PWD:$PWD" \
 sbatch \
@@ -148,8 +176,6 @@ Refer to `examples/configs/sft.yaml` for a full list of parameters that can be o
 ```sh
 # Run from the root of NeMo-Reinforcer repo
 NUM_ACTOR_NODES=2
-# Add a timestamp to make each job name unique
-TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
 CONTAINER=YOUR_CONTAINER \
 MOUNTS="$PWD:$PWD" \
@@ -208,8 +234,6 @@ For distributed DPO training across multiple nodes, modify the following script
 # Run from the root of NeMo-Reinforcer repo
 ## number of nodes to use for your job
 NUM_ACTOR_NODES=2
-# Add a timestamp to make each job name unique
-TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 
 COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
 RAY_DEDUP_LOGS=0 \
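
The new Qwen2.5-32B recipe pre-downloads the model and notes that `HF_HOME` must be included in `MOUNTS`. A minimal sketch of one way to satisfy that, assuming `MOUNTS` accepts comma-separated `src:dst` pairs as in the `"$PWD:$PWD"` example above; `/path/to/hf_home` is a placeholder for a path on a filesystem shared by all nodes:

```sh
# Pre-download the weights once, before submitting the job.
export HF_HOME=/path/to/hf_home   # placeholder: must be visible to every node
huggingface-cli download Qwen/Qwen2.5-32B

# Mount both the repo and the HF cache into the container
# (assumed comma-separated src:dst syntax, matching the "$PWD:$PWD" example).
MOUNTS="$PWD:$PWD,$HF_HOME:$HF_HOME"
```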