From 2308c0ef14f9c25edcd009b37505d075bc7209f3 Mon Sep 17 00:00:00 2001
From: Bihan Rana
Date: Sat, 24 May 2025 10:56:27 +0545
Subject: [PATCH 1/4] Add dstack example

---
 docs/start/multinode.rst | 113 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index e278840956b..2069df92e4c 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -71,6 +71,119 @@ Slurm
 -----
 TBD
 
+dstack
+------
+If you want to run multi-node training with `dstack `_ on NVIDIA Cluster, you need to follow the following steps.
+
+1. Prerequisite
+
+   Once dstack is `installed `_, go ahead clone the repo, and run dstack init.
+
+   .. code-block:: bash
+
+       $ git clone https://github.com/dstackai/dstack
+       $ cd dstack
+       $ dstack init
+
+2. Create fleet
+
+   Before submitting distributed training runs, make sure to create a fleet
+   with a ``placement`` set to ``cluster``.
+
+   .. note::
+
+      For more details on how to use clusters with ``dstack``, check the
+      `Clusters `_ guide.
+
+3. Run a Ray cluster
+
+   If you want to use Ray with ``dstack``, you have to first run a Ray cluster.
+
+   The task below runs a Ray cluster on an existing fleet:
+
+   .. code-block:: yaml
+
+      type: task
+      name: ray-verl-cluster
+
+      nodes: 2
+
+      env:
+        - WANDB_API_KEY
+        - PYTHONUNBUFFERED=1
+        - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
+      commands:
+        - git clone https://github.com/volcengine/verl
+        - cd verl
+        - pip install --no-deps -e .
+        - pip install hf_transfer hf_xet
+        - |
+          if [ $DSTACK_NODE_RANK = 0 ]; then
+            python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
+            python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
+            ray start --head --port=6379;
+          else
+            ray start --address=$DSTACK_MASTER_NODE_IP:6379
+          fi
+
+      # Expose Ray dashboard port
+      ports:
+        - 8265
+
+      resources:
+        gpu: 80GB:8
+        shm_size: 128GB
+
+      # Save checkpoints on the instance
+      volumes:
+        - /checkpoints:/checkpoints
+
+4. Submit Ray jobs
+
+   Before you can submit Ray jobs, ensure to install ``ray`` locally:
+
+   .. code-block:: shell
+
+       $ pip install ray
+
+   Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
+
+   .. code-block:: shell
+
+       $ RAY_ADDRESS=http://localhost:8265
+       $ ray job submit \
+          -- python3 -m verl.trainer.main_ppo \
+          data.train_files=/root/data/gsm8k/train.parquet \
+          data.val_files=/root/data/gsm8k/test.parquet \
+          data.train_batch_size=256 \
+          data.max_prompt_length=512 \
+          data.max_response_length=256 \
+          actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
+          actor_rollout_ref.actor.optim.lr=1e-6 \
+          actor_rollout_ref.actor.ppo_mini_batch_size=64 \
+          actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
+          actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
+          actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+          actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+          actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+          critic.optim.lr=1e-5 \
+          critic.model.path=Qwen/Qwen2.5-7B-Instruct \
+          critic.ppo_micro_batch_size_per_gpu=4 \
+          algorithm.kl_ctrl.kl_coef=0.001 \
+          trainer.project_name=ppo_training \
+          trainer.experiment_name=qwen-2.5-7B \
+          trainer.val_before_train=False \
+          trainer.default_hdfs_dir=null \
+          trainer.n_gpus_per_node=8 \
+          trainer.nnodes=2 \
+          trainer.default_local_dir=/checkpoints \
+          trainer.save_freq=10 \
+          trainer.test_freq=10 \
+          trainer.total_epochs=15 2>&1 | tee verl_demo.log \
+          trainer.resume_mode=disable
+
 How to debug?
 ---------------------

From 87339db211b811d6dc0747ffa73dc040ba2f17b4 Mon Sep 17 00:00:00 2001
From: Bihan Rana
Date: Mon, 26 May 2025 19:57:57 +0545
Subject: [PATCH 2/4] Update dstack example

---
 docs/start/multinode.rst | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index 2069df92e4c..3e0696f4bb7 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -73,27 +73,23 @@ TBD
 
 dstack
 ------
-If you want to run multi-node training with `dstack `_ on NVIDIA Cluster, you need to follow the following steps.
+`dstack `_ simplifies distributed-training by providing streamlined alternative to K8s/Slurm.
+To run multi-node training jobs with dstack , you need to follow the following steps.
 
 1. Prerequisite
 
-   Once dstack is `installed `_, go ahead clone the repo, and run dstack init.
+   Once dstack is `installed `_, initialize the directory as a repo with ``dstack init``.
 
    .. code-block:: bash
 
-       $ git clone https://github.com/dstackai/dstack
-       $ cd dstack
+       $ mkdir dstack_task && cd dstack_task
        $ dstack init
 
 2. Create fleet
 
-   Before submitting distributed training runs, make sure to create a fleet
-   with a ``placement`` set to ``cluster``.
+   Before submitting distributed training jobs, make sure to create a `fleet `_.
+   dstack supports various cloud providers through ``cloud fleets`` and on-prem servers through ``SSH fleets``.
 
-   .. note::
-
-      For more details on how to use clusters with ``dstack``, check the
-      `Clusters `_ guide.
 
 3. Run a Ray cluster
 
@@ -184,6 +180,8 @@ If you want to run multi-node training with `dstack `_ on NV
           trainer.total_epochs=15 2>&1 | tee verl_demo.log \
           trainer.resume_mode=disable
 
+For more details on how to use ``dstack``, check the `dstack documentation `_.
+
 How to debug?
 ---------------------

From 634c491ca41354209b00e9b79c6e09d2ce31fbf7 Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Mon, 26 May 2025 17:55:42 +0200
Subject: [PATCH 3/4] Updated `dstack` example

---
 docs/start/multinode.rst | 191 ++++++++++++++++++++-------------------
 1 file changed, 99 insertions(+), 92 deletions(-)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index 3e0696f4bb7..7182967f4b0 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -73,114 +73,121 @@ TBD
 
 dstack
 ------
-`dstack `_ simplifies distributed-training by providing streamlined alternative to K8s/Slurm.
-To run multi-node training jobs with dstack , you need to follow the following steps.
+`dstackai/dstack `_ is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments
+without the need to use Kubernetes or Slurm.
 
-1. Prerequisite
+Prerequisite
+~~~~~~~~~~~~
+Once dstack is `installed `_, initialize the directory as a repo with ``dstack init``.
 
-   Once dstack is `installed `_, initialize the directory as a repo with ``dstack init``.
+.. code-block:: bash
 
-   .. code-block:: bash
+    mkdir myproject && cd myproject
+    dstack init
 
-       $ mkdir dstack_task && cd dstack_task
-       $ dstack init
+Create a fleet
+~~~~~~~~~~~~~~
 
-2. Create fleet
-
-   Before submitting distributed training jobs, make sure to create a `fleet `_.
-   dstack supports various cloud providers through ``cloud fleets`` and on-prem servers through ``SSH fleets``.
+Before submitting distributed training jobs, create a `dstack` `fleet `_.
 
+Run a Ray cluster task
+~~~~~~~~~~~~~~~~~~~~~~
 
-3. Run a Ray cluster
+Once the fleet is created, define a Ray cluster task, e.g. in ``ray-cluster.dstack.yml``:
 
-   If you want to use Ray with ``dstack``, you have to first run a Ray cluster.
+.. code-block:: yaml
 
-   The task below runs a Ray cluster on an existing fleet:
-
-   .. code-block:: yaml
+    type: task
+    name: ray-verl-cluster
+
+    nodes: 2
+
+    env:
+      - WANDB_API_KEY
+      - PYTHONUNBUFFERED=1
+      - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+    image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
+    commands:
+      - git clone https://github.com/volcengine/verl
+      - cd verl
+      - pip install --no-deps -e .
+      - pip install hf_transfer hf_xet
+      - |
+        if [ $DSTACK_NODE_RANK = 0 ]; then
+          python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
+          python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
+          ray start --head --port=6379;
+        else
+          ray start --address=$DSTACK_MASTER_NODE_IP:6379
+        fi
 
-      type: task
-      name: ray-verl-cluster
+    # Expose Ray dashboard port
+    ports:
+      - 8265
 
-      nodes: 2
+    resources:
+      gpu: 80GB:8
+      shm_size: 128GB
 
-      env:
-        - WANDB_API_KEY
-        - PYTHONUNBUFFERED=1
-        - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-
-      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
-      commands:
-        - git clone https://github.com/volcengine/verl
-        - cd verl
-        - pip install --no-deps -e .
-        - pip install hf_transfer hf_xet
-        - |
-          if [ $DSTACK_NODE_RANK = 0 ]; then
-            python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
-            python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
-            ray start --head --port=6379;
-          else
-            ray start --address=$DSTACK_MASTER_NODE_IP:6379
-          fi
-
-      # Expose Ray dashboard port
-      ports:
-        - 8265
-
-      resources:
-        gpu: 80GB:8
-        shm_size: 128GB
-
-      # Save checkpoints on the instance
-      volumes:
-        - /checkpoints:/checkpoints
-
-4. Submit Ray jobs
-
-   Before you can submit Ray jobs, ensure to install ``ray`` locally:
+    # Save checkpoints on the instance
+    volumes:
+      - /checkpoints:/checkpoints
+
+Now, if you run this task via `dstack apply`, it will automatically forward the Ray dashboard port to `localhost:8265`.
+
+.. code-block:: bash
+
+    dstack apply -f ray-cluster.dstack.yml
+
+As long as the `dstack apply` is attached, you can use `localhost:8265` to submit Ray jobs for execution.
+
+Submit Ray jobs
+~~~~~~~~~~~~~~~
+
+Before you can submit Ray jobs, ensure to install `ray`` locally:
 
-   .. code-block:: shell
+.. code-block:: shell
 
-       $ pip install ray
+    pip install ray
 
-   Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
+Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
 
-   .. code-block:: shell
-
-       $ RAY_ADDRESS=http://localhost:8265
-       $ ray job submit \
-          -- python3 -m verl.trainer.main_ppo \
-          data.train_files=/root/data/gsm8k/train.parquet \
-          data.val_files=/root/data/gsm8k/test.parquet \
-          data.train_batch_size=256 \
-          data.max_prompt_length=512 \
-          data.max_response_length=256 \
-          actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
-          actor_rollout_ref.actor.optim.lr=1e-6 \
-          actor_rollout_ref.actor.ppo_mini_batch_size=64 \
-          actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
-          actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
-          actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
-          actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
-          actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
-          critic.optim.lr=1e-5 \
-          critic.model.path=Qwen/Qwen2.5-7B-Instruct \
-          critic.ppo_micro_batch_size_per_gpu=4 \
-          algorithm.kl_ctrl.kl_coef=0.001 \
-          trainer.project_name=ppo_training \
-          trainer.experiment_name=qwen-2.5-7B \
-          trainer.val_before_train=False \
-          trainer.default_hdfs_dir=null \
-          trainer.n_gpus_per_node=8 \
-          trainer.nnodes=2 \
-          trainer.default_local_dir=/checkpoints \
-          trainer.save_freq=10 \
-          trainer.test_freq=10 \
-          trainer.total_epochs=15 2>&1 | tee verl_demo.log \
-          trainer.resume_mode=disable
-
-For more details on how to use ``dstack``, check the `dstack documentation `_.
+.. code-block:: shell
+
+    $ export RAY_ADDRESS=http://localhost:8265
+    $ ray job submit \
+       -- python3 -m verl.trainer.main_ppo \
+       data.train_files=/root/data/gsm8k/train.parquet \
+       data.val_files=/root/data/gsm8k/test.parquet \
+       data.train_batch_size=256 \
+       data.max_prompt_length=512 \
+       data.max_response_length=256 \
+       actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
+       actor_rollout_ref.actor.optim.lr=1e-6 \
+       actor_rollout_ref.actor.ppo_mini_batch_size=64 \
+       actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
+       actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
+       actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+       actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+       actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+       critic.optim.lr=1e-5 \
+       critic.model.path=Qwen/Qwen2.5-7B-Instruct \
+       critic.ppo_micro_batch_size_per_gpu=4 \
+       algorithm.kl_ctrl.kl_coef=0.001 \
+       trainer.project_name=ppo_training \
+       trainer.experiment_name=qwen-2.5-7B \
+       trainer.val_before_train=False \
+       trainer.default_hdfs_dir=null \
+       trainer.n_gpus_per_node=8 \
+       trainer.nnodes=2 \
+       trainer.default_local_dir=/checkpoints \
+       trainer.save_freq=10 \
+       trainer.test_freq=10 \
+       trainer.total_epochs=15 \
+       trainer.resume_mode=disable 2>&1 | tee verl_demo.log
+
+
+For more details on how `dstack` works, check out its `documentation `_.
 
 How to debug?
 ---------------------

From 0a362cf5c566171b8fad3251fdda7b68326ab3c7 Mon Sep 17 00:00:00 2001
From: Bihan Rana
Date: Mon, 26 May 2025 22:00:06 +0545
Subject: [PATCH 4/4] Minor Update

---
 docs/start/multinode.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index 7182967f4b0..6caa53c3b29 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -144,7 +144,7 @@ As long as the `dstack apply` is attached, you can use `localhost:8265` to submi
 Submit Ray jobs
 ~~~~~~~~~~~~~~~
 
-Before you can submit Ray jobs, ensure to install `ray`` locally:
+Before you can submit Ray jobs, ensure to install `ray` locally:
 
 .. code-block:: shell
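The steps in this series tell the reader to create a fleet before submitting the Ray cluster task, but no patch shows a fleet definition. A minimal sketch follows; it assumes the fleet schema from dstack's documentation, and the file name ``fleet.dstack.yml``, the fleet name, and the node count are illustrative values, not taken from the patches above:

```yaml
# fleet.dstack.yml -- illustrative fleet definition, not part of the patches.
# Field names follow dstack's fleet schema; adjust the values to your setup.
type: fleet
name: verl-fleet

# Provision two interconnected instances so the Ray cluster task can span both
nodes: 2
placement: cluster

# Match the resources requested by the ray-verl-cluster task
resources:
  gpu: 80GB:8
```

Applying it with ``dstack apply -f fleet.dstack.yml`` provisions the nodes that the Ray cluster task defined in ``ray-cluster.dstack.yml`` then runs on.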