From 2308c0ef14f9c25edcd009b37505d075bc7209f3 Mon Sep 17 00:00:00 2001
From: Bihan Rana
Date: Sat, 24 May 2025 10:56:27 +0545
Subject: [PATCH 1/4] Add dstack example

---
 docs/start/multinode.rst | 113 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index e278840956b..2069df92e4c 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -71,6 +71,119 @@ Slurm
 -----
 TBD
 
+dstack
+------
+If you want to run multi-node training with `dstack `_ on NVIDIA Cluster, you need to follow the following steps.
+
+1. Prerequisite
+
+   Once dstack is `installed `_, go ahead clone the repo, and run dstack init.
+
+   .. code-block:: bash
+
+       $ git clone https://github.com/dstackai/dstack
+       $ cd dstack
+       $ dstack init
+
+2. Create fleet
+
+   Before submitting distributed training runs, make sure to create a fleet
+   with a ``placement`` set to ``cluster``.
+
+   .. note::
+
+      For more details on how to use clusters with ``dstack``, check the
+      `Clusters `_ guide.
+
+3. Run a Ray cluster
+
+   If you want to use Ray with ``dstack``, you have to first run a Ray cluster.
+
+   The task below runs a Ray cluster on an existing fleet:
+
+   .. code-block:: yaml
+
+      type: task
+      name: ray-verl-cluster
+
+      nodes: 2
+
+      env:
+        - WANDB_API_KEY
+        - PYTHONUNBUFFERED=1
+        - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
+      commands:
+        - git clone https://github.com/volcengine/verl
+        - cd verl
+        - pip install --no-deps -e .
+        - pip install hf_transfer hf_xet
+        - |
+          if [ $DSTACK_NODE_RANK = 0 ]; then
+            python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
+            python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
+            ray start --head --port=6379;
+          else
+            ray start --address=$DSTACK_MASTER_NODE_IP:6379
+          fi
+
+      # Expose Ray dashboard port
+      ports:
+        - 8265
+
+      resources:
+        gpu: 80GB:8
+        shm_size: 128GB
+
+      # Save checkpoints on the instance
+      volumes:
+        - /checkpoints:/checkpoints
+
+4. Submit Ray jobs
+
+   Before you can submit Ray jobs, ensure to install ``ray`` locally:
+
+   .. code-block:: shell
+
+       $ pip install ray
+
+   Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
+
+   .. code-block:: shell
+
+       $ RAY_ADDRESS=http://localhost:8265
+       $ ray job submit \
+          -- python3 -m verl.trainer.main_ppo \
+          data.train_files=/root/data/gsm8k/train.parquet \
+          data.val_files=/root/data/gsm8k/test.parquet \
+          data.train_batch_size=256 \
+          data.max_prompt_length=512 \
+          data.max_response_length=256 \
+          actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
+          actor_rollout_ref.actor.optim.lr=1e-6 \
+          actor_rollout_ref.actor.ppo_mini_batch_size=64 \
+          actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
+          actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
+          actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+          actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+          actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+          critic.optim.lr=1e-5 \
+          critic.model.path=Qwen/Qwen2.5-7B-Instruct \
+          critic.ppo_micro_batch_size_per_gpu=4 \
+          algorithm.kl_ctrl.kl_coef=0.001 \
+          trainer.project_name=ppo_training \
+          trainer.experiment_name=qwen-2.5-7B \
+          trainer.val_before_train=False \
+          trainer.default_hdfs_dir=null \
+          trainer.n_gpus_per_node=8 \
+          trainer.nnodes=2 \
+          trainer.default_local_dir=/checkpoints \
+          trainer.save_freq=10 \
+          trainer.test_freq=10 \
+          trainer.total_epochs=15 2>&1 | tee verl_demo.log \
+          trainer.resume_mode=disable
+
 How to debug?
 ---------------------

From 87339db211b811d6dc0747ffa73dc040ba2f17b4 Mon Sep 17 00:00:00 2001
From: Bihan Rana
Date: Mon, 26 May 2025 19:57:57 +0545
Subject: [PATCH 2/4] Update dstack example

---
 docs/start/multinode.rst | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index 2069df92e4c..3e0696f4bb7 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -73,27 +73,23 @@ TBD
 
 dstack
 ------
-If you want to run multi-node training with `dstack `_ on NVIDIA Cluster, you need to follow the following steps.
+`dstack `_ simplifies distributed-training by providing streamlined alternative to K8s/Slurm.
+To run multi-node training jobs with dstack , you need to follow the following steps.
 
 1. Prerequisite
 
-   Once dstack is `installed `_, go ahead clone the repo, and run dstack init.
+   Once dstack is `installed `_, initialize the directory as a repo with ``dstack init``.
 
    .. code-block:: bash
 
-       $ git clone https://github.com/dstackai/dstack
-       $ cd dstack
+       $ mkdir dstack_task && cd dstack_task
        $ dstack init
 
 2. Create fleet
 
-   Before submitting distributed training runs, make sure to create a fleet
-   with a ``placement`` set to ``cluster``.
+   Before submitting distributed training jobs, make sure to create a `fleet `_.
+   dstack supports various cloud providers through ``cloud fleets`` and on-prem servers through ``SSH fleets``.
 
-   .. note::
-
-      For more details on how to use clusters with ``dstack``, check the
-      `Clusters `_ guide.
 
 3. Run a Ray cluster
 
@@ -184,6 +180,8 @@ If you want to run multi-node training with `dstack `_ on NV
           trainer.total_epochs=15 2>&1 | tee verl_demo.log \
           trainer.resume_mode=disable
 
+For more details on how to use ``dstack``, check the `dstack documentation `_.
+
 How to debug?
 ---------------------

From 634c491ca41354209b00e9b79c6e09d2ce31fbf7 Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Mon, 26 May 2025 17:55:42 +0200
Subject: [PATCH 3/4] Updated `dstack` example

---
 docs/start/multinode.rst | 191 ++++++++++++++++++++-------------------
 1 file changed, 99 insertions(+), 92 deletions(-)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index 3e0696f4bb7..7182967f4b0 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -73,114 +73,121 @@ TBD
 
 dstack
 ------
-`dstack `_ simplifies distributed-training by providing streamlined alternative to K8s/Slurm.
-To run multi-node training jobs with dstack , you need to follow the following steps.
+`dstackai/dstack `_ is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments
+without the need to use Kubernetes or Slurm.
 
-1. Prerequisite
+Prerequisite
+~~~~~~~~~~~~
+Once dstack is `installed `_, initialize the directory as a repo with ``dstack init``.
 
-   Once dstack is `installed `_, initialize the directory as a repo with ``dstack init``.
+.. code-block:: bash
 
-   .. code-block:: bash
+    mkdir myproject && cd myproject
+    dstack init
 
-       $ mkdir dstack_task && cd dstack_task
-       $ dstack init
+Create a fleet
+~~~~~~~~~~~~~~
 
-2. Create fleet
-
-   Before submitting distributed training jobs, make sure to create a `fleet `_.
-   dstack supports various cloud providers through ``cloud fleets`` and on-prem servers through ``SSH fleets``.
+Before submitting distributed training jobs, create a `dstack` `fleet `_.
 
+Run a Ray cluster task
+~~~~~~~~~~~~~~~~~~~~~~
 
-3. Run a Ray cluster
+Once the fleet is created, define a Ray cluster task, e.g. in ``ray-cluster.dstack.yml``:
 
-   If you want to use Ray with ``dstack``, you have to first run a Ray cluster.
+.. code-block:: yaml
 
-   The task below runs a Ray cluster on an existing fleet:
-
-   .. code-block:: yaml
+    type: task
+    name: ray-verl-cluster
+
+    nodes: 2
+
+    env:
+      - WANDB_API_KEY
+      - PYTHONUNBUFFERED=1
+      - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+    image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
+    commands:
+      - git clone https://github.com/volcengine/verl
+      - cd verl
+      - pip install --no-deps -e .
+      - pip install hf_transfer hf_xet
+      - |
+        if [ $DSTACK_NODE_RANK = 0 ]; then
+          python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
+          python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
+          ray start --head --port=6379;
+        else
+          ray start --address=$DSTACK_MASTER_NODE_IP:6379
+        fi
 
-      type: task
-      name: ray-verl-cluster
+    # Expose Ray dashboard port
+    ports:
+      - 8265
 
-      nodes: 2
+    resources:
+      gpu: 80GB:8
+      shm_size: 128GB
 
-      env:
-        - WANDB_API_KEY
-        - PYTHONUNBUFFERED=1
-        - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-
-      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
-      commands:
-        - git clone https://github.com/volcengine/verl
-        - cd verl
-        - pip install --no-deps -e .
-        - pip install hf_transfer hf_xet
-        - |
-          if [ $DSTACK_NODE_RANK = 0 ]; then
-            python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
-            python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
-            ray start --head --port=6379;
-          else
-            ray start --address=$DSTACK_MASTER_NODE_IP:6379
-          fi
-
-      # Expose Ray dashboard port
-      ports:
-        - 8265
-
-      resources:
-        gpu: 80GB:8
-        shm_size: 128GB
-
-      # Save checkpoints on the instance
-      volumes:
-        - /checkpoints:/checkpoints
-
-4. Submit Ray jobs
-
-   Before you can submit Ray jobs, ensure to install ``ray`` locally:
+    # Save checkpoints on the instance
+    volumes:
+      - /checkpoints:/checkpoints
+
+Now, if you run this task via `dstack apply`, it will automatically forward the Ray dashboard port to `localhost:8265`.
+
+.. code-block:: bash
+
+    dstack apply -f ray-cluster.dstack.yml
+
+As long as the `dstack apply` is attached, you can use `localhost:8265` to submit Ray jobs for execution.
+
+Submit Ray jobs
+~~~~~~~~~~~~~~~
+
+Before you can submit Ray jobs, ensure to install `ray`` locally:
 
-   .. code-block:: shell
+.. code-block:: shell
 
-       $ pip install ray
+    pip install ray
 
-   Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
+Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
 
-   .. code-block:: shell
-
-       $ RAY_ADDRESS=http://localhost:8265
-       $ ray job submit \
-          -- python3 -m verl.trainer.main_ppo \
-          data.train_files=/root/data/gsm8k/train.parquet \
-          data.val_files=/root/data/gsm8k/test.parquet \
-          data.train_batch_size=256 \
-          data.max_prompt_length=512 \
-          data.max_response_length=256 \
-          actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
-          actor_rollout_ref.actor.optim.lr=1e-6 \
-          actor_rollout_ref.actor.ppo_mini_batch_size=64 \
-          actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
-          actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
-          actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
-          actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
-          actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
-          critic.optim.lr=1e-5 \
-          critic.model.path=Qwen/Qwen2.5-7B-Instruct \
-          critic.ppo_micro_batch_size_per_gpu=4 \
-          algorithm.kl_ctrl.kl_coef=0.001 \
-          trainer.project_name=ppo_training \
-          trainer.experiment_name=qwen-2.5-7B \
-          trainer.val_before_train=False \
-          trainer.default_hdfs_dir=null \
-          trainer.n_gpus_per_node=8 \
-          trainer.nnodes=2 \
-          trainer.default_local_dir=/checkpoints \
-          trainer.save_freq=10 \
-          trainer.test_freq=10 \
-          trainer.total_epochs=15 2>&1 | tee verl_demo.log \
-          trainer.resume_mode=disable
-
-For more details on how to use ``dstack``, check the `dstack documentation `_.
+.. code-block:: shell
+
+    $ export RAY_ADDRESS=http://localhost:8265
+    $ ray job submit \
+       -- python3 -m verl.trainer.main_ppo \
+       data.train_files=/root/data/gsm8k/train.parquet \
+       data.val_files=/root/data/gsm8k/test.parquet \
+       data.train_batch_size=256 \
+       data.max_prompt_length=512 \
+       data.max_response_length=256 \
+       actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
+       actor_rollout_ref.actor.optim.lr=1e-6 \
+       actor_rollout_ref.actor.ppo_mini_batch_size=64 \
+       actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
+       actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
+       actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+       actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+       actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+       critic.optim.lr=1e-5 \
+       critic.model.path=Qwen/Qwen2.5-7B-Instruct \
+       critic.ppo_micro_batch_size_per_gpu=4 \
+       algorithm.kl_ctrl.kl_coef=0.001 \
+       trainer.project_name=ppo_training \
+       trainer.experiment_name=qwen-2.5-7B \
+       trainer.val_before_train=False \
+       trainer.default_hdfs_dir=null \
+       trainer.n_gpus_per_node=8 \
+       trainer.nnodes=2 \
+       trainer.default_local_dir=/checkpoints \
+       trainer.save_freq=10 \
+       trainer.test_freq=10 \
+       trainer.total_epochs=15 \
+       trainer.resume_mode=disable 2>&1 | tee verl_demo.log
+
+
+For more details on how `dstack` works, check out its `documentation `_.
 
 How to debug?
 ---------------------

From 0a362cf5c566171b8fad3251fdda7b68326ab3c7 Mon Sep 17 00:00:00 2001
From: Bihan Rana
Date: Mon, 26 May 2025 22:00:06 +0545
Subject: [PATCH 4/4] Minor Update

---
 docs/start/multinode.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/start/multinode.rst b/docs/start/multinode.rst
index 7182967f4b0..6caa53c3b29 100644
--- a/docs/start/multinode.rst
+++ b/docs/start/multinode.rst
@@ -144,7 +144,7 @@ As long as the `dstack apply` is attached, you can use `localhost:8265` to submi
 Submit Ray jobs
 ~~~~~~~~~~~~~~~
 
-Before you can submit Ray jobs, ensure to install `ray`` locally:
+Before you can submit Ray jobs, ensure to install `ray` locally:
 
 .. code-block:: shell
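The steps in this series tell the reader to create a fleet before submitting the Ray cluster task, but no patch shows a fleet definition. A minimal sketch follows; it assumes the fleet schema from dstack's documentation, and the file name ``fleet.dstack.yml``, the fleet name, and the node count are illustrative values, not taken from the patches above:

```yaml
# fleet.dstack.yml -- illustrative fleet definition, not part of the patches.
# Field names follow dstack's fleet schema; adjust the values to your setup.
type: fleet
name: verl-fleet

# Provision two interconnected instances so the Ray cluster task can span both
nodes: 2
placement: cluster

# Match the resources requested by the ray-verl-cluster task
resources:
  gpu: 80GB:8
```

Applying it with ``dstack apply -f fleet.dstack.yml`` provisions the nodes that the Ray cluster task defined in ``ray-cluster.dstack.yml`` then runs on.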