Add tutorial for Instant Clusters + SLURM #221
Merged
22 commits:
8a5201a  Initial commit (muhsinking)
5b21db1  Minor update (muhsinking)
16bc931  Minor update (muhsinking)
a2248a9  Finish basic steps (muhsinking)
446a358  Update formatting (muhsinking)
2f34e9f  Fix formatting (muhsinking)
f96e457  Reorder placeholder explanations (muhsinking)
7987dbd  Normalize cluster capitalization (muhsinking)
90127fa  Remove link to slurm example repo (muhsinking)
28812c0  Specify which node to run test on (muhsinking)
307f55c  Add requirements section (muhsinking)
3f6451a  Add title to pytorch code sample (muhsinking)
0875e3c  Fix step numbers, remove sudo (muhsinking)
19364f0  Merge branch 'main' into ic-slurm (muhsinking)
f3bb776  Remove step 4 (muhsinking)
8d68b82  Update installation instructions (muhsinking)
39b3d1d  Remove step 4 (muhsinking)
c36145d  Fix steps (muhsinking)
594460f  Merge branch 'main' into ic-slurm (muhsinking)
af419d1  Merge branch 'main' into ic-slurm (muhsinking)
2b286f5  Fix typo (muhsinking)
9df958a  Merge branch 'ic-slurm' of https://github.com/runpod/docs into ic-slurm (muhsinking)

---
title: Deploy with SLURM
sidebar_position: 4
description: Learn how to deploy an Instant Cluster and set up SLURM for distributed job scheduling.
---

# Deploy an Instant Cluster with SLURM

This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on RunPod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.

Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.

## Requirements

- You've created a [RunPod account](https://www.runpod.io/console/home) and funded it with sufficient credits.
- You have basic familiarity with the Linux command line.
- You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).

## Step 1: Deploy an Instant Cluster

1. Open the [Instant Clusters page](https://www.runpod.io/console/cluster) on the RunPod web interface.
2. Click **Create Cluster**.
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (RunPod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Clone the SLURM demo into each Pod

To connect to a Pod:

1. On the Instant Clusters page, click on the cluster you created to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.

**On each Pod:**

1. Click **Connect**, then click **Web Terminal**.
2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:

```bash
git clone https://github.com/pandyamarut/slurm_example.git
cd slurm_example
```

3. Run this command to make the scripts executable:

```bash
chmod +x create_gres_conf.sh create_slurm_conf.sh install.sh setup.sh test_batch.sh
```

## Step 3: Overview of SLURM demo scripts

The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:

- `setup.sh`: Prepares the Pod environment with necessary dependencies and utilities for SLURM and distributed computing.
- `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
- `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
- `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
- `test_batch.sh`: A sample SLURM job script for testing cluster functionality (see the illustrative sketch after this list).
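
For orientation, a minimal SLURM batch script looks something like the sketch below. This is an illustrative example only, not necessarily the exact contents of `test_batch.sh`; the job name, resource requests, and output pattern are assumptions based on the output file described in Step 9.

```bash
#!/bin/bash
#SBATCH --job-name=test_simple        # job name (assumed)
#SBATCH --output=test_simple_%j.out   # %j expands to the SLURM job ID
#SBATCH --nodes=2                     # run on both nodes in the cluster
#SBATCH --gres=gpu:1                  # request one GPU per node

# Print the hostname and visible GPUs from every allocated node.
srun hostname
srun nvidia-smi -L
```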

## Step 4: Run the setup script

Run `setup.sh` **on each Pod** to prepare the development environment.

```bash
./setup.sh
```

This script prepares the Pod development environment by installing Linux dependencies and setting up a Python virtual environment with `uv` (a [Python package manager](https://github.com/astral-sh/uv)) and Python 3.11.
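
For reference, the general shape of that environment setup might look like the commands below. This is a hedged sketch of a typical `uv` workflow, not the actual contents of `setup.sh`.

```bash
# Install uv using its official installer, then create and activate
# a Python 3.11 virtual environment.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.11
source .venv/bin/activate
```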

If you want to set up GitHub on your Pods, uncomment the last line of `setup.sh`, which runs the following command:

```bash
./scripts/setup_github.sh "<YOUR_GITHUB_EMAIL>" "<YOUR_NAME>"
```

## Step 5: Get the hostname and IP address for each Pod

Before running the installation script, you'll need to get the hostname and IP address for each Pod.

**On each Pod:**

1. Run this command to get the IP address of the node:

```bash
echo $NODE_ADDR
```

If this outputs `10.65.0.2`, this is the **primary node**. If it outputs `10.65.0.3`, this is the **secondary node**.

2. Run this command to get the Pod's hostname:

```bash
echo $HOSTNAME
```

This should output a string of random numbers and letters, similar to:

```bash
4f653f31b496
```

3. Make a note of the hostname for the primary (`$NODE_ADDR` = `10.65.0.2`) and secondary (`$NODE_ADDR` = `10.65.0.3`) nodes.
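
A quick way to see both values at once on each Pod (the secondary hostname below is hypothetical):

```bash
echo "$HOSTNAME -> $NODE_ADDR"
# Example notes to keep for the next step:
#   primary:   4f653f31b496 (10.65.0.2)
#   secondary: 9c2d1e0a7b58 (10.65.0.3)
```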

## Step 6: Install SLURM on each Pod

Now run the installation script **on each Pod**:

```bash
./install.sh "[MUNGE_SECRET_KEY]" [HOSTNAME_PRIMARY] [HOSTNAME_SECONDARY] 10.65.0.2 10.65.0.3
```

Replace:

- `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes and must be identical across all Pods in your cluster.
- `[HOSTNAME_PRIMARY]` with the hostname of the primary node (`$NODE_ADDR` = `10.65.0.2`).
- `[HOSTNAME_SECONDARY]` with the hostname of the secondary node (`$NODE_ADDR` = `10.65.0.3`).

This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
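
For example, using the hostnames gathered in Step 5, the call might look like this (the secret key and the secondary hostname below are made up):

```bash
# Run the same command, with identical arguments, on both Pods.
./install.sh "a-long-random-string" 4f653f31b496 9c2d1e0a7b58 10.65.0.2 10.65.0.3
```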

## Step 7: Start SLURM services

1. **On the primary node** (`$NODE_ADDR` = `10.65.0.2`), start the SLURM controller daemon (the primary node runs both SLURM services):

```bash
sudo slurmctld -D
```

2. Use the web interface to open a second terminal **on the primary node** and run:

```bash
sudo slurmd -D
```

3. **On the secondary node** (`$NODE_ADDR` = `10.65.0.3`), run:

```bash
sudo slurmd -D
```

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal.
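
If you'd prefer not to dedicate a terminal to each service, one alternative (not part of the demo scripts, just a common pattern) is to detach the daemons with `nohup` and send their output to log files:

```bash
# On the primary node: controller plus worker, detached from the terminal.
nohup sudo slurmctld -D > /var/log/slurmctld.log 2>&1 &
nohup sudo slurmd -D > /var/log/slurmd.log 2>&1 &

# On the secondary node: worker only.
nohup sudo slurmd -D > /var/log/slurmd.log 2>&1 &
```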

## Step 8: Test your SLURM cluster

1. Run this command **on the primary node** to check the status of your nodes:

```bash
sinfo
```

You should see output showing both nodes in your cluster, with a state of "idle" if everything is working correctly.
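
The output should look roughly like the following; the partition name, time limit, and node list shown here are illustrative:

```bash
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      2   idle 4f653f31b496,9c2d1e0a7b58
```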

2. Run this command to test GPU availability across both nodes:

```bash
srun --nodes=2 --gres=gpu:1 nvidia-smi -L
```

This command should list one GPU from each of your two nodes.
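
Expect two lines in total, one from each node, along the lines of (UUIDs shortened and made up):

```bash
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-1a2b3c4d-...)
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-5e6f7a8b-...)
```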

## Step 9: Submit the SLURM job script

Run the following command **on the primary node** (`$NODE_ADDR` = `10.65.0.2`) to submit the test job script and confirm that your cluster is working properly:

```bash
sbatch test_batch.sh
```

Check the output file created by the test (`test_simple_[JOBID].out`) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.
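
For example, you can watch the job with `squeue` and print the output file once it finishes; the job ID below is a placeholder for the ID that `sbatch` reports:

```bash
squeue                   # shows the job while it is pending or running
cat test_simple_42.out   # replace 42 with your job ID
```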

## Step 10: Clean up

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing), under the **Cluster** tab.

:::

## Next steps

Now that you've successfully deployed and tested a SLURM cluster on RunPod, you can:

- **Adapt your own distributed workloads** to run using SLURM job scripts.
- **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
- **Optimize performance** by experimenting with different distributed training strategies.