diff --git a/docs/instant-clusters/axolotl.md b/docs/instant-clusters/axolotl.md index 7e29a955..aede936a 100644 --- a/docs/instant-clusters/axolotl.md +++ b/docs/instant-clusters/axolotl.md @@ -8,7 +8,7 @@ description: Learn how to deploy an Instant Cluster and use it to fine-tune a la This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and RunPod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups. -Follow the steps below to deploy your Cluster and start training your models efficiently. +Follow the steps below to deploy a cluster and start training your models efficiently. ## Step 1: Deploy an Instant Cluster @@ -19,35 +19,35 @@ Follow the steps below to deploy your Cluster and start training your models eff ## Step 2: Set up Axolotl on each Pod -1. Click your Cluster to expand the list of Pods. +1. Click your cluster to expand the list of Pods. 2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod. 3. Click **Connect**, then click **Web Terminal**. -4. Clone the Axolotl repository into the Pod's main directory: +4. In the terminal that opens, run this command to clone the Axolotl repository into the Pod's main directory: -```bash -git clone https://github.com/axolotl-ai-cloud/axolotl -``` + ```bash + git clone https://github.com/axolotl-ai-cloud/axolotl + ``` 5. Navigate to the `axolotl` directory: -```bash -cd axolotl -``` + ```bash + cd axolotl + ``` 6. Install the required packages: -```bash -pip3 install -U packaging setuptools wheel ninja -pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]' -``` + ```bash + pip3 install -U packaging setuptools wheel ninja + pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]' + ``` 7. Navigate to the `examples/llama-3` directory: -```bash -cd examples/llama-3 -``` + ```bash + cd examples/llama-3 + ``` -Repeat these steps for **each Pod** in your Cluster. +Repeat these steps for **each Pod** in your cluster. ## Step 3: Start the training process on each Pod @@ -90,11 +90,11 @@ Congrats! You've successfully trained a model using Axolotl on an Instant Cluste ## Step 4: Clean up -If you no longer need your Cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your Cluster to avoid incurring extra charges. +If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges. :::note -You can monitor your Cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab. +You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab. ::: @@ -103,7 +103,7 @@ You can monitor your Cluster usage and spending using the **Billing Explorer** a Now that you've successfully deployed and tested an Axolotl distributed training job on an Instant Cluster, you can: - **Fine-tune your own models** by modifying the configuration files in Axolotl to suit your specific requirements. 
-- **Scale your training** by adjusting the number of Pods in your Cluster (and the size of their containers and volumes) to handle larger models or datasets.
+- **Scale your training** by adjusting the number of Pods in your cluster (and the size of their containers and volumes) to handle larger models or datasets.
 - **Try different optimization techniques** such as DeepSpeed, FSDP (Fully Sharded Data Parallel), or other distributed training strategies.
 
 For more information on fine-tuning with Axolotl, refer to the [Axolotl documentation](https://github.com/OpenAccess-AI-Collective/axolotl).
 
diff --git a/docs/instant-clusters/index.md b/docs/instant-clusters/index.md
index 04d42362..58c20c9f 100644
--- a/docs/instant-clusters/index.md
+++ b/docs/instant-clusters/index.md
@@ -24,8 +24,9 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a
 
 Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:
 
-- [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch)
-- [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl)
+- [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch).
+- [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl).
+- [Deploy an Instant Cluster with SLURM](/instant-clusters/slurm).
 
 ## Use cases for Instant Clusters
 
@@ -69,12 +70,12 @@ The following environment variables are available in all Pods:
 | `PRIMARY_ADDR` / `MASTER_ADDR` | The address of the primary Pod. |
 | `PRIMARY_PORT` / `MASTER_PORT` | The port of the primary Pod (all ports are available). |
 | `NODE_ADDR` | The static IP of this Pod within the cluster network. |
-| `NODE_RANK` | The Cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod). |
-| `NUM_NODES` | The number of Pods in the Cluster. |
+| `NODE_RANK` | The cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod). |
+| `NUM_NODES` | The number of Pods in the cluster. |
 | `NUM_TRAINERS` | The number of GPUs per Pod. |
 | `HOST_NODE_ADDR` | Defined as `PRIMARY_ADDR:PRIMARY_PORT` for convenience. |
-| `WORLD_SIZE` | The total number of GPUs in the Cluster (`NUM_NODES` * `NUM_TRAINERS`). |
+| `WORLD_SIZE` | The total number of GPUs in the cluster (`NUM_NODES` * `NUM_TRAINERS`). |
 
-Each Pod receives a static IP (`NODE_ADDR`) on the overlay network. When a Cluster is deployed, the system designates one Pod as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.
+Each Pod receives a static IP (`NODE_ADDR`) on the overlay network. When a cluster is deployed, the system designates one Pod as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.
 
 The variables `MASTER_ADDR`/`PRIMARY_ADDR` and `MASTER_PORT`/`PRIMARY_PORT` are equivalent. The `MASTER_*` variables provide compatibility with tools that expect these legacy names.
diff --git a/docs/instant-clusters/pytorch.md b/docs/instant-clusters/pytorch.md
index 2ebbb875..b8cd307a 100644
--- a/docs/instant-clusters/pytorch.md
+++ b/docs/instant-clusters/pytorch.md
@@ -8,7 +8,7 @@ description: Learn how to deploy an Instant Cluster and run a multi-node process
 
 This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs.
By leveraging PyTorch's distributed processing capabilities and RunPod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups. -Follow the steps below to deploy your Cluster and start running distributed PyTorch workloads efficiently. +Follow the steps below to deploy a cluster and start running distributed PyTorch workloads efficiently. ## Step 1: Deploy an Instant Cluster @@ -19,22 +19,22 @@ Follow the steps below to deploy your Cluster and start running distributed PyTo ## Step 2: Clone the PyTorch demo into each Pod -1. Click your Cluster to expand the list of Pods. +1. Click your cluster to expand the list of Pods. 2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod. 3. Click **Connect**, then click **Web Terminal**. -4. Run this command to clone a basic `main.py` file into the Pod's main directory: +4. In the terminal that opens, run this command to clone a basic `main.py` file into the Pod's main directory: -```bash -git clone https://github.com/murat-runpod/torch-demo.git -``` + ```bash + git clone https://github.com/murat-runpod/torch-demo.git + ``` -Repeat these steps for **each Pod** in your Cluster. +Repeat these steps for **each Pod** in your cluster. ## Step 3: Examine the main.py file Let's look at the code in our `main.py` file: -```python +```python title="main.py" import os import torch import torch.distributed as dist @@ -80,7 +80,7 @@ This is the minimal code necessary for initializing a distributed environment. T Run this command in the web terminal of **each Pod** to start the PyTorch process: -```bash +```bash title="launcher.sh" export NCCL_DEBUG=WARN torchrun \ --nproc_per_node=$NUM_TRAINERS \ @@ -106,7 +106,7 @@ Running on rank 14/15 (local rank: 6), device: cuda:6 Running on rank 10/15 (local rank: 2), device: cuda:2 ``` -The first number refers to the global rank of the thread, spanning from `0` to `WORLD_SIZE-1` (`WORLD_SIZE` = the total number of GPUs in the Cluster). In our example there are two Pods of eight GPUs, so the global rank spans from 0-15. The second number is the local rank, which defines the order of GPUs within a single Pod (0-7 for this example). +The first number refers to the global rank of the thread, spanning from `0` to `WORLD_SIZE-1` (`WORLD_SIZE` = the total number of GPUs in the cluster). In our example there are two Pods of eight GPUs, so the global rank spans from 0-15. The second number is the local rank, which defines the order of GPUs within a single Pod (0-7 for this example). The specific number and order of ranks may be different in your terminal, and the global ranks listed will be different for each Pod. @@ -116,7 +116,7 @@ This diagram illustrates how local and global ranks are distributed across multi ## Step 5: Clean up -If you no longer need your Cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your Cluster to avoid incurring extra charges. +If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges. :::note @@ -128,8 +128,8 @@ You can monitor your cluster usage and spending using the **Billing Explorer** a Now that you've successfully deployed and tested a PyTorch distributed application on an Instant Cluster, you can: -- **Adapt your own PyTorch code** to run on the Cluster by modifying the distributed initialization in your scripts. 
-- **Scale your training** by adjusting the number of Pods in your Cluster to handle larger models or datasets.
+- **Adapt your own PyTorch code** to run on the cluster by modifying the distributed initialization in your scripts.
+- **Scale your training** by adjusting the number of Pods in your cluster to handle larger models or datasets.
 - **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
 - **Optimize performance** by experimenting with different distributed training strategies like Data Parallel (DP), Distributed Data Parallel (DDP), or Fully Sharded Data Parallel (FSDP).
diff --git a/docs/instant-clusters/slurm.md b/docs/instant-clusters/slurm.md
new file mode 100644
index 00000000..f824ae12
--- /dev/null
+++ b/docs/instant-clusters/slurm.md
@@ -0,0 +1,140 @@
+---
+title: Deploy with SLURM
+sidebar_position: 4
+description: Learn how to deploy an Instant Cluster and set up SLURM for distributed job scheduling.
+---
+
+# Deploy an Instant Cluster with SLURM
+
+This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management and resource allocation in high-performance computing environments. By leveraging SLURM on RunPod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
+
+Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.
+
+## Requirements
+
+- You've created a [RunPod account](https://www.runpod.io/console/home) and funded it with sufficient credits.
+- You have basic familiarity with the Linux command line.
+- You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).
+
+## Step 1: Deploy an Instant Cluster
+
+1. Open the [Instant Clusters page](https://www.runpod.io/console/cluster) on the RunPod web interface.
+2. Click **Create Cluster**.
+3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (RunPod PyTorch).
+4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.
+
+## Step 2: Clone the demo and install SLURM on each Pod
+
+To connect to a Pod:
+
+1. On the Instant Clusters page, click on the cluster you created to expand the list of Pods.
+2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.
+
+**On each Pod:**
+
+1. Click **Connect**, then click **Web Terminal**.
+2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:
+
+   ```bash
+   git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
+   ```
+
+3. Run this command to install SLURM:
+
+   ```bash
+   apt update && apt install -y slurm-wlm slurm-client munge
+   ```
+
+## Step 3: Overview of SLURM demo scripts
+
+The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:
+
+- `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
+- `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
+- `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
+- `test_batch.sh`: A sample SLURM job script for testing cluster functionality (a minimal sketch of what such a script might look like is shown below).
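+
+The exact contents of the repository's `test_batch.sh` may differ, but a minimal SLURM job script along these lines would request both nodes, ask for a GPU on each, and print each node's hostname so you can verify that the job ran across the whole cluster:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=test_simple        # Job name used in squeue and output file names
+#SBATCH --nodes=2                     # Run on both nodes in the cluster
+#SBATCH --ntasks-per-node=1           # One task per node
+#SBATCH --gres=gpu:1                  # Request one GPU per node
+#SBATCH --output=test_simple_%j.out   # %j expands to the job ID
+
+# Print the hostname and the visible GPUs from every node in the allocation.
+srun hostname
+srun nvidia-smi -L
+```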
+
+## Step 4: Run the SLURM installation script on each Pod
+
+Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes and must be identical across all Pods in your cluster.
+
+```bash
+./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
+```
+
+This script automates the configuration of a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It sets up both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
+
+## Step 5: Start SLURM services
+
+:::tip
+
+If you're not sure which Pod is the primary node, run the command `echo $HOSTNAME` in the web terminal of each Pod and look for `node-0`.
+
+:::
+
+1. **On the primary node** (`node-0`), start the controller service:
+
+   ```bash
+   slurmctld -D
+   ```
+
+2. Use the web interface to open a second terminal **on the primary node** and start the worker service:
+
+   ```bash
+   slurmd -D
+   ```
+
+3. **On the secondary node** (`node-1`), start the worker service:
+
+   ```bash
+   slurmd -D
+   ```
+
+After running these commands, you should see output indicating that each service has started successfully. The primary node runs both the controller and a worker, and the `-D` flag keeps each service running in the foreground, so each command needs its own terminal.
+
+## Step 6: Test your SLURM cluster
+
+1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:
+
+   ```bash
+   sinfo
+   ```
+
+   You should see output showing both nodes in your cluster, with a state of `idle` if everything is working correctly.
+
+2. Run this command to test GPU availability across both nodes:
+
+   ```bash
+   srun --nodes=2 --gres=gpu:1 nvidia-smi -L
+   ```
+
+   This command should list all GPUs across both nodes.
+
+## Step 7: Submit the SLURM job script
+
+Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:
+
+```bash
+sbatch test_batch.sh
+```
+
+Check the output file created by the test (`test_simple_[JOBID].out`) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.
+
+## Step 8: Clean up
+
+If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.
+
+:::note
+
+You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab.
+
+:::
+
+## Next steps
+
+Now that you've successfully deployed and tested a SLURM cluster on RunPod, you can:
+
+- **Adapt your own distributed workloads** to run using SLURM job scripts (see the example job script after this list).
+- **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
+- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
+- **Optimize performance** by experimenting with different distributed training strategies.
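+
+For example, a job script for a multi-node PyTorch run might look something like the sketch below. It assumes a `train.py` entry point of your own (or the `main.py` from the [PyTorch tutorial](/instant-clusters/pytorch)) and that RunPod environment variables such as `PRIMARY_ADDR`, `PRIMARY_PORT`, and `NUM_TRAINERS` are visible to the job; adjust the GPU counts and paths to match your cluster.
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=torch-distributed
+#SBATCH --nodes=2                   # Use both Pods in the cluster
+#SBATCH --ntasks-per-node=1         # One torchrun launcher per node
+#SBATCH --gres=gpu:8                # Request all eight GPUs on each node
+#SBATCH --output=torch_%j.out
+
+# torchrun spawns one worker per GPU on each node and uses the primary
+# Pod's address and port as the rendezvous endpoint.
+srun torchrun \
+  --nnodes=$SLURM_NNODES \
+  --nproc_per_node=$NUM_TRAINERS \
+  --rdzv_id=$SLURM_JOB_ID \
+  --rdzv_backend=c10d \
+  --rdzv_endpoint=$PRIMARY_ADDR:$PRIMARY_PORT \
+  train.py
+```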