Add tutorial for Instant Clusters + SLURM #221
Merged
22 commits:
8a5201a  Initial commit (muhsinking)
5b21db1  Minor update (muhsinking)
16bc931  Minor update (muhsinking)
a2248a9  Finish basic steps (muhsinking)
446a358  Update formatting (muhsinking)
2f34e9f  Fix formatting (muhsinking)
f96e457  Reorder placeholder explanations (muhsinking)
7987dbd  Normalize cluster capitalization (muhsinking)
90127fa  Remove link to slurm example repo (muhsinking)
28812c0  Specify which node to run test on (muhsinking)
307f55c  Add requirements section (muhsinking)
3f6451a  Add title to pytorch code sample (muhsinking)
0875e3c  Fix step numbers, remove sudo (muhsinking)
19364f0  Merge branch 'main' into ic-slurm (muhsinking)
f3bb776  Remove step 4 (muhsinking)
8d68b82  Update installation instructions (muhsinking)
39b3d1d  Remove step 4 (muhsinking)
c36145d  Fix steps (muhsinking)
594460f  Merge branch 'main' into ic-slurm (muhsinking)
af419d1  Merge branch 'main' into ic-slurm (muhsinking)
2b286f5  Fix typo (muhsinking)
9df958a  Merge branch 'ic-slurm' of https://github.com/runpod/docs into ic-slurm (muhsinking)

---
title: Deploy with SLURM
sidebar_position: 4
description: Learn how to deploy an Instant Cluster and set up SLURM for distributed job scheduling.
---

# Deploy an Instant Cluster with SLURM

This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on RunPod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.

Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.

## Requirements

- You've created a [RunPod account](https://www.runpod.io/console/home) and funded it with sufficient credits.
- You have basic familiarity with the Linux command line.
- You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).

## Step 1: Deploy an Instant Cluster

1. Open the [Instant Clusters page](https://www.runpod.io/console/cluster) on the RunPod web interface.
2. Click **Create Cluster**.
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (RunPod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Clone the SLURM demo into each Pod

To connect to a Pod:

1. On the Instant Clusters page, click on the cluster you created to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.

**On each Pod:**

1. Click **Connect**, then click **Web Terminal**.
2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:

```bash
git clone https://github.com/pandyamarut/slurm_example.git
cd slurm_example
```

3. Run this command to make the scripts executable:

```bash
chmod +x create_gres_conf.sh create_slurm_conf.sh install.sh setup.sh test_batch.sh
```

## Step 3: Overview of SLURM demo scripts

The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:

- `setup.sh`: Prepares the Pod environment with necessary dependencies and utilities for SLURM and distributed computing.
- `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
- `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
- `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
- `test_batch.sh`: A sample SLURM job script for testing cluster functionality (see the illustrative sketch after this list).
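
For orientation, a minimal SLURM batch script looks something like the sketch below. This is an illustrative example only, not necessarily the exact contents of `test_batch.sh`; the job name, resource requests, and output pattern are assumptions based on the output file described in Step 9.

```bash
#!/bin/bash
#SBATCH --job-name=test_simple        # job name (assumed)
#SBATCH --output=test_simple_%j.out   # %j expands to the SLURM job ID
#SBATCH --nodes=2                     # run on both nodes in the cluster
#SBATCH --gres=gpu:1                  # request one GPU per node

# Print the hostname and visible GPUs from every allocated node.
srun hostname
srun nvidia-smi -L
```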

## Step 4: Run the setup script

Run `setup.sh` **on each Pod** to prepare the development environment.

```bash
./setup.sh
```

This script prepares the Pod development environment by installing Linux dependencies and setting up a Python virtual environment with `uv` (a [Python package manager](https://github.com/astral-sh/uv)) and Python 3.11.
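
For reference, the general shape of that environment setup might look like the commands below. This is a hedged sketch of a typical `uv` workflow, not the actual contents of `setup.sh`.

```bash
# Install uv using its official installer, then create and activate
# a Python 3.11 virtual environment.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.11
source .venv/bin/activate
```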

If you want to set up GitHub on your Pods, uncomment the last line of `setup.sh`, which runs the following command:

```bash
./scripts/setup_github.sh "<YOUR_GITHUB_EMAIL>" "<YOUR_NAME>"
```

## Step 5: Get the hostname and IP address for each Pod

Before running the installation script, you'll need to get the hostname and IP address for each Pod.

**On each Pod:**

1. Run this command to get the IP address of the node:

```bash
echo $NODE_ADDR
```

If this outputs `10.65.0.2`, this is the **primary node**. If it outputs `10.65.0.3`, this is the **secondary node**.

2. Run this command to get the Pod's hostname:

```bash
echo $HOSTNAME
```

This should output a string of random numbers and letters, similar to:

```bash
4f653f31b496
```

3. Make a note of the hostname for the primary (`$NODE_ADDR` = `10.65.0.2`) and secondary (`$NODE_ADDR` = `10.65.0.3`) nodes.
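
A quick way to see both values at once on each Pod (the secondary hostname below is hypothetical):

```bash
echo "$HOSTNAME -> $NODE_ADDR"
# Example notes to keep for the next step:
#   primary:   4f653f31b496 (10.65.0.2)
#   secondary: 9c2d1e0a7b58 (10.65.0.3)
```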

## Step 6: Install SLURM on each Pod

Now run the installation script **on each Pod**:

```bash
./install.sh "[MUNGE_SECRET_KEY]" [HOSTNAME_PRIMARY] [HOSTNAME_SECONDARY] 10.65.0.2 10.65.0.3
```

Replace:

- `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes and must be identical across all Pods in your cluster.
- `[HOSTNAME_PRIMARY]` with the hostname of the primary node (`$NODE_ADDR` = `10.65.0.2`).
- `[HOSTNAME_SECONDARY]` with the hostname of the secondary node (`$NODE_ADDR` = `10.65.0.3`).

This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
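
For example, using the hostnames gathered in Step 5, the call might look like this (the secret key and the secondary hostname below are made up):

```bash
# Run the same command, with identical arguments, on both Pods.
./install.sh "a-long-random-string" 4f653f31b496 9c2d1e0a7b58 10.65.0.2 10.65.0.3
```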

## Step 7: Start SLURM services

1. **On the primary node** (`$NODE_ADDR` = `10.65.0.2`), start the SLURM controller daemon (the primary node runs both SLURM services):

```bash
sudo slurmctld -D
```

2. Use the web interface to open a second terminal **on the primary node** and run:

```bash
sudo slurmd -D
```

3. **On the secondary node** (`$NODE_ADDR` = `10.65.0.3`), run:

```bash
sudo slurmd -D
```

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal.
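
If you'd prefer not to dedicate a terminal to each service, one alternative (not part of the demo scripts, just a common pattern) is to detach the daemons with `nohup` and send their output to log files:

```bash
# On the primary node: controller plus worker, detached from the terminal.
nohup sudo slurmctld -D > /var/log/slurmctld.log 2>&1 &
nohup sudo slurmd -D > /var/log/slurmd.log 2>&1 &

# On the secondary node: worker only.
nohup sudo slurmd -D > /var/log/slurmd.log 2>&1 &
```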

## Step 8: Test your SLURM cluster

1. Run this command **on the primary node** to check the status of your nodes:

```bash
sinfo
```

You should see output showing both nodes in your cluster, with a state of "idle" if everything is working correctly.
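
The output should look roughly like the following; the partition name, time limit, and node list shown here are illustrative:

```bash
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      2   idle 4f653f31b496,9c2d1e0a7b58
```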

2. Run this command to test GPU availability across both nodes:

```bash
srun --nodes=2 --gres=gpu:1 nvidia-smi -L
```

This command should list one GPU from each of your two nodes.
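
Expect two lines in total, one from each node, along the lines of (UUIDs shortened and made up):

```bash
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-1a2b3c4d-...)
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-5e6f7a8b-...)
```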

## Step 9: Submit the SLURM job script

Run the following command **on the primary node** (`$NODE_ADDR` = `10.65.0.2`) to submit the test job script and confirm that your cluster is working properly:

```bash
sbatch test_batch.sh
```

Check the output file created by the test (`test_simple_[JOBID].out`) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.
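
For example, you can watch the job with `squeue` and print the output file once it finishes; the job ID below is a placeholder for the ID that `sbatch` reports:

```bash
squeue                   # shows the job while it is pending or running
cat test_simple_42.out   # replace 42 with your job ID
```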

## Step 10: Clean up

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing), under the **Cluster** tab.

:::

## Next steps

Now that you've successfully deployed and tested a SLURM cluster on RunPod, you can:

- **Adapt your own distributed workloads** to run using SLURM job scripts.
- **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
- **Optimize performance** by experimenting with different distributed training strategies.