This suite of scripts is designed to perform stress testing and functional checks on GPUs within a containerized environment, typically managed by Slurm. It includes node-local hardware tests (GPU burn, NCCL bandwidth) and a multi-node distributed PyTorch training test.
- `Dockerfile`: Defines the container image with the necessary dependencies, including PyTorch, CUDA, NCCL, and tools such as `git`, `make`, and `gcc` required by the test scripts.
- `requirements.txt`: Lists Python dependencies (e.g., `torch`, `torchvision`).
- `src/gpu_probe/runner.py`: Python script that orchestrates the node-local tests. It executes `run_nccl.sh` and `run_gpu_burn.sh`.
- `src/gpu_probe/train.py`: A PyTorch script that performs a simple distributed training routine (CIFAR10 dataset with a ResNet50 model) using `DistributedDataParallel` (DDP) to verify multi-GPU/multi-node training functionality.
- `run_nccl.sh`: A shell script that clones the NVIDIA `nccl-tests` repository, builds the tests, and runs `all_reduce_perf` to measure inter-GPU communication bandwidth on a node.
- `run_gpu_burn.sh`: A shell script that clones the `gpu-burn` utility, builds it, and runs it to stress-test GPUs for stability and thermal performance. The default duration is 30 seconds (a minimal sketch of this flow appears after this list).
- `entrypoint.sh`: The main entrypoint for the Docker container, which executes `src/gpu_probe/runner.py`.
- `submit_gpu_probe.sbatch`: An example Slurm batch script that demonstrates how to run the GPU probe suite. It includes steps for:
  - Running the node-local probes (`runner.py`) on the first allocated node.
  - Running the distributed training test (`train.py`) across all allocated nodes and GPUs using `torchrun`.
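As an illustration of the clone/build/run flow that `run_gpu_burn.sh` implements, here is a minimal sketch; the repository URL, working directory, and error handling are assumptions rather than the actual contents of the script.

```bash
#!/usr/bin/env bash
# Minimal sketch of a gpu-burn run (assumed layout; see run_gpu_burn.sh for the real script).
set -euo pipefail

WORKDIR=/tmp/gpu-burn          # assumed scratch location
BURN_SECONDS=30                # default duration per this README

# Clone and build the gpu-burn utility (repository URL assumed).
git clone https://github.com/wilicc/gpu-burn.git "$WORKDIR"
cd "$WORKDIR"
make

# Stress the GPUs; a non-zero exit code indicates an error or an unstable GPU.
./gpu_burn "$BURN_SECONDS"
```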
The following are required to run the suite:

- Docker: For building the container image.
- Slurm Workload Manager: For submitting and managing the job on a cluster.
- Pyxis/Enroot (or a similar Slurm container integration): For running Docker images with Slurm (`srun --container-image=...`).
- NVIDIA GPUs: Accessible to the Slurm cluster nodes.
- Container Registry: (Optional, if not using local image files such as `.sqsh`) To host the built Docker image.
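As a quick sanity check of the container integration, it can help to run a trivial command through Pyxis/Enroot before submitting the full probe. The snippet below is only an illustration: the partition name and image reference are placeholders, and the exact registry-separator syntax accepted by `--container-image` depends on your Pyxis/Enroot version.

```bash
# Pull the image via Pyxis/Enroot and confirm the GPUs are visible inside the container.
# Partition name and image reference are placeholders.
srun --partition=gpu --gres=gpu:1 \
  --container-image=your-registry/your-repo/gpu-probe:latest \
  nvidia-smi
```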
1. Navigate to the `gpu-probe` directory.

2. Build the Docker image:

   ```bash
   docker build -t your-registry/your-repo/gpu-probe:latest -f Dockerfile .
   ```

   Replace `your-registry/your-repo/gpu-probe:latest` with your desired image name and tag.

3. Push the image (if using a registry): If you are using a central container registry, push the built image (if your cluster instead uses local `.sqsh` files, see the conversion example after step 5):

   ```bash
   docker push your-registry/your-repo/gpu-probe:latest
   ```
4. Update the sbatch script: Edit `submit_gpu_probe.sbatch` (an illustrative sketch of the relevant header and variables follows step 5):

   - Set the `DOCKER_IMAGE` variable to the correct path of your image in the registry (e.g., `DOCKER_IMAGE=your-registry/your-repo/gpu-probe:latest`).
   - If you are using a locally converted Enroot image (`.sqsh` file), update `DOCKER_IMAGE` to the absolute path of the `.sqsh` file on the cluster nodes (e.g., `DOCKER_IMAGE=/path/to/shared/gpu-probe.sqsh`).
   - Adjust the Slurm parameters (`--nodes`, `--gres=gpu:X`, `--mem`, `--time`, `--output`) as needed for your cluster and testing requirements.
   - Modify `TRAIN_ARGS` if you need different training parameters or a different shared `--data_path` for the CIFAR10 dataset. Ensure the chosen `--data_path` is on a shared filesystem accessible by all nodes under the same path.
5. Submit the Slurm job:

   ```bash
   sbatch submit_gpu_probe.sbatch
   ```
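If your cluster runs jobs from local Enroot squash files rather than pulling from a registry (step 3), the pushed image can be converted to a `.sqsh` file up front. This is a hedged example: the image reference and output path are placeholders, and the exact `docker://` URI syntax (notably the `#` registry separator) depends on your Enroot version.

```bash
# Convert the Docker image into an Enroot squash file on a shared filesystem
# (image reference and output path are placeholders).
enroot import --output /path/to/shared/gpu-probe.sqsh \
  docker://your-registry#your-repo/gpu-probe:latest
```

For step 4, the fragment below illustrates the kind of header and variables being edited. It is a sketch rather than the actual contents of `submit_gpu_probe.sbatch`, and every directive value shown (node count, GPU count, memory, time limit, job name) is an assumption to adapt to your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-probe          # illustrative values only
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --mem=64G
#SBATCH --time=01:00:00
#SBATCH --output=/root/gpu_probe_%j.log

# Registry image, or an absolute path to a .sqsh file on a shared filesystem.
DOCKER_IMAGE=your-registry/your-repo/gpu-probe:latest

# Arguments forwarded to train.py; --data_path must be on a shared filesystem
# visible at the same path on every node (other argument names are assumptions).
TRAIN_ARGS="--data_path /shared/datasets/cifar10 --epochs 1"
```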
The `submit_gpu_probe.sbatch` script orchestrates the following (an illustrative sketch of this flow appears after the list):

- Node-Local Probe (on the first node):
  - `srun` launches the container on the first allocated node.
  - The container's `entrypoint.sh` executes `python -m gpu_probe.runner --test`.
  - `runner.py` then executes:
    - `run_nccl.sh`: Clones, builds, and runs `nccl-tests` (`all_reduce_perf` on 1 GPU by default). The output and the parsed bandwidth are logged.
    - `run_gpu_burn.sh`: Clones, builds, and runs `gpu-burn`.
  - The exit code of `runner.py` (`PROBE_RC`) indicates success (0) or failure (1) of these local tests.
- Multi-Node Distributed Training:
  - `srun` launches `torchrun` across all allocated nodes and GPUs.
  - `torchrun` executes `src/gpu_probe/train.py` for a DDP training session.
  - The script downloads CIFAR10 to the specified `--data_path` (the master rank downloads, the other ranks wait) and trains a ResNet50 model for a few epochs/batches.
  - The exit code of this step (`TRAIN_RC`) indicates success or failure.
- Final Result:
  - The sbatch script checks `PROBE_RC` and `TRAIN_RC` to determine the overall success or failure of the probe.
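The fragment below sketches how this orchestration might look inside the sbatch script, continuing the header sketch shown earlier. It is not the actual script: the `srun` options, the rendezvous settings, and all names other than `DOCKER_IMAGE`, `TRAIN_ARGS`, `PROBE_RC`, and `TRAIN_RC` are assumptions.

```bash
# --- Node-local probe on the first allocated node (illustrative) ---
srun --nodes=1 --ntasks=1 \
  --container-image="$DOCKER_IMAGE" \
  python -m gpu_probe.runner --test   # the real script may rely on the image's entrypoint.sh instead
PROBE_RC=$?

# --- Multi-node DDP training across all allocated nodes/GPUs (illustrative) ---
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun --nodes="$SLURM_NNODES" --ntasks-per-node=1 \
  --container-image="$DOCKER_IMAGE" \
  torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node="$SLURM_GPUS_ON_NODE" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    -m gpu_probe.train $TRAIN_ARGS
TRAIN_RC=$?

# --- Final result: fail the job if either stage failed ---
if [ "$PROBE_RC" -ne 0 ] || [ "$TRAIN_RC" -ne 0 ]; then
  echo "GPU probe FAILED (probe=$PROBE_RC, train=$TRAIN_RC)"
  exit 1
fi
echo "GPU probe PASSED"
```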
Output from a run ends up in a few places:

- Slurm Output: The main log file is specified by `--output` in `submit_gpu_probe.sbatch` (default: `/root/gpu_probe_%j.log`). This captures stdout/stderr from the sbatch script itself and from the `srun` commands.
- Script Logs: `runner.py` and `train.py` use Python's `logging` module, which prints to stdout/stderr and is therefore captured in the Slurm output file.
- NCCL Test Output: `run_nccl.sh` saves the raw output of `all_reduce_perf` to `/tmp/nccl.txt` inside the container and also prints its contents to stdout.
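To pull the measured bandwidth out of a finished run, you can grep the Slurm log for the nccl-tests summary lines; this assumes the standard `all_reduce_perf` output format and uses a hypothetical job id in the log path.

```bash
# Show the per-size busbw columns and the summary line from a finished run
# (replace 12345 with the actual Slurm job id).
grep -E "busbw|Avg bus bandwidth" /root/gpu_probe_12345.log
```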
Common customizations:

- GPU Burn Duration: Modify the `gpu_burn 30` command in `run_gpu_burn.sh` (30 is the duration in seconds).
- NCCL Test Parameters: Adjust the `all_reduce_perf` arguments in `run_nccl.sh` (e.g., `-b`, `-e`, `-g`); see the example after this list. The bandwidth check in `run_nccl.sh` is currently commented out for diagnostic purposes but can be re-enabled and its threshold adjusted.
- Training Parameters: Modify `TRAIN_ARGS` in `submit_gpu_probe.sbatch` to change the number of epochs, batch count, learning rate, or data path for `train.py`.
- Container Base Image: The `Dockerfile` uses `nvcr.io/nvidia/pytorch:24.07-py3`. This can be changed if a different PyTorch/CUDA version is needed.
- Python Dependencies: Add them to `requirements.txt` and rebuild the Docker image.
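As an example of tuning the NCCL test, the invocation below sweeps message sizes across several GPUs. The specific sizes and GPU count are assumptions; the flags shown (`-b` min bytes, `-e` max bytes, `-f` multiplication factor, `-g` GPUs per thread) are standard `nccl-tests` options.

```bash
# Sweep all_reduce from 8 bytes to 256 MiB, doubling each step, on 4 GPUs
# (values are illustrative; adjust to the node under test).
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
```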