# Work smarter, not harder
> **Important**
>
> The documentation below is a work in progress. Please feel free to provide input / suggest changes!
There are two main, distinct components of `ezpz`, both designed to make life easy:

- Python library (`import ezpz`)
- Shell utilities (`ezpz_*`)
We provide a complete, entirely self-contained example in `docs/example.md` that walks through:

- Setting up a suitable python environment + installing `ezpz` into it
- Launching a (simple) distributed training job across all available resources in your {slurm, PBS} job allocation
The Shell Utilities can be roughly broken up further into two main components. We provide a variety of helper functions designed to make your life easier when working with job schedulers (e.g. PBS Pro @ ALCF or slurm elsewhere).
All of these functions are prefixed with `ezpz_`[^1].
To use these, we can source the file directly via:

```bash
export PBS_O_WORKDIR=$(pwd)  # if on ALCF
source /dev/stdin <<< $(curl 'https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh')
```
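Once sourced, the `ezpz_*` helpers are available in your current shell. A typical next step (a sketch of common usage, not the only entry point) is to call `ezpz_setup_env`, which, per Table 1 below, wraps the python- and job-setup helpers:

```bash
# Set up python (conda + venv) and gather job / scheduler info in one call
ezpz_setup_env
```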
We would like to write our application in such a way that it is able to take full advantage of the resources allocated by the job scheduler. That is to say, we want to have a single script with the ability to dynamically launch python applications across any number of accelerators on any of the systems under consideration.
In order to do this, there is some basic setup and information gathering that needs to occur.
In particular, we need mechanisms for:
- Setting up a python environment
- Determining what system / machine we're on
  + what job scheduler we're using (e.g. PBS Pro @ ALCF or slurm elsewhere)
- Determining how many nodes have been allocated in the current job (`NHOSTS`)
  + how many accelerators exist on each of these nodes (`NGPU_PER_HOST`)
This allows us to calculate the total number of accelerators (GPUs) as:

$$\mathrm{NGPUS} = \mathrm{NHOSTS} \times \mathrm{NGPU\_PER\_HOST}$$

where `NHOSTS` is the number of allocated nodes and `NGPU_PER_HOST` is the number of accelerators on each of those nodes.
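For instance, on a PBS system this arithmetic can be done directly from the hostfile provided by the scheduler. The snippet below is a rough sketch: it assumes `${PBS_NODEFILE}` is set, and the per-node accelerator count (`4` here, e.g. four A100s per node on Polaris) is only an illustration.

```bash
# One host per line in the PBS-provided nodefile
NHOSTS=$(wc -l < "${PBS_NODEFILE}")
# Accelerators per node is machine-dependent (e.g. 4 on Polaris)
NGPU_PER_HOST=4
# Total accelerators across the allocation
NGPUS=$(( NHOSTS * NGPU_PER_HOST ))
echo "NHOSTS=${NHOSTS}, NGPU_PER_HOST=${NGPU_PER_HOST}, NGPUS=${NGPUS}"
```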
With this we have everything we need to build the appropriate {`mpi{run,exec}`, `srun`} command for launching our python application across them.
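As a concrete sketch of what such a command might look like (exact flags vary by system and MPI distribution; `app.py` is a placeholder for your application):

```bash
# PBS Pro (e.g. ALCF), using mpiexec
mpiexec -n "${NGPUS}" --ppn "${NGPU_PER_HOST}" --hostfile "${PBS_NODEFILE}" python3 app.py

# Slurm, using srun
srun -N "${NHOSTS}" --ntasks-per-node="${NGPU_PER_HOST}" python3 app.py
```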
Now, there are a few functions in particular worth elaborating on.
| Function | Description |
|---|---|
| `ezpz_setup_env` | Wrapper around `ezpz_setup_python && ezpz_setup_job` |
| `ezpz_setup_job` | Determine {`NGPUS`, `NGPU_PER_HOST`, `NHOSTS`}, build launch command alias |
| `ezpz_setup_python` | Wrapper around `ezpz_setup_conda && ezpz_setup_venv_from_conda` |
| `ezpz_setup_conda` | Find and activate the appropriate conda module to load[^2] |
| `ezpz_setup_venv_from_conda` | From `${CONDA_NAME}`, build or activate the virtual env located in `venvs/${CONDA_NAME}/` |

Table 1: Shell Functions
> **Warning**
>
> Some of the `ezpz_*` functions (e.g. `ezpz_setup_python`) will try to create / look for certain directories.
>
> In an effort to be explicit, these directories will be defined relative to a `WORKING_DIR` (e.g. `"${WORKING_DIR}/venvs/"`).
>
> This `WORKING_DIR` will be assigned to the first non-zero match found below (sketched in shell form just after this callout):
>
> 1. `PBS_O_WORKDIR`: If found in the environment, paths will be relative to this
> 2. `SLURM_SUBMIT_DIR`: Next in line. If not @ ALCF, maybe using slurm…
> 3. `$(pwd)`: Otherwise, no worries. Use your actual working directory.
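In shell terms, this precedence amounts to something like the following one-liner (a sketch of the logic, not the exact implementation):

```bash
# PBS_O_WORKDIR if set, else SLURM_SUBMIT_DIR, else the current directory
WORKING_DIR="${PBS_O_WORKDIR:-${SLURM_SUBMIT_DIR:-$(pwd)}}"
```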
## `ezpz_setup_python`

This will:
1. Automatically load and activate `conda` using the `ezpz_setup_conda` function.

   How this is done, in practice, varies from machine to machine:

   - ALCF[^3]: Automatically load the most recent `conda` module and activate the base environment.
   - Frontier: Load the appropriate AMD modules (e.g. `rocm`, `RCCL`, etc.), and activate the base `conda` environment.
   - Perlmutter: Load the appropriate `pytorch` module and activate its environment.
   - Unknown: In this case, we will look for a `conda`, `mamba`, or `micromamba` executable and, if found, use that to activate the base environment.
   > **Tip: Using your own conda**
   >
   > If you are already in a conda environment when calling `ezpz_setup_python`, then it will try to use this instead.
   >
   > For example, if you have a custom conda env at `~/conda/envs/custom`, then this would bootstrap the custom conda environment and create the virtual env in `venvs/custom/`.
2. Build (or activate, if found) a virtual environment on top of (the active) base `conda` environment.

   By default, it will try looking in:

   - `$PBS_O_WORKDIR`, otherwise
   - `${SLURM_SUBMIT_DIR}`, otherwise
   - `$(pwd)`

   for a nested folder named `"venvs/${CONDA_NAME}"`.

   If this doesn't exist, it will attempt to create a new virtual environment at this location using:

   ```bash
   python3 -m venv venvs/${CONDA_NAME} --system-site-packages
   ```

   (where we've pulled in the `--system-site-packages` from conda; see the example session after this list).
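Put together, a session on a PBS system might look roughly like the following (a sketch; the resulting paths depend on your `WORKING_DIR` and on which conda environment gets activated):

```bash
export PBS_O_WORKDIR=$(pwd)     # if on ALCF
source src/ezpz/bin/utils.sh    # path inside a local clone; or source via curl, as shown above
ezpz_setup_python
which python3                   # should now point into venvs/${CONDA_NAME}/
```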
## `ezpz_setup_job`

Now that we are in a suitable python environment, we need to construct the command that we will use to run python on each of our accelerators.

To do this, we need a few things:

- What machine we're on (and what scheduler it is using, i.e. {PBS, SLURM})
- How many nodes are available in our active job
- How many GPUs are on each of those nodes
- What type of GPUs they are
With this information, we can then use `mpi{exec,run}` or `srun` to launch python across all of our accelerators.
Again, how this is done will vary from machine to machine and will depend on the job scheduler in use.

To identify where we are, we look at our `$(hostname)` and see if we're running on one of the known machines (a rough sketch of this pattern matching follows the list below):
- ALCF[^4]: Using PBS Pro via `qsub` and `mpiexec` / `mpirun`.
  - Aurora: `x4*` (or `aurora*` on login nodes)
  - Sunspot: `x1*` (or `uan*`)
  - Sophia: `sophia-*`
  - Polaris / Sirius: `x3*`
    - to determine between the two, we look at `"${PBS_O_HOST}"`
- OLCF / NERSC: Using Slurm via `sbatch` / `srun`.
  - `frontier*`: Frontier, using Slurm
  - `nid*`: Perlmutter, using Slurm
- Unknown machine: If `$(hostname)` does not match one of these patterns, we assume that we are running on an unknown machine and will try to use `mpirun` as our generic launch command.
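In shell terms, this detection boils down to pattern matching on `$(hostname)`, roughly like the sketch below (the variable name is illustrative, not the actual implementation):

```bash
case "$(hostname)" in
  x4* | aurora*) machine="aurora"     ;;  # ALCF Aurora
  x1* | uan*)    machine="sunspot"    ;;  # ALCF Sunspot
  sophia-*)      machine="sophia"     ;;  # ALCF Sophia
  x3*)           machine="polaris"    ;;  # ALCF Polaris / Sirius (check "${PBS_O_HOST}")
  frontier*)     machine="frontier"   ;;  # OLCF Frontier
  nid*)          machine="perlmutter" ;;  # NERSC Perlmutter
  *)             machine="unknown"    ;;  # fall back to mpirun
esac
```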
Once we have this, we can:
1. Get `PBS_NODEFILE` from `$(hostname)`:

   - `ezpz_qsme_running`: For each (running) job owned by `${USER}`, print out both the jobid as well as a list of hosts the job is running on, e.g.:

     ```
     <jobid0> host00 host01 host02 host03 ...
     <jobid1> host10 host11 host12 host13 ...
     ...
     ```

   - `ezpz_get_pbs_nodefile_from_hostname`: Look for `$(hostname)` in the output from the above command to determine our `${PBS_JOBID}`.

     Once we've identified our `${PBS_JOBID}`, we then know the location of our `${PBS_NODEFILE}`, since they are named according to:

     ```bash
     jobid=$(ezpz_qsme_running | grep "$(hostname)" | awk '{print $1}')
     prefix=/var/spool/pbs/aux
     match=$(/bin/ls "${prefix}" | grep "${jobid}")
     hostfile="${prefix}/${match}"
     ```
2. Identify the number of available accelerators.

3. Launch and train across all your accelerators, using your favorite framework + backend combo (see the launch sketch below).
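Putting the pieces together: once the values above have been gathered (e.g. by `ezpz_setup_env` or `ezpz_setup_job`), they can be used to launch the distributed-training test module, `ezpz.test_dist`, listed in the source tree below. The following is a sketch for a PBS machine; the exact flags (or the name of the alias built by `ezpz_setup_job`) vary by system:

```bash
# Launch the ezpz distributed test across the full allocation (PBS / mpiexec sketch)
mpiexec -n "${NGPUS}" --ppn "${NGPU_PER_HOST}" --hostfile "${hostfile}" \
    python3 -m ezpz.test_dist
```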
`ezpz` simplifies the process of:
- Setting up + launching distributed training:
  - `import ezpz as ez`
  - `RANK = ez.setup_torch(backend=backend)` for `backend` $\in$ {`DDP`, `deepspeed`, `horovod`}
  - `RANK = ez.get_rank()`
  - `LOCAL_RANK = ez.get_local_rank()`
  - `WORLD_SIZE = ez.get_world_size()`

  (see `ezpz/dist.py` for more details)
- Using your favorite framework:
  - `framework=pytorch` + `backend={DDP, deepspeed, horovod}`
  - `framework=tensorflow` + `backend=horovod`
  - `ez.get_torch_device()`: {`cuda`, `xpu`, `mps`, `cpu`}
  - `ez.get_torch_backend()`: {`nccl`, `ccl`, `gloo`}

  2ez. (see frameworks for additional details)
- Writing device-agnostic code:
  - `ezpz.get_torch_device()`

    ```python
    >>> import ezpz as ez
    >>> DEVICE = ez.get_torch_device()
    >>> model = torch.nn.Linear(10, 10)
    >>> model.to(DEVICE)
    >>> x = torch.randn((10, 10), device=DEVICE)
    >>> y = model(x)
    >>> y.device
    device(type='mps', index=0)
    ```
- Using `wandb`:
  - `ez.setup_wandb(project_name='ezpz')`
- Full support for any {`device` + `framework` + `backend`}:
  - device: {`GPU`, `XPU`, `MPS`, `CPU`}
  - framework: {`torch`, `deepspeed`, `horovod`, `tensorflow`}
  - backend: {`DDP`, `deepspeed`, `horovod`}
To install[^5]:

```bash
python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
```
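As a quick (optional) sanity check that the install worked and that `ezpz` is importable from your virtual environment:

```bash
# Print where the installed package lives
python3 -c 'import ezpz; print(ezpz.__file__)'
```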
- `ezpz/src/ezpz/`
  - `bin/`:
    - `utils.sh`: Shell utilities for `ezpz`
  - `conf/`:
    - `config.yaml`: Default `TrainConfig` object
    - `ds_config.json`: DeepSpeed configuration
  - `log/`: Logging configuration
  - `__about__.py`: Version information
  - `__init__.py`: Main module
  - `__main__.py`: Entry point
  - `configs.py`: Configuration module
  - `cria.py`: Baby Llama
  - `dist.py`: Distributed training module
  - `history.py`: History module
  - `jobs.py`: Jobs module
  - `model.py`: Model module
  - `plot.py`: Plot module
  - `profile.py`: Profile module
  - `runtime.py`: Runtime module
  - `test.py`: Test module
  - `test_dist.py`: Distributed training test module
  - `train.py`: Train module
  - `trainer.py`: Trainer module
  - `utils.py`: Utility module
```
/ezpz/src/ezpz/
├── bin/
│   ├── affinity.sh
│   ├── getjobenv
│   ├── savejobenv
│   ├── saveslurmenv
│   ├── setup.sh
│   ├── train.sh
│   └── utils.sh
├── conf/
│   ├── hydra/
│   │   └── job_logging/
│   │       ├── colorlog1.yaml
│   │       ├── custom.yaml
│   │       └── enrich.yaml
│   ├── logdir/
│   │   └── default.yaml
│   ├── config.yaml
│   ├── ds_config.json
│   └── ds_config.yaml
├── log/
│   ├── conf/
│   │   └── hydra/
│   │       └── job_logging/
│   │           └── enrich.yaml
│   ├── __init__.py
│   ├── __main__.py
│   ├── config.py
│   ├── console.py
│   ├── handler.py
│   ├── style.py
│   ├── test.py
│   └── test_log.py
├── __about__.py
├── __init__.py
├── __main__.py
├── configs.py
├── cria.py
├── dist.py
├── history.py
├── jobs.py
├── loadjobenv.py
├── model.py
├── plot.py
├── profile.py
├── runtime.py
├── savejobenv.py
├── test.py
├── test_dist.py
├── train.py
├── trainer.py
└── utils.py
```
## Footnotes

[^1]: Plus, this is useful for tab-completions in your shell, e.g.:

    ```
    $ ezpz_<TAB>
    ezpz_check_and_kill_if_running
    ezpz_get_dist_launch_cmd
    ezpz_get_job_env
    --More--
    ```

[^2]: This is system dependent. See `ezpz_setup_conda`.

[^3]: Any of {Aurora, Polaris, Sophia, Sunspot, Sirius}.

[^4]: At ALCF, if our `$(hostname)` starts with `x*`, we're on a compute node.

[^5]: The `--require-virtualenv` flag isn't strictly required, but I highly recommend always trying to work within a virtual environment, when possible.