# Work smarter, not harder
> **Important**
>
> The documentation below is a work in progress. Please feel free to provide input / suggest changes!
There are two main, distinct components of `ezpz`, both designed to make life easy:

- Python library (`import ezpz`)
- Shell utilities (`ezpz_*`)
We provide a complete, entirely self-contained example in `docs/example.md` that walks through:

- Setting up a suitable python environment + installing `ezpz` into it
- Launching a (simple) distributed training job across all available resources in your {slurm, PBS} job allocation
The Shell Utilities can be roughly broken up further into two main components. We provide a variety of helper functions designed to make your life easier when working with job schedulers (e.g. PBS Pro @ ALCF or slurm elsewhere).
All of these functions are prefixed with `ezpz_`[^1].
To use these, we can source the file directly via:

```bash
export PBS_O_WORKDIR=$(pwd)  # if on ALCF
source /dev/stdin <<< $(curl 'https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh')
```
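Once sourced, the `ezpz_*` helpers are available in your current shell. A typical next step (a sketch of common usage, not the only entry point) is to call `ezpz_setup_env`, which, per Table 1 below, wraps the python- and job-setup helpers:

```bash
# Set up python (conda + venv) and gather job / scheduler info in one call
ezpz_setup_env
```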
We would like to write our application in such a way that it is able to take full advantage of the resources allocated by the job scheduler. That is to say, we want to have a single script with the ability to dynamically launch python applications across any number of accelerators on any of the systems under consideration.
In order to do this, there is some basic setup and information gathering that needs to occur.
In particular, we need mechanisms for:
- Setting up a python environment
- Determining what system / machine we're on
  + what job scheduler we're using (e.g. PBS Pro @ ALCF or slurm elsewhere)
- Determining how many nodes have been allocated in the current job (`NHOSTS`)
  + how many accelerators exist on each of these nodes (`NGPU_PER_HOST`)
This allows us to calculate the total number of accelerators (GPUs) as:

$$\mathrm{NGPUS} = \mathrm{NHOSTS} \times \mathrm{NGPU\_PER\_HOST}$$

where `NHOSTS` is the number of allocated nodes and `NGPU_PER_HOST` is the number of accelerators on each of those nodes.
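For instance, on a PBS system this arithmetic can be done directly from the hostfile provided by the scheduler. The snippet below is a rough sketch: it assumes `${PBS_NODEFILE}` is set, and the per-node accelerator count (`4` here, e.g. four A100s per node on Polaris) is only an illustration.

```bash
# One host per line in the PBS-provided nodefile
NHOSTS=$(wc -l < "${PBS_NODEFILE}")
# Accelerators per node is machine-dependent (e.g. 4 on Polaris)
NGPU_PER_HOST=4
# Total accelerators across the allocation
NGPUS=$(( NHOSTS * NGPU_PER_HOST ))
echo "NHOSTS=${NHOSTS}, NGPU_PER_HOST=${NGPU_PER_HOST}, NGPUS=${NGPUS}"
```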
With this we have everything we need to build the appropriate {`mpi{run,exec}`, `srun`} command for launching our python application across them.
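As a concrete sketch of what such a command might look like (exact flags vary by system and MPI distribution; `app.py` is a placeholder for your application):

```bash
# PBS Pro (e.g. ALCF), using mpiexec
mpiexec -n "${NGPUS}" --ppn "${NGPU_PER_HOST}" --hostfile "${PBS_NODEFILE}" python3 app.py

# Slurm, using srun
srun -N "${NHOSTS}" --ntasks-per-node="${NGPU_PER_HOST}" python3 app.py
```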
Now, there are a few functions in particular worth elaborating on.
| Function | Description |
|---|---|
| `ezpz_setup_env` | Wrapper around `ezpz_setup_python && ezpz_setup_job` |
| `ezpz_setup_job` | Determine {`NGPUS`, `NGPU_PER_HOST`, `NHOSTS`}, build launch command alias |
| `ezpz_setup_python` | Wrapper around `ezpz_setup_conda && ezpz_setup_venv_from_conda` |
| `ezpz_setup_conda` | Find and activate the appropriate conda module to load[^2] |
| `ezpz_setup_venv_from_conda` | From `${CONDA_NAME}`, build or activate the virtual env located in `venvs/${CONDA_NAME}/` |

Table 1: Shell Functions
> **Warning**
>
> Some of the `ezpz_*` functions (e.g. `ezpz_setup_python`) will try to create / look for certain directories.
>
> In an effort to be explicit, these directories will be defined relative to a `WORKING_DIR` (e.g. `"${WORKING_DIR}/venvs/"`).
>
> This `WORKING_DIR` will be assigned to the first non-zero match found below (sketched in shell form just after this callout):
>
> 1. `PBS_O_WORKDIR`: If found in the environment, paths will be relative to this
> 2. `SLURM_SUBMIT_DIR`: Next in line. If not @ ALCF, maybe using slurm…
> 3. `$(pwd)`: Otherwise, no worries. Use your actual working directory.
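In shell terms, this precedence amounts to something like the following one-liner (a sketch of the logic, not the exact implementation):

```bash
# PBS_O_WORKDIR if set, else SLURM_SUBMIT_DIR, else the current directory
WORKING_DIR="${PBS_O_WORKDIR:-${SLURM_SUBMIT_DIR:-$(pwd)}}"
```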
## `ezpz_setup_python`

This will:
1. Automatically load and activate `conda` using the `ezpz_setup_conda` function.

   How this is done, in practice, varies from machine to machine:

   - ALCF[^3]: Automatically load the most recent `conda` module and activate the base environment.
   - Frontier: Load the appropriate AMD modules (e.g. `rocm`, `RCCL`, etc.), and activate the base `conda` environment.
   - Perlmutter: Load the appropriate `pytorch` module and activate its environment.
   - Unknown: In this case, we will look for a `conda`, `mamba`, or `micromamba` executable and, if found, use that to activate the base environment.
   > **Tip: Using your own conda**
   >
   > If you are already in a conda environment when calling `ezpz_setup_python`, then it will try to use this instead.
   >
   > For example, if you have a custom conda env at `~/conda/envs/custom`, then this would bootstrap the custom conda environment and create the virtual env in `venvs/custom/`.
2. Build (or activate, if found) a virtual environment on top of (the active) base `conda` environment.

   By default, it will try looking in:

   - `$PBS_O_WORKDIR`, otherwise
   - `${SLURM_SUBMIT_DIR}`, otherwise
   - `$(pwd)`

   for a nested folder named `"venvs/${CONDA_NAME}"`.

   If this doesn't exist, it will attempt to create a new virtual environment at this location using:

   ```bash
   python3 -m venv venvs/${CONDA_NAME} --system-site-packages
   ```

   (where we've pulled in the `--system-site-packages` from conda; see the example session after this list).
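Put together, a session on a PBS system might look roughly like the following (a sketch; the resulting paths depend on your `WORKING_DIR` and on which conda environment gets activated):

```bash
export PBS_O_WORKDIR=$(pwd)     # if on ALCF
source src/ezpz/bin/utils.sh    # path inside a local clone; or source via curl, as shown above
ezpz_setup_python
which python3                   # should now point into venvs/${CONDA_NAME}/
```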
## `ezpz_setup_job`

Now that we are in a suitable python environment, we need to construct the command that we will use to run python on each of our accelerators.

To do this, we need a few things:

- What machine we're on (and what scheduler it is using, i.e. {PBS, SLURM})
- How many nodes are available in our active job
- How many GPUs are on each of those nodes
- What type of GPUs they are
With this information, we can then use `mpi{exec,run}` or `srun` to launch python across all of our accelerators.
Again, how this is done will vary from machine to machine and will depend on the job scheduler in use.

To identify where we are, we look at our `$(hostname)` and see if we're running on one of the known machines (a rough sketch of this pattern matching follows the list below):
- ALCF[^4]: Using PBS Pro via `qsub` and `mpiexec` / `mpirun`.
  - Aurora: `x4*` (or `aurora*` on login nodes)
  - Sunspot: `x1*` (or `uan*`)
  - Sophia: `sophia-*`
  - Polaris / Sirius: `x3*`
    - to determine between the two, we look at `"${PBS_O_HOST}"`
- OLCF / NERSC: Using Slurm via `sbatch` / `srun`.
  - `frontier*`: Frontier, using Slurm
  - `nid*`: Perlmutter, using Slurm
- Unknown machine: If `$(hostname)` does not match one of these patterns, we assume that we are running on an unknown machine and will try to use `mpirun` as our generic launch command.
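In shell terms, this detection boils down to pattern matching on `$(hostname)`, roughly like the sketch below (the variable name is illustrative, not the actual implementation):

```bash
case "$(hostname)" in
  x4* | aurora*) machine="aurora"     ;;  # ALCF Aurora
  x1* | uan*)    machine="sunspot"    ;;  # ALCF Sunspot
  sophia-*)      machine="sophia"     ;;  # ALCF Sophia
  x3*)           machine="polaris"    ;;  # ALCF Polaris / Sirius (check "${PBS_O_HOST}")
  frontier*)     machine="frontier"   ;;  # OLCF Frontier
  nid*)          machine="perlmutter" ;;  # NERSC Perlmutter
  *)             machine="unknown"    ;;  # fall back to mpirun
esac
```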
Once we have this, we can:
1. Get `PBS_NODEFILE` from `$(hostname)`:

   - `ezpz_qsme_running`: For each (running) job owned by `${USER}`, print out both the jobid as well as a list of hosts the job is running on, e.g.:

     ```
     <jobid0> host00 host01 host02 host03 ...
     <jobid1> host10 host11 host12 host13 ...
     ...
     ```

   - `ezpz_get_pbs_nodefile_from_hostname`: Look for `$(hostname)` in the output from the above command to determine our `${PBS_JOBID}`.

     Once we've identified our `${PBS_JOBID}`, we then know the location of our `${PBS_NODEFILE}`, since they are named according to:

     ```bash
     jobid=$(ezpz_qsme_running | grep "$(hostname)" | awk '{print $1}')
     prefix=/var/spool/pbs/aux
     match=$(/bin/ls "${prefix}" | grep "${jobid}")
     hostfile="${prefix}/${match}"
     ```
2. Identify the number of available accelerators.

3. Launch and train across all your accelerators, using your favorite framework + backend combo (see the launch sketch below).
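Putting the pieces together: once the values above have been gathered (e.g. by `ezpz_setup_env` or `ezpz_setup_job`), they can be used to launch the distributed-training test module, `ezpz.test_dist`, listed in the source tree below. The following is a sketch for a PBS machine; the exact flags (or the name of the alias built by `ezpz_setup_job`) vary by system:

```bash
# Launch the ezpz distributed test across the full allocation (PBS / mpiexec sketch)
mpiexec -n "${NGPUS}" --ppn "${NGPU_PER_HOST}" --hostfile "${hostfile}" \
    python3 -m ezpz.test_dist
```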
`ezpz` simplifies the process of:
- Setting up + launching distributed training:
  - `import ezpz as ez`
  - `RANK = ez.setup_torch(backend=backend)` for `backend` $\in$ {`DDP`, `deepspeed`, `horovod`}
  - `RANK = ez.get_rank()`
  - `LOCAL_RANK = ez.get_local_rank()`
  - `WORLD_SIZE = ez.get_world_size()`

  (see `ezpz/dist.py` for more details)
- Using your favorite framework:
  - `framework=pytorch` + `backend={DDP, deepspeed, horovod}`
  - `framework=tensorflow` + `backend=horovod`
  - `ez.get_torch_device()`: {`cuda`, `xpu`, `mps`, `cpu`}
  - `ez.get_torch_backend()`: {`nccl`, `ccl`, `gloo`}

  2ez. (see frameworks for additional details)
- Writing device-agnostic code:
  - `ezpz.get_torch_device()`

    ```python
    >>> import ezpz as ez
    >>> DEVICE = ez.get_torch_device()
    >>> model = torch.nn.Linear(10, 10)
    >>> model.to(DEVICE)
    >>> x = torch.randn((10, 10), device=DEVICE)
    >>> y = model(x)
    >>> y.device
    device(type='mps', index=0)
    ```
- Using `wandb`:
  - `ez.setup_wandb(project_name='ezpz')`
- Full support for any {`device` + `framework` + `backend`}:
  - device: {`GPU`, `XPU`, `MPS`, `CPU`}
  - framework: {`torch`, `deepspeed`, `horovod`, `tensorflow`}
  - backend: {`DDP`, `deepspeed`, `horovod`}
To install[^5]:

```bash
python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
```
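As a quick (optional) sanity check that the install worked and that `ezpz` is importable from your virtual environment:

```bash
# Print where the installed package lives
python3 -c 'import ezpz; print(ezpz.__file__)'
```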
- `ezpz/src/ezpz/`
  - `bin/`:
    - `utils.sh`: Shell utilities for `ezpz`
  - `conf/`:
    - `config.yaml`: Default `TrainConfig` object
    - `ds_config.json`: DeepSpeed configuration
  - `log/`: Logging configuration
  - `__about__.py`: Version information
  - `__init__.py`: Main module
  - `__main__.py`: Entry point
  - `configs.py`: Configuration module
  - `cria.py`: Baby Llama
  - `dist.py`: Distributed training module
  - `history.py`: History module
  - `jobs.py`: Jobs module
  - `model.py`: Model module
  - `plot.py`: Plot module
  - `profile.py`: Profile module
  - `runtime.py`: Runtime module
  - `test.py`: Test module
  - `test_dist.py`: Distributed training test module
  - `train.py`: Train module
  - `trainer.py`: Trainer module
  - `utils.py`: Utility module
```
/ezpz/src/ezpz/
├── bin/
│   ├── affinity.sh
│   ├── getjobenv
│   ├── savejobenv
│   ├── saveslurmenv
│   ├── setup.sh
│   ├── train.sh
│   └── utils.sh
├── conf/
│   ├── hydra/
│   │   └── job_logging/
│   │       ├── colorlog1.yaml
│   │       ├── custom.yaml
│   │       └── enrich.yaml
│   ├── logdir/
│   │   └── default.yaml
│   ├── config.yaml
│   ├── ds_config.json
│   └── ds_config.yaml
├── log/
│   ├── conf/
│   │   └── hydra/
│   │       └── job_logging/
│   │           └── enrich.yaml
│   ├── __init__.py
│   ├── __main__.py
│   ├── config.py
│   ├── console.py
│   ├── handler.py
│   ├── style.py
│   ├── test.py
│   └── test_log.py
├── __about__.py
├── __init__.py
├── __main__.py
├── configs.py
├── cria.py
├── dist.py
├── history.py
├── jobs.py
├── loadjobenv.py
├── model.py
├── plot.py
├── profile.py
├── runtime.py
├── savejobenv.py
├── test.py
├── test_dist.py
├── train.py
├── trainer.py
└── utils.py
```
## Footnotes

[^1]: Plus, this is useful for tab-completions in your shell, e.g.:

    ```
    $ ezpz_<TAB>
    ezpz_check_and_kill_if_running
    ezpz_get_dist_launch_cmd
    ezpz_get_job_env
    --More--
    ```

[^2]: This is system dependent. See `ezpz_setup_conda`.

[^3]: Any of {Aurora, Polaris, Sophia, Sunspot, Sirius}.

[^4]: At ALCF, if our `$(hostname)` starts with `x*`, we're on a compute node.

[^5]: The `--require-virtualenv` flag isn't strictly required, but I highly recommend always trying to work within a virtual environment, when possible.