Merge pull request #2492 from devitocodes/mpi_notebook_update
examples: Enhance MPI notebook with paper material
mloubout authored Nov 22, 2024
2 parents bedb1a9 + 30755a1 commit b95b323
Showing 3 changed files with 77 additions and 33 deletions.
README.md: 2 changes (1 addition & 1 deletion)
@@ -77,7 +77,7 @@ Key features include:
## Installation

The easiest way to try Devito is through Docker using the following commands:
```bash
# get the code
git clone https://github.com/devitocodes/devito.git
cd devito
benchmarks/user/README.md: 45 changes (19 additions & 26 deletions)
@@ -77,24 +77,28 @@ DEVITO_LANGUAGE=openmp
```
One has two options: either set it explicitly or prepend it to the Python
command. In the former case, assuming a bash shell:
```bash
export DEVITO_LANGUAGE=openmp
```
In the latter case:
```bash
DEVITO_LANGUAGE=openmp python benchmark.py ...
```

## Enabling MPI

To switch on MPI, one should set
```bash
DEVITO_MPI=1
```
and run with `mpirun -n number_of_processes python benchmark.py ...`

Devito supports multiple MPI schemes for halo exchange.

* Devito's three most prevalent MPI modes are `basic`, `diag2`, and `full`,
activated via the corresponding value, e.g. `DEVITO_MPI=basic`.
Which mode performs best depends on factors such as the arithmetic intensity
and the number of fields used in the computation.

## The optimization level

@@ -109,7 +113,7 @@ lines a few sections below.

Auto-tuning can significantly improve the run-time performance of an Operator. It
can be enabled on a per-Operator basis:
```python
op = Operator(...)
op.apply(autotune=True)
```
@@ -162,52 +166,41 @@ Run with `DEVITO_LOGGING=DEBUG` to find out the specific performance
optimizations applied by an Operator, how auto-tuning is getting along, and to
emit more performance metrics.
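
The same can be achieved programmatically; a minimal sketch, assuming the
`log-level` configuration switch:
```python
from devito import configuration

# Equivalent to setting DEVITO_LOGGING=DEBUG in the environment
configuration['log-level'] = 'DEBUG'
```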

## Example commands

The isotropic acoustic wave forward Operator on a `512**3` grid, with space
order 12 and a simulation time of 100ms:
```bash
python benchmark.py run -P acoustic -d 512 512 512 -so 12 --tn 100
```
Like before, but with auto-tuning in `basic` mode:
```bash
python benchmark.py run -P acoustic -d 512 512 512 -so 12 -a basic --tn 100
```
It is also possible to run a TTI forward operator -- here in a 512x402x890
grid:
```bash
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100
```
Same as before, but telling Devito not to use temporaries to store the
intermediate values that stem from mixed derivatives:
```bash
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100 --opt
"('advanced', {'cire-mingain: 1000000'})"
```
Do not forget to pin processes, especially on NUMA systems; below, we use
`numactl` to pin processes and threads to one specific NUMA domain.
```bash
numactl --cpubind=0 --membind=0 python benchmark.py ...
```
While a benchmark is running, you can have some useful programs running in the
background in other shells. For example, to monitor pinning:
```bash
htop
```
or to keep the memory footprint under control:
```bash
watch numastat -m
```

@@ -218,7 +211,7 @@ This is often referred to as the ["JIT backdoor"
mode](https://github.com/devitocodes/devito/wiki/FAQ#can-i-manually-modify-the-c-code-generated-by-devito-and-test-these-modifications).
With ``benchmark.py`` we can exploit this feature to manually hack and test the
code generated for a given benchmark. So, we first run a problem, for example
```bash
python benchmark.py run-jit-backdoor -P acoustic -d 512 512 512 -so 12 --tn 100
```
As you may expect, the ``run-jit-backdoor`` mode accepts exactly the same arguments
@@ -235,7 +228,7 @@ you will see the performance impact of your changes.
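
A minimal sketch of the round trip, assuming the `DEVITO_JIT_BACKDOOR` switch
described in the FAQ (the cache path below is illustrative; the actual location
is printed when running with `DEVITO_LOGGING=DEBUG`):
```python
import os

# 1) Run once to generate and JIT-compile the C code:
#      python benchmark.py run-jit-backdoor -P acoustic -d 512 512 512 -so 12 --tn 100
# 2) Hand-edit the cached C file, e.g. /tmp/devito-jitcache-.../<hash>.c
# 3) Ask Devito to recompile the edited sources rather than regenerating them,
#    then re-run the same command to time the modified code
os.environ['DEVITO_JIT_BACKDOOR'] = '1'
```
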
## Running on HPC clusters

`benchmark.py` can be used to evaluate MPI on multi-node systems:
```bash
mpiexec python benchmark.py ...
```
In `bench` mode, each MPI rank will produce a different `.json` file
examples/mpi/overview.ipynb: 63 changes (57 additions & 6 deletions)
@@ -11,7 +11,7 @@
"* Install an MPI distribution on your system, such as OpenMPI, MPICH, or Intel MPI (if not already available).\n",
"* Install some optional dependencies, including `mpi4py` and `ipyparallel`; from the root Devito directory, run\n",
"```bash\n",
"pip install -r requirements-optional.txt\n",
"pip install -r requirements-mpi.txt\n",
"```\n",
"* Create an `ipyparallel` MPI profile, by running our simple setup script. From the root directory, run\n",
"```bash\n",
@@ -119,7 +119,8 @@
"%%px\n",
"# Keep generated code as simple as possible\n",
"configuration['language'] = 'C'\n",
"# Fix platform so that this notebook can be tested by py.test --nbval\n",
"# Fix platform so that this notebook can have asserted output\n",
"# when tested by ``py.test --nbval\" in any platform\n",
"configuration['platform'] = 'knl7210'"
]
},
@@ -831,10 +832,32 @@
"The Devito compiler applies several optimizations before generating code.\n",
"\n",
"* Redundant halo exchanges are identified and removed. A halo exchange is redundant if a prior halo exchange carries out the same `Function` update and the data is not “dirty” yet.\n",
"* Computation/communication overlap, with explicit prodding of the asynchronous progress engine to make sure that non-blocking communications execute in background during the compute part.\n",
"* Halo exchange communications that could be fired together are preferred over being scattered all over the code.\n",
"* Halo exchanges could also be reshuffled to maximize the extension of the computation/communication overlap region.\n",
"\n",
"To run with all these optimizations enabled, instead of `DEVITO_MPI=1`, users should set `DEVITO_MPI=full`, or, equivalently"
"## Computation/communication patterns\n",
"\n",
"![mpi-modes](https://gist.githubusercontent.com/georgebisbas/aa0e6a2f658728f1bb360f328ee6984a/raw/8c625fb2216dc6f67035856e63985516bbdeb340/mpi-modes.drawio.svg)\n",
"\n",
"Additionally, the Devito compiler offers a few modes of different computation and communication strategies, each exhibiting superiority under specific conditions for a kernel, such as operational intensity, memory footprint, the number of utilized ranks, and the characteristics of the cluster’s interconnect. Some of the best patterns are namely `basic`, `diagonal`, and `full`. Those have proven to be effective in improving the efficiency and scalability of computations, under several scnarios.\n",
"\n",
"- `basic`: The basic pattern is the simplest among the methods presented in this section and targets CPUs and GPUs. This mode, illustrated in Figure 5a, involves blocking point-to-point (P2P) data exchanges perpendicular to the 2D and 3D planes of the Cartesian topology between MPI ranks. For\n",
"example, each rank issues 4 in 2D and 6 communications in 3D. While this mode benefits from fewer communications, it may encounter synchronization bottlenecks during grid updates before computing the next timestep. This method allocates the memory needed to exchange halos in C-land before every communication, only adding negligible overhead.\n",
"\n",
"- `diag2`: Compared to the `basic`, this pattern also performs diagonal data exchanges, facilitating the communication of the corner points in our domains in a single step. This results in more communications, with 8 in 2D and 26 in 3D. Although it involves more communications, they are issued\n",
"in a single step, and the messages are smaller compared to basic. Compared to basic, this mode slightly benefits from preallocated buffers in python-land, eliminating the need to allocate data in C-land before every communication. The latter is why this version is not supported on GPUs since the\n",
"mechanism of pre-allocating buffers on device memory still needs to be supported.\n",
"\n",
"- `full`: This pattern leverages communication/computation overlap. The local-per-rank domain is logically decomposed into an inner (CORE) and an outer (OWNED/remainder) area. In a 3D example, the remainder areas take the form of faces and vector-like areas along the decomposed dimensions. The number of communications is the same as in the diagonal mode. This mode benefits from overlapping\n",
"two steps: halo updating and the stencil computations in the CORE area. After this step, stencil updates are computed in the ``remainder” areas. In the ideal case, assuming that communication is perfectly hidden, the execution time should converge to the time needed to compute the CORE plus the time needed to compute the remainder. An important drawback of this mode is the slower GPts/s achieved at the remainder areas. The elements in the remainder are not contiguous; therefore,\n",
"we have less efficient memory access patterns (strides) along vectorizable dimensions. These areas have lower cache utilization and vectorization efficiency."
]
},
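{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, we can print the mode currently selected on each rank; a minimal sketch using the `configuration` dictionary shown earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%px\n",
"# Show which MPI scheme each rank is currently using\n",
"print(configuration['mpi'])"
]
},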
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's see the `diag2` method:"
]
},
{
@@ -844,14 +867,31 @@
"outputs": [],
"source": [
"%%px\n",
"configuration['mpi'] = 'full'"
"configuration['mpi'] = 'diag2'\n",
"\n",
"op = Operator(Eq(u.forward, u.dx + 1))\n",
"# Uncomment below to show code (it's quite verbose)\n",
"# print(op)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could now peek at the generated code to see that things now look differently."
"The body of the time-stepping loop has slightly changed compared to `basic`:\n",
"\n",
"Some differences are:\n",
"\n",
"* The communication buffers `bufg`, `bufs` are not allocated at C-land, as this already happens in Python-land\n",
"* We now fire `ncomms` communications which are not only vertical or horizontal, but also diagonal.\n",
"This leads to more messages, but slightly smaller compared to `basic`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could now peek at the generated code of the `full` mode and see that things now look differently."
]
},
{
@@ -863,6 +903,8 @@
"outputs": [],
"source": [
"%%px\n",
"configuration['mpi'] = 'full'\n",
"\n",
"op = Operator(Eq(u.forward, u.dx + 1))\n",
"# Uncomment below to show code (it's quite verbose)\n",
"# print(op)"
@@ -879,6 +921,15 @@
"* `halowait0` wait and terminates the non-blocking communications;\n",
"* `remainder0`, which internally calls `compute0`, computes the boundary region requiring the now up-to-date halo data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"More information on Devito's MPI, can be found in this pre-print:\n",
"[Automated MPI-X code generation for scalable finite-difference solvers](https://arxiv.org/abs/2312.13094)"
]
}
],
"metadata": {
