Merge pull request #2492 from devitocodes/mpi_notebook_update
examples: Enhance MPI notebook with paper material
mloubout authored Nov 22, 2024
2 parents bedb1a9 + 30755a1 commit b95b323
Showing 3 changed files with 77 additions and 33 deletions.
README.md: 2 changes (1 addition & 1 deletion)
@@ -77,7 +77,7 @@ Key features include:
## Installation

The easiest way to try Devito is through Docker using the following commands:
```bash
# get the code
git clone https://github.com/devitocodes/devito.git
cd devito
benchmarks/user/README.md: 45 changes (19 additions & 26 deletions)
@@ -77,24 +77,28 @@ DEVITO_LANGUAGE=openmp
```
One has two options: either set it explicitly or prepend it to the Python
command. In the former case, assuming a bash shell:
```bash
export DEVITO_LANGUAGE=openmp
```
In the latter case:
```bash
DEVITO_LANGUAGE=openmp python benchmark.py ...
```

## Enabling MPI

To switch on MPI, one should set
```bash
DEVITO_MPI=1
```
and run with `mpirun -n number_of_processes python benchmark.py ...`

Devito supports multiple MPI schemes for halo exchange.

* Devito's three most prevalent MPI modes are `basic`, `diag2`, and `full`,
activated via the corresponding value, e.g. `DEVITO_MPI=basic`.
Which mode performs best depends on factors such as the arithmetic intensity
and the number of fields used in the computation.

## The optimization level

@@ -109,7 +113,7 @@ lines a few sections below.

Auto-tuning can significantly improve the run-time performance of an Operator. It
can be enabled on a per-Operator basis:
```python
op = Operator(...)
op.apply(autotune=True)
```
@@ -162,52 +166,41 @@ Run with `DEVITO_LOGGING=DEBUG` to find out the specific performance
optimizations applied by an Operator, how auto-tuning is getting along, and to
emit more performance metrics.
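
The same can be achieved programmatically; a minimal sketch, assuming the
`log-level` configuration switch:
```python
from devito import configuration

# Equivalent to setting DEVITO_LOGGING=DEBUG in the environment
configuration['log-level'] = 'DEBUG'
```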

## Example commands

The isotropic acoustic wave forward Operator on a `512**3` grid, with space
order 12 and a simulation time of 100ms:
```bash
python benchmark.py run -P acoustic -d 512 512 512 -so 12 --tn 100
```
Like before, but with auto-tuning in `basic` mode:
```bash
python benchmark.py run -P acoustic -d 512 512 512 -so 12 -a basic --tn 100
```
It is also possible to run a TTI forward operator -- here in a 512x402x890
grid:
```bash
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100
```
Same as before, but telling Devito not to use temporaries to store the
intermediate values that stem from mixed derivatives:
```bash
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100 --opt
"('advanced', {'cire-mingain: 1000000'})"
```
Do not forget to pin processes, especially on NUMA systems; below, we use
`numactl` to pin processes and threads to one specific NUMA domain.
```bash
numactl --cpubind=0 --membind=0 python benchmark.py ...
```
While a benchmark is running, you can have some useful programs running in the
background in other shells. For example, to monitor pinning:
```bash
htop
```
or to keep the memory footprint under control:
```bash
watch numastat -m
```

@@ -218,7 +211,7 @@ This is often referred to as the ["JIT backdoor"
mode](https://github.com/devitocodes/devito/wiki/FAQ#can-i-manually-modify-the-c-code-generated-by-devito-and-test-these-modifications).
With ``benchmark.py`` we can exploit this feature to manually hack and test the
code generated for a given benchmark. So, we first run a problem, for example
```bash
python benchmark.py run-jit-backdoor -P acoustic -d 512 512 512 -so 12 --tn 100
```
As you may expect, the ``run-jit-backdoor`` mode accepts exactly the same arguments
@@ -235,7 +228,7 @@ you will see the performance impact of your changes.
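
A minimal sketch of the round trip, assuming the `DEVITO_JIT_BACKDOOR` switch
described in the FAQ (the cache path below is illustrative; the actual location
is printed when running with `DEVITO_LOGGING=DEBUG`):
```python
import os

# 1) Run once to generate and JIT-compile the C code:
#      python benchmark.py run-jit-backdoor -P acoustic -d 512 512 512 -so 12 --tn 100
# 2) Hand-edit the cached C file, e.g. /tmp/devito-jitcache-.../<hash>.c
# 3) Ask Devito to recompile the edited sources rather than regenerating them,
#    then re-run the same command to time the modified code
os.environ['DEVITO_JIT_BACKDOOR'] = '1'
```
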
## Running on HPC clusters

`benchmark.py` can be used to evaluate MPI on multi-node systems:
```bash
mpiexec python benchmark.py ...
```
In `bench` mode, each MPI rank will produce a different `.json` file
examples/mpi/overview.ipynb: 63 changes (57 additions & 6 deletions)
@@ -11,7 +11,7 @@
"* Install an MPI distribution on your system, such as OpenMPI, MPICH, or Intel MPI (if not already available).\n",
"* Install some optional dependencies, including `mpi4py` and `ipyparallel`; from the root Devito directory, run\n",
"```bash\n",
"pip install -r requirements-optional.txt\n",
"pip install -r requirements-mpi.txt\n",
"```\n",
"* Create an `ipyparallel` MPI profile, by running our simple setup script. From the root directory, run\n",
"```bash\n",
@@ -119,7 +119,8 @@
"%%px\n",
"# Keep generated code as simple as possible\n",
"configuration['language'] = 'C'\n",
"# Fix platform so that this notebook can be tested by py.test --nbval\n",
"# Fix platform so that this notebook can have asserted output\n",
"# when tested by ``py.test --nbval\" in any platform\n",
"configuration['platform'] = 'knl7210'"
]
},
@@ -831,10 +832,32 @@
"The Devito compiler applies several optimizations before generating code.\n",
"\n",
"* Redundant halo exchanges are identified and removed. A halo exchange is redundant if a prior halo exchange carries out the same `Function` update and the data is not “dirty” yet.\n",
"* Computation/communication overlap, with explicit prodding of the asynchronous progress engine to make sure that non-blocking communications execute in background during the compute part.\n",
"* Halo exchange communications that could be fired together are preferred over being scattered all over the code.\n",
"* Halo exchanges could also be reshuffled to maximize the extension of the computation/communication overlap region.\n",
"\n",
"To run with all these optimizations enabled, instead of `DEVITO_MPI=1`, users should set `DEVITO_MPI=full`, or, equivalently"
"## Computation/communication patterns\n",
"\n",
"![mpi-modes](https://gist.githubusercontent.com/georgebisbas/aa0e6a2f658728f1bb360f328ee6984a/raw/8c625fb2216dc6f67035856e63985516bbdeb340/mpi-modes.drawio.svg)\n",
"\n",
"Additionally, the Devito compiler offers a few modes of different computation and communication strategies, each exhibiting superiority under specific conditions for a kernel, such as operational intensity, memory footprint, the number of utilized ranks, and the characteristics of the cluster’s interconnect. Some of the best patterns are namely `basic`, `diagonal`, and `full`. Those have proven to be effective in improving the efficiency and scalability of computations, under several scnarios.\n",
"\n",
"- `basic`: The basic pattern is the simplest among the methods presented in this section and targets CPUs and GPUs. This mode, illustrated in Figure 5a, involves blocking point-to-point (P2P) data exchanges perpendicular to the 2D and 3D planes of the Cartesian topology between MPI ranks. For\n",
"example, each rank issues 4 in 2D and 6 communications in 3D. While this mode benefits from fewer communications, it may encounter synchronization bottlenecks during grid updates before computing the next timestep. This method allocates the memory needed to exchange halos in C-land before every communication, only adding negligible overhead.\n",
"\n",
"- `diag2`: Compared to the `basic`, this pattern also performs diagonal data exchanges, facilitating the communication of the corner points in our domains in a single step. This results in more communications, with 8 in 2D and 26 in 3D. Although it involves more communications, they are issued\n",
"in a single step, and the messages are smaller compared to basic. Compared to basic, this mode slightly benefits from preallocated buffers in python-land, eliminating the need to allocate data in C-land before every communication. The latter is why this version is not supported on GPUs since the\n",
"mechanism of pre-allocating buffers on device memory still needs to be supported.\n",
"\n",
"- `full`: This pattern leverages communication/computation overlap. The local-per-rank domain is logically decomposed into an inner (CORE) and an outer (OWNED/remainder) area. In a 3D example, the remainder areas take the form of faces and vector-like areas along the decomposed dimensions. The number of communications is the same as in the diagonal mode. This mode benefits from overlapping\n",
"two steps: halo updating and the stencil computations in the CORE area. After this step, stencil updates are computed in the ``remainder” areas. In the ideal case, assuming that communication is perfectly hidden, the execution time should converge to the time needed to compute the CORE plus the time needed to compute the remainder. An important drawback of this mode is the slower GPts/s achieved at the remainder areas. The elements in the remainder are not contiguous; therefore,\n",
"we have less efficient memory access patterns (strides) along vectorizable dimensions. These areas have lower cache utilization and vectorization efficiency."
]
},
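{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, we can print the mode currently selected on each rank; a minimal sketch using the `configuration` dictionary shown earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%px\n",
"# Show which MPI scheme each rank is currently using\n",
"print(configuration['mpi'])"
]
},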
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's see the `diag2` method:"
]
},
{
@@ -844,14 +867,31 @@
"outputs": [],
"source": [
"%%px\n",
"configuration['mpi'] = 'full'"
"configuration['mpi'] = 'diag2'\n",
"\n",
"op = Operator(Eq(u.forward, u.dx + 1))\n",
"# Uncomment below to show code (it's quite verbose)\n",
"# print(op)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could now peek at the generated code to see that things now look differently."
"The body of the time-stepping loop has slightly changed compared to `basic`:\n",
"\n",
"Some differences are:\n",
"\n",
"* The communication buffers `bufg`, `bufs` are not allocated at C-land, as this already happens in Python-land\n",
"* We now fire `ncomms` communications which are not only vertical or horizontal, but also diagonal.\n",
"This leads to more messages, but slightly smaller compared to `basic`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could now peek at the generated code of the `full` mode and see that things now look differently."
]
},
{
@@ -863,6 +903,8 @@
"outputs": [],
"source": [
"%%px\n",
"configuration['mpi'] = 'full'\n",
"\n",
"op = Operator(Eq(u.forward, u.dx + 1))\n",
"# Uncomment below to show code (it's quite verbose)\n",
"# print(op)"
@@ -879,6 +921,15 @@
"* `halowait0` wait and terminates the non-blocking communications;\n",
"* `remainder0`, which internally calls `compute0`, computes the boundary region requiring the now up-to-date halo data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"More information on Devito's MPI, can be found in this pre-print:\n",
"[Automated MPI-X code generation for scalable finite-difference solvers](https://arxiv.org/abs/2312.13094)"
]
}
],
"metadata": {
