All documentation was recently updated for the 2.6.4 release. Documentation sometimes goes stale; thanks for your patience! If you find something out of date, please check the wiki.
One of the key components of the XPRESS project is a new approach to performance observation, measurement, analysis and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX [1] applications (as well as other asynchronous multitasking runtime architectures) require a new approach to parallel performance observation. The standard model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation within OpenX. The approach taken in the XPRESS project is a new performance measurement system, called APEX (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, with merging of third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads. For a complete academic description of APEX, see the publication "APEX: An Autonomic Performance Environment for eXascale" [2].
In short, APEX is an introspection and runtime adaptation library for asynchronous multitasking runtime systems. However, APEX is not only useful for AMT/AMR runtimes - it can be used by any application wanting to perform runtime adaptation to deal with heterogeneous and/or variable environments.
APEX provides an API for measuring actions within a runtime. The API includes methods for timer start/stop, as well as sampled counter values. APEX is designed to be integrated into a runtime, library and/or application and provide performance introspection for the purpose of runtime adaptation. While APEX can provide rudimentary post-mortem performance analysis measurement, there are many other performance measurement tools that perform that task much better (such as TAU). That said, APEX includes an event listener that integrates with the TAU measurement system, so APEX events can be forwarded to TAU and collected in a TAU profile and/or trace to be used for post-mortem performance analysis.
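To make the timer start/stop and sampled-counter pattern concrete, here is a minimal, self-contained sketch. It is not the APEX API: `ToyProfile`, `start`, `stop`, and `sample_value` as written here are hypothetical stand-ins that only illustrate the kind of bookkeeping behind the summary tables shown in the quickstart (number of calls, mean, total, minimum, maximum). For the real function names and signatures, consult the API reference.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a measurement library; NOT the APEX API.
class ToyProfile {
public:
    using Clock = std::chrono::steady_clock;

    // Accumulated statistics for one named timer.
    struct TimerStats {
        std::size_t calls = 0;
        double total_seconds = 0.0;
        double mean_seconds() const {
            return calls ? total_seconds / static_cast<double>(calls) : 0.0;
        }
    };

    // Accumulated statistics for one named sampled counter.
    struct CounterStats {
        std::size_t samples = 0;
        double minimum = 0.0;
        double maximum = 0.0;
        double sum = 0.0;
        double mean() const {
            return samples ? sum / static_cast<double>(samples) : 0.0;
        }
    };

    // Handle returned by start() and consumed by stop().
    struct Handle {
        std::string name;
        Clock::time_point begin;
    };

    // Begin timing the region called `name`.
    Handle start(const std::string& name) const {
        return Handle{name, Clock::now()};
    }

    // End the timed region and fold the elapsed time into its statistics.
    void stop(const Handle& handle) {
        const std::chrono::duration<double> elapsed = Clock::now() - handle.begin;
        TimerStats& stats = timers_[handle.name];
        stats.calls += 1;
        stats.total_seconds += elapsed.count();
    }

    // Record one sample of a counter (for example, a queue length).
    void sample_value(const std::string& name, double value) {
        CounterStats& stats = counters_[name];
        if (stats.samples == 0) {
            stats.minimum = value;
            stats.maximum = value;
        } else {
            if (value < stats.minimum) stats.minimum = value;
            if (value > stats.maximum) stats.maximum = value;
        }
        stats.samples += 1;
        stats.sum += value;
    }

    // Look up statistics; the name must have been recorded at least once.
    const TimerStats& timer(const std::string& name) const {
        auto it = timers_.find(name);
        assert(it != timers_.end());
        return it->second;
    }

    const CounterStats& counter(const std::string& name) const {
        auto it = counters_.find(name);
        assert(it != counters_.end());
        return it->second;
    }

private:
    std::map<std::string, TimerStats> timers_;
    std::map<std::string, CounterStats> counters_;
};
```

A caller would bracket a region with `auto h = profile.start("compute"); ... profile.stop(h);` and call `profile.sample_value("queue length", n)` wherever a value is observed. The sketch is single-threaded on purpose; a real measurement library also has to handle concurrent threads and tasks that migrate between them.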
APEX provides a mechanism for dynamic runtime behavior, either for autotuning or adaptation to a changing environment. The infrastructure that provides the adaptation is the Policy Engine, which executes policies either periodically or when triggered by events. The policies have access to the performance state as observed by the APEX introspection API. APEX is integrated with Active Harmony to provide dynamic search for autotuning.
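The two ways of running a policy (periodically, or in response to an event) can be sketched with a toy engine. Everything below is hypothetical and is not the APEX Policy Engine API: `ToyPolicyEngine`, `PerfState`, `register_periodic`, `register_on_event`, `tick`, and `raise` are illustrative names, and the periodic clock is advanced by hand so the behavior is deterministic, where a real engine would presumably use a timer.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical snapshot of observed performance state; NOT an APEX type.
struct PerfState {
    std::map<std::string, double> counters;  // latest value per counter name
};

// Hypothetical policy engine sketch; names are illustrative only.
class ToyPolicyEngine {
public:
    using Policy = std::function<void(const PerfState&)>;

    // Run `policy` every `period` ticks of the engine clock.
    void register_periodic(unsigned period, Policy policy) {
        assert(period > 0);
        periodic_.push_back({period, std::move(policy)});
    }

    // Run `policy` each time the named event is raised.
    void register_on_event(const std::string& event, Policy policy) {
        triggered_[event].push_back(std::move(policy));
    }

    // Advance the clock by one tick and fire every periodic policy that is due.
    void tick(const PerfState& state) {
        ++now_;
        for (const auto& entry : periodic_) {
            if (now_ % entry.period == 0) entry.policy(state);
        }
    }

    // Fire every policy registered for `event`; unknown events are ignored.
    void raise(const std::string& event, const PerfState& state) {
        auto it = triggered_.find(event);
        if (it == triggered_.end()) return;
        for (const auto& policy : it->second) policy(state);
    }

private:
    struct Periodic {
        unsigned period;
        Policy policy;
    };
    std::vector<Periodic> periodic_;
    std::map<std::string, std::vector<Policy>> triggered_;
    unsigned long now_ = 0;
};
```

A policy is just a callable that reads the state and adjusts some knob owned by the application, for example lowering a thread cap when an observed idle rate is high. Keeping policies as plain callables means the same function can be registered for both kinds of trigger.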
Full user documentation is available here. For hot-off-the-press changes/updates, please check the wiki.
The source code is instrumented with Doxygen comments, and the API reference manual can be generated by executing `make doc` in the build directory, after CMake configuration. [A fairly recent version of the API reference documentation is also available here](http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html).
[Full installation documentation is available here](http://uo-oaciss.github.io/apex). Below is a quickstart for the impatient.
These instructions are for building the stand-alone APEX library. For instructions on building APEX with HPX, please see http://uo-oaciss.github.io/apex/usage
To build APEX stand-alone (to use with OpenMP, OpenACC, CUDA, Kokkos, TBB, C++ threads, etc.) do the following:
git clone https://github.com/UO-OACISS/apex.git
cd apex
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=TRUE ..
make -j
To run an example (since `-DBUILD_EXAMPLES=TRUE` was set), just run the Matmult example; you should see output similar to the following:
[khuck@eagle xpress-apex]$ ./build/src/examples/Matmult/matmult
Spawned thread 1...
Spawned thread 2...
Spawned thread 3...
Done.
Elapsed time: 0.300207 seconds
Cores detected: 128
Worker Threads observed: 4
Available CPU time: 1.20083 seconds
Counter : #samples | minimum | mean | maximum | stddev
------------------------------------------------------------------------------------------------
status:Threads : 1 6.000 6.000 6.000 0.000
status:VmData : 1 4.93e+04 4.93e+04 4.93e+04 0.000
status:VmExe : 1 64.000 64.000 64.000 0.000
status:VmHWM : 1 7808.000 7808.000 7808.000 0.000
status:VmLck : 1 0.000 0.000 0.000 0.000
status:VmLib : 1 6336.000 6336.000 6336.000 0.000
status:VmPMD : 1 16.000 16.000 16.000 0.000
status:VmPTE : 1 4.000 4.000 4.000 0.000
status:VmPeak : 1 3.80e+05 3.80e+05 3.80e+05 0.000
status:VmPin : 1 0.000 0.000 0.000 0.000
status:VmRSS : 1 7808.000 7808.000 7808.000 0.000
status:VmSize : 1 3.15e+05 3.15e+05 3.15e+05 0.000
status:VmStk : 1 192.000 192.000 192.000 0.000
status:VmSwap : 1 0.000 0.000 0.000 0.000
status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000
status:voluntary_ctxt_switches : 1 77.000 77.000 77.000 0.000
------------------------------------------------------------------------------------------------
Timer : #calls | mean | total | % total
------------------------------------------------------------------------------------------------
APEX MAIN : 1 0.300 0.300 100.000
allocateMatrix : 12 0.009 0.108 9.023
compute : 4 0.206 0.825 68.736
compute_interchange : 4 0.064 0.257 21.369
do_work : 4 0.298 1.193 99.313
freeMatrix : 12 0.000 0.000 0.025
initialize : 12 0.000 0.002 0.146
main : 1 0.299 0.299 24.930
------------------------------------------------------------------------------------------------
Total timers : 49
HPX (High Performance ParalleX) is the original implementation of the ParalleX model. Developed and maintained by the Ste||ar Group at Louisiana State University, HPX is implemented in C++. For more information, see http://stellar.cct.lsu.edu/tag/hpx/. For a tutorial on HPX with APEX (presented at SC'17, Austin TX) see http://www.nic.uoregon.edu/~khuck/SC17-HPX-APEX.pdf. The integration specification is available here.
HPX-5 (High Performance ParalleX) is a second implementation of the ParalleX model. Developed and maintained by the CREST Group at Indiana University, HPX-5 is implemented in C. For more information, see https://hpx.crest.iu.edu.
POSIX.1 specifies a set of interfaces (functions, header files) for threaded programming commonly known as POSIX threads, or Pthreads. A single process can contain multiple threads, all of which are executing the same program. These threads share the same global memory (data and heap segments), but each thread has its own stack (automatic variables). C++ threads are a portable, language-level abstraction on top of native threading implementations. APEX supports pthreads by wrapping and capturing the `pthread_create` function call. For more information, see https://man7.org/linux/man-pages/man7/pthreads.7.html and https://www.cplusplus.com/reference/thread/thread/.
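As a rough illustration of what wrapping `pthread_create` can involve, the sketch below replaces the caller's start routine with a trampoline that runs bookkeeping before and after the user function. This is an assumption about the general technique, not a description of how APEX implements it: `wrapped_pthread_create`, `Trampoline`, and the two counters are hypothetical names, and the sketch does not perform the interposition itself (a tool typically does that with a preloaded library or a linker option so that existing calls are redirected without source changes).

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>

namespace toy {

using StartRoutine = void* (*)(void*);

// Hypothetical bookkeeping a measurement tool might keep per process.
std::atomic<int> threads_started{0};
std::atomic<int> threads_finished{0};

// Carries the caller's start routine and argument into the new thread.
struct Trampoline {
    StartRoutine user_fn;
    void* user_arg;
};

// Runs inside the new thread: bookkeeping, then the user's function.
void* trampoline_entry(void* raw) {
    Trampoline* t = static_cast<Trampoline*>(raw);
    StartRoutine fn = t->user_fn;
    void* arg = t->user_arg;
    delete t;

    ++threads_started;       // a tool would register the new thread here
    void* result = fn(arg);  // the caller's code runs unchanged
    ++threads_finished;      // and finalize the thread's measurements here
    return result;
}

// Same signature as pthread_create, but routes through the trampoline.
int wrapped_pthread_create(pthread_t* thread, const pthread_attr_t* attr,
                           StartRoutine fn, void* arg) {
    assert(fn != nullptr);
    Trampoline* t = new Trampoline{fn, arg};
    const int rc = pthread_create(thread, attr, trampoline_entry, t);
    if (rc != 0) delete t;  // the thread never started, so free it here
    return rc;
}

}  // namespace toy
```

Because the wrapper keeps the signature of `pthread_create`, a call site changes only in the function name, and the thread's return value still reaches `pthread_join`. One limitation of this simple form: a thread that ends by calling `pthread_exit` skips the code after `fn(arg)`, so the finish hook would need a different mechanism in that case.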
The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. For more information, see http://openmp.org/.
OpenACC is a user-driven directive-based performance-portable parallel programming model. It is designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model. The OpenACC specification supports the C, C++, and Fortran programming languages and multiple hardware architectures including X86 & POWER CPUs, and NVIDIA GPUs. For more information, see https://www.openacc.org.
Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. For that purpose it provides abstractions for both parallel execution of code and data management. Kokkos is designed to target complex node architectures with N-level memory hierarchies and multiple types of execution resources. It currently can use CUDA, HIP, HPX, OpenMP and Pthreads as backend programming models with several other backends in development. For more information, see https://kokkos.org.
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs. APEX uses the CUPTI and NVML libraries provided by NVIDIA to gather performance information from the GPUs. For more information, see https://developer.nvidia.com/cupti and https://developer.nvidia.com/nvidia-management-library-nvml.
Heterogeneous-Computing Interface for Portability (HIP) is a C++ dialect from AMD designed to ease conversion of CUDA applications to portable C++ code. It provides a C-style API and a C++ kernel language. The C++ interface can use templates and classes across the host/kernel boundary. APEX uses the roctracer library to gather performance information from the GPUs. For more information, see https://github.com/ROCm-Developer-Tools/roctracer.
[1] Thomas Sterling, Daniel Kogler, Matthew Anderson, and Maciej Brodowicz. "SLOWER: A performance model for Exascale computing". Supercomputing Frontiers and Innovations, 1:42–57, September 2014. http://superfri.org/superfri/article/view/10
[2] Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. "An Autonomic Performance Environment for eXascale", Journal of Supercomputing Frontiers and Innovations, 2015. http://superfri.org/superfri/article/view/64