This program measures memory transfer rates in MB/s for simple computational kernels coded in C.
Since 2007, the STREAM benchmark has been used to test and check nodes on clusters managed by CEA/DAM during system updates and maintenance. The benchmark is useful for detecting:
- Memory module failures
- Lack of memory on nodes
- Memory module performance issues
- OS regressions
- Compiler regressions (OpenMP)
Over the years, CEA has added features to detect these problems more efficiently:
- The MPI version can test a whole cluster (more than 8000 nodes) in a single run.
- An option can be used to specify the amount of memory to use instead of a vector size; the array size is then computed to fit this memory budget (see the sketch after this list).
- The output was updated to give a list of nodes sorted by their measured bandwidth.
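The array size computation behind the -m option is straightforward: the memory budget is split across the three double-precision vectors STREAM allocates. The following minimal sketch (an illustration, not the actual CEA source) reproduces the sizes shown in the example outputs below, e.g. -m 1048576 gives an array size of 44739242:

#include <stdio.h>
#include <stdlib.h>

/* Sketch: derive the vector size from a memory budget given in kB,
 * as the -m option does. STREAM allocates three vectors of doubles
 * (a, b and c), so each element costs 3 * sizeof(double) = 24 bytes. */
int main(int argc, char **argv)
{
    long long mem_kb = (argc > 1) ? atoll(argv[1]) : 1048576; /* default: 1 GiB */
    long long n = mem_kb * 1024 / (3 * (long long)sizeof(double));
    printf("Array size = %lld, Total memory required = %.2f KB.\n",
           n, 3.0 * sizeof(double) * n / 1024.0);
    return 0;
}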
To share these new features with the HPC community, CEA published this modified version on GitHub in 2022.
- McCalpin, John D., 1995: "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.
Requirements for the main program:
- C compiler
- OpenMP (optional)
- MPI (optional)
To build with Autotools:
$ ./autogen.sh
$ ./configure # Sequential mode
$ ./configure --with-openmp # OpenMP mode
$ ./configure --with-mpi CC=mpicc # MPI mode
$ ./configure --with-openmp --with-mpi CC=mpicc # MPI/OpenMP mode
$ make
To pass options to the C compiler:
$ ./configure CFLAGS="-mavx2" LDFLAGS="-mavx2"
To build with CMake:
$ mkdir build
$ cd build
$ cmake .. # Sequential mode
$ cmake -DOPENMP=ON .. # OpenMP mode
$ cmake -DMPI=ON .. # MPI mode
$ cmake -DOPENMP=ON -DMPI=ON .. # MPI/OpenMP mode
$ make VERBOSE=1
To pass options to the C compiler:
$ cmake -DCMAKE_C_FLAGS="-mavx2" -DCMAKE_EXE_LINKER_FLAGS="-mavx2" ..
After building, the -h option prints the usage:
$ ./stream.exe -h
MPI_STREAM CEA MPI/OpenMP version $Revision: X.Y $
Usage: ./stream.exe [-h] [-n N] [-m mem] [-t ntimes] [-o offset]
Options:
-n N Size of a vector
-m mem Memory (kB) used per process
-t ntimes Number of times the computation will run
-o offset Offset
-h Print this help
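The -t option corresponds to STREAM's NTIMES: each kernel is timed ntimes and only the best (minimum) time enters the bandwidth computation. Below is a minimal single-threaded sketch of this scheme for the Triad kernel, assuming a POSIX gettimeofday() clock; it is an illustration, not the actual CEA source, and follows the original STREAM convention of discarding the first pass as warm-up:

#include <stdio.h>
#include <stdlib.h>
#include <float.h>
#include <sys/time.h>

#define N 10000000L  /* vector size (the -n option) */
#define NTIMES 10    /* repetitions (the -t option) */

/* Wall-clock time in seconds, in the spirit of STREAM's mysecond(). */
static double mysecond(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double times[NTIMES], mintime = DBL_MAX, scalar = 3.0;

    for (long j = 0; j < N; j++) { b[j] = 2.0; c[j] = 1.0; }

    for (int k = 0; k < NTIMES; k++) {   /* run the kernel NTIMES */
        double t = mysecond();
        for (long j = 0; j < N; j++)     /* Triad: a = b + scalar*c */
            a[j] = b[j] + scalar * c[j];
        times[k] = mysecond() - t;
    }
    for (int k = 1; k < NTIMES; k++)     /* best time, skipping the */
        if (times[k] < mintime)          /* first (warm-up) pass    */
            mintime = times[k];

    /* Triad moves 3 arrays of 8-byte doubles; here MB means 1e6 bytes. */
    printf("Triad: %.1f MB/s\n", 3.0 * 8.0 * N / mintime / 1.0e6);
    free(a); free(b); free(c);
    return 0;
}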
For example, you can launch the sequential mode with 1 GB of memory (1048576 kB):
$ ./stream.exe -m 1048576
-------------------------------------------------------------
MPI_STREAM CEA MPI/OpenMP version $Revision: X.Y $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 44739242, Offset = 0
Total memory required = 1048575.98 KB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 1
-------------------------------------------------------------
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 20594 microseconds.
(= 20594 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 42299.2351 0.0171 0.0169 0.0172
Scale: 42512.4562 0.0171 0.0168 0.0171
Add: 42376.8484 0.0255 0.0253 0.0256
Triad: 42440.3442 0.0255 0.0253 0.0257
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
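For reference, the reported rate is the number of bytes a kernel moves divided by its best time, with MB meaning 10^6 bytes: Copy and Scale touch two vectors (16 bytes per element), Add and Triad three (24 bytes per element). For Triad above, 3 × 8 B × 44739242 / 0.0253 s ≈ 42440 MB/s, matching the reported value (the times in the table are rounded).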
To test 4 nodes with 128 cores and 256 GB of memory each, you can launch the MPI/OpenMP mode with 220 GB per node (220 × 1024 × 1024 kB = 230686720 kB):
$ OMP_NUM_THREADS=128 mpirun -n 4 -cpus-per-rank 128 ./stream.exe -m 230686720
-------------------------------------------------------------
MPI_STREAM CEA MPI/OpenMP version $Revision: X.Y $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 9842633386, Offset = 0
Total memory required = 230686719.98 KB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 128
-------------------------------------------------------------
Number of MPI Processes = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 487067 microseconds.
(= 487067 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Triad Rate (MB/s):
node6201 336870.222291
node6202 336899.087888
node6203 336828.994303
node6204 336809.300049
============SUMMARY============
TRIAD_MAX = 336899.087888 MB/s on node6202
TRIAD_MIN = 336809.300049 MB/s on node6204
TRIAD_AVG = 336851.901133 MB/s
TRIAD_AVG_per_proc = 2631.655478 MB/s
TRIAD_STDD = 40.422064 MB/s
==========END SUMMARY==========
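In this summary, TRIAD_MAX, TRIAD_MIN and TRIAD_AVG are taken over the per-node Triad rates, and TRIAD_STDD is their sample standard deviation (with an n-1 denominator). TRIAD_AVG_per_proc here works out to the node average divided by the 128 threads per node (336851.90 / 128 ≈ 2631.66 MB/s), i.e. the average bandwidth per core.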
See the list of AUTHORS who participated in this project.
Laurent Nguyen - [email protected]
Copyright 2007-2022 CEA/DAM/DIF
MPI_STREAM is distributed under the original license of the STREAM benchmark.
See the included file LICENSE.txt (English version).