Prototype testbed for various memory acceleration schemes focused on improving sparse memory accesses and scatter/gather performance through indirection arrays.
Client/Controller design to service sparse memory requests (0, 1, and 2 levels of indirection)
- CMake (Version >= 3.12)
- C and C++ compilers (the C++ compiler must support C++20 for the UME submodule, e.g. gcc 12.2.0 or newer)
- OpenMP (for calibration test)
- Python3 (for scripts)
- MPI (for use with Ume client)
- Caliper (for profiling Ume client)
Included directly. Source code and build systems have been modified where needed to work with Scoria.
Currently builds on Intel architectures. An Arm version is in progress.
git clone git@github.com:lanl/scoria.git
cd scoria
mkdir build
cd build
cmake ..
make
git clone git@github.com:lanl/scoria.git
cd scoria
mkdir build
cd build
cmake -DUSE_MPI=ON ..
make
Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
---|---|---|---|---|
USE_MPI | Build with MPI | OFF | Complete | |
USE_CALIPER | Build with Caliper | OFF | Complete | |
Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
---|---|---|---|---|
Scoria_REQUIRE_AVX | Build with AVX 512 Support (-mavx512f) | OFF | Complete | USE_AVX |
Scoria_REQUIRE_SVE | Build with SVE Support (-march=armv8.2-a) | OFF | In Progress | USE_SVE |
Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
---|---|---|---|---|
MAX_CLIENTS | Maximum number of clients that can simultaneously connect to the controller | 1 | Complete | MAX_CLIENTS |
REQUEST_QUEUE_SIZE | Size of the request queue for each client | 100 | Complete | REQUEST_QUEUE_SIZE |
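
As a rough illustration of how compile-time limits like these are typically consumed, the sketch below sizes a fixed-capacity ring buffer from `REQUEST_QUEUE_SIZE` and allocates one per client from `MAX_CLIENTS`. Only the two macro names and their defaults come from the table above; the `demo_queue` and `demo_controller` types and their functions are hypothetical and are not taken from the Scoria sources.

```c
#include <stdbool.h>
#include <stddef.h>

/* Defaults mirror the table above; override with e.g. -DMAX_CLIENTS=4. */
#ifndef MAX_CLIENTS
#define MAX_CLIENTS 1
#endif
#ifndef REQUEST_QUEUE_SIZE
#define REQUEST_QUEUE_SIZE 100
#endif

/* Hypothetical fixed-capacity FIFO of request ids (not Scoria's real type). */
struct demo_queue {
  int slots[REQUEST_QUEUE_SIZE];
  size_t head;  /* index of the oldest entry */
  size_t count; /* number of entries currently stored */
};

/* One queue per client, so total storage is fixed at compile time. */
struct demo_controller {
  struct demo_queue queues[MAX_CLIENTS];
};

/* Append an id; returns false when the queue is full. */
static bool demo_queue_push(struct demo_queue *q, int id) {
  if (q->count == REQUEST_QUEUE_SIZE)
    return false;
  q->slots[(q->head + q->count) % REQUEST_QUEUE_SIZE] = id;
  q->count++;
  return true;
}

/* Remove the oldest id into *id; returns false when the queue is empty. */
static bool demo_queue_pop(struct demo_queue *q, int *id) {
  if (q->count == 0)
    return false;
  *id = q->slots[q->head];
  q->head = (q->head + 1) % REQUEST_QUEUE_SIZE;
  q->count--;
  return true;
}
```

With this layout, a client that submits more than `REQUEST_QUEUE_SIZE` requests without waiting on any of them would see `demo_queue_push` fail, which is one reason to raise the limit for request-heavy clients.
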
Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
---|---|---|---|---|
Scoria_REQUIRE_CLIENTS | Build example clients located in clients directory | ON | Complete | None |
Scoria_REQUIRE_TESTS | Build benchmark tests based on tests/test.c | ON | Complete | None |
Scoria_REQUIRE_CALIBRATION_TESTS | Build calibration tests based on tests/calibration.c | OFF | Complete | None |
Scoria_REQUIRE_TIMING | Build Scoria with internal timing + build tests to print internal results | OFF | Complete | Scoria_REQUIRE_TIMING |
Scoria_SCALE_BW | Build Scoria and tests to account for indirection arrays when calculating bandwidth | OFF | Complete | SCALE_BW |
Scoria_SINGLE_ALLOC | Build benchmark and calibration tests with single allocation policy | OFF | Complete | SINGLE_ALLOC |
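
To make the effect of `Scoria_SCALE_BW` concrete, here is a minimal sketch of a bandwidth calculation that optionally counts the indirection arrays. The formula is an assumption, not Scoria's actual accounting: it charges one read and one write of a double per element, plus one `size_t` index read per element for each level of indirection when scaling is enabled. The function names are hypothetical.

```c
#include <stddef.h>

/* Bytes moved by one gather or scatter of n doubles.
 * ASSUMPTION: each element is read once and written once; when scale_bw is
 * nonzero, every level of indirection adds one size_t index read per element. */
static size_t demo_bytes_moved(size_t n, int levels, int scale_bw) {
  size_t bytes = 2 * n * sizeof(double);
  if (scale_bw)
    bytes += (size_t)levels * n * sizeof(size_t);
  return bytes;
}

/* Bandwidth in GiB/s for a kernel that took `seconds` to run. */
static double demo_bandwidth_gib(size_t n, int levels, int scale_bw,
                                 double seconds) {
  const double gib = 1024.0 * 1024.0 * 1024.0;
  return (double)demo_bytes_moved(n, levels, scale_bw) / seconds / gib;
}
```

Under this assumption the scaled and unscaled figures agree for 0 levels of indirection and diverge as levels are added, so results from builds with different `Scoria_SCALE_BW` settings are not directly comparable.
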
Build Scoria with only bandwidth tests (no clients, no calibration tests and no AVX/SVE)
cmake -DScoria_REQUIRE_CLIENTS=OFF ..
make
The `test` and `test_client` executables should be in the `tests` directory, along with the `scoria` executable in the base build directory.
Build Scoria with both bandwidth and calibration tests (no clients and no AVX/SVE)
cmake -DScoria_REQUIRE_CALIBRATION_TESTS=ON -DScoria_REQUIRE_CLIENTS=OFF ..
make
The `test`, `test_client`, `test_calibration`, and `test_calibration_client` executables should be in the `tests` directory, along with the `scoria` executable in the base build directory.
Build Scoria with bandwidth and calibration tests and clients (no AVX/SVE)
cmake -DScoria_REQUIRE_CALIBRATION_TESTS=ON ..
make
The `test`, `test_client`, `test_calibration`, and `test_calibration_client` executables should be in the `tests` directory, the `simple_client` and `spatter` executables should be in the `clients` directory, and the `scoria` executable should be in the base build directory.
Build Scoria with bandwidth and calibration tests and clients with AVX intrinsics, internal timing, and bandwidth scaling enabled, along with the ability to manage 4 clients simultaneously
cmake -DScoria_REQUIRE_CALIBRATION_TESTS=ON -DScoria_REQUIRE_AVX=ON -DScoria_REQUIRE_TIMING=ON -DScoria_SCALE_BW=ON -DMAX_CLIENTS=4 ..
make
Build UME with AVX Extension and Scoria
cmake -DUSE_MPI=ON -DUSE_CALIPER=ON -DScoria_REQUIRE_AVX=ON ..
make
The `ume_mpi` and `ume_serial` executables should be in the `clients/UME/src` directory.
Build UME + Scoria with Caliper Profiling and AVX
mkdir caliper_build
cd caliper_build
cmake -DUSE_CALIPER=ON -DUSE_MPI=ON -DScoria_REQUIRE_AVX=ON ..
make
The `ume_mpi` and `ume_serial` executables should be in the `clients/UME/src` directory.
Build baseline (non-Scoria) UME for Caliper profiling without AVX
cd clients/UME
mkdir caliper_build
cd caliper_build
cmake -DUSE_CALIPER=ON -DUSE_MPI=ON ../
make
The `ume_mpi` and `ume_serial` executables should be in the `src` directory.
The `test`, `test_client`, `test_calibration`, and `test_calibration_client` executables should be in the `tests` directory, the `simple_client` and `spatter` executables should be in the `clients` directory, and the `scoria` executable should be in the base build directory. When built with internal timing enabled, the `test_client` and `test_calibration_client` executables, when run with the `scoria` controller, should output both internal and external bandwidth measurements and timings.
Tests for 0, 1, and 2 levels of indirection are implemented. They come in the following flavors:
- `str` uses straight access, meaning each index array `a` satisfies `a[i] = i` at all levels of indirection (this is the only test available for 0 levels of indirection).
- `A` or `noA` denotes whether aliases are included. If aliases are included, they are added before the shuffle stage (see below). For each index, a random number is drawn, and if it is below the alias fraction, this index is inserted at a random position in the indirection indices. This is done for all levels of indirection.
- `F` or `C` denotes full or clustered shuffle and aliases. Full shuffle means the indices are shuffled across the entire range and aliases, if used, are inserted across the entire range. In clustered mode, the shuffle and aliasing happen only within consecutive clusters of the given size. For example, with a cluster size `S = 32`, the first cluster is indices 0 - 31; aliases within this group are added and only these indices are shuffled amongst themselves. The next cluster is 32 - 63, and any aliases added to this cluster are all indices within this cluster before they are shuffled amongst themselves.
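
The flavors above can be sketched as a single index generator. This is an illustration under stated assumptions rather than the code used by the Scoria tests: the function names are hypothetical, `rand()` stands in for whatever random source the tests use, and an alias is modeled as overwriting a random slot in the same cluster so that the array length stays fixed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Integer in [0, n), n > 0; rand() is used only to keep the sketch short. */
static size_t demo_rand_below(size_t n) { return (size_t)rand() % n; }

/* Fill idx[0..n) following the str / A|noA / F|C scheme described above.
 * cluster == n gives the full (F) variant; a smaller cluster gives C.
 * ASSUMPTION: an alias overwrites a random slot within the same cluster. */
static void demo_make_indices(size_t *idx, size_t n, size_t cluster,
                              double alias_frac, int shuffle) {
  if (cluster == 0 || cluster > n)
    cluster = n; /* treat an invalid cluster size as "full" */

  for (size_t i = 0; i < n; i++)
    idx[i] = i; /* straight access: idx[i] = i */

  for (size_t start = 0; start < n; start += cluster) {
    size_t len = (n - start < cluster) ? n - start : cluster;
    size_t *c = idx + start;

    /* Alias stage (before the shuffle): copy index i over a random slot. */
    for (size_t i = 0; i < len; i++)
      if ((double)rand() / ((double)RAND_MAX + 1.0) < alias_frac)
        c[demo_rand_below(len)] = start + i;

    /* Shuffle stage: Fisher-Yates restricted to this cluster. */
    if (shuffle)
      for (size_t i = len; i > 1; i--) {
        size_t j = demo_rand_below(i);
        size_t tmp = c[i - 1];
        c[i - 1] = c[j];
        c[j] = tmp;
      }
  }
}
```

Calling `demo_make_indices(idx, n, n, 0.0, 0)` reproduces the `str` case, while `demo_make_indices(idx, n, 32, 0.1, 1)` corresponds to a clustered run with aliases where every index stays inside its own block of 32.
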
Under the `tests` directory in the build directory, there are four executables. Each is run by specifying the number of doubles to test on, e.g. `./test 8388608`:
- `test` runs the test suite without using the client and controller infrastructure; it tests the kernels directly.
- `test_client` runs the tests as a client and communicates with the controller, so a controller must be running.
- `test_calibrate` performs a STREAM-like benchmark for baselining and runs the 0-level indirection test without using the client and controller infrastructure; it tests the kernels directly. Requires OpenMP for the STREAM-like benchmark.
- `test_calibrate_client` performs a STREAM-like benchmark for baselining and runs the 0-level indirection test as a client that communicates with the controller, so a controller must be running. Currently has experimental code to re-map pages to particular NUMA nodes. Requires OpenMP for the STREAM-like benchmark.
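
For readers unfamiliar with the term, a STREAM-like baseline measures how fast simple contiguous kernels move memory, which gives a reference point for the indirect-access results. The sketch below shows one such kernel (triad) with a basic timer. It is not the calibration code from `tests/calibration.c`; the function names are hypothetical, and the OpenMP pragma is simply ignored when the file is compiled without `-fopenmp`.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime under strict -std modes */
#include <stddef.h>
#include <time.h>

/* STREAM-style triad: a[i] = b[i] + scalar * c[i]. */
static void demo_triad(double *a, const double *b, const double *c,
                       double scalar, size_t n) {
#pragma omp parallel for
  for (size_t i = 0; i < n; i++)
    a[i] = b[i] + scalar * c[i];
}

/* Time one triad pass and return the achieved bandwidth in bytes/second.
 * Three arrays of n doubles are touched (two read, one written). */
static double demo_triad_bandwidth(double *a, const double *b,
                                   const double *c, size_t n) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  demo_triad(a, b, c, 3.0, n);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double secs = (double)(t1.tv_sec - t0.tv_sec) +
                1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
  /* Guard against a zero interval on very small inputs. */
  return secs > 0.0 ? 3.0 * (double)n * sizeof(double) / secs : 0.0;
}
```

A real calibration run would use arrays much larger than the last-level cache and repeat the kernel several times, reporting the best pass.
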
To add your own clients, use `clients/simple/simple_client.c` as a starting point. At a minimum you will need to initialize and clean up the client as follows:
#include "scoria.h"
int main(int argc, char **argv) {
  struct client client;
  client.chatty = 0;
  scoria_init(&client);
  // Your code here
  scoria_cleanup(&client);
  return 0;
}
Allocate usable shared memory between the client and Scoria with `shm_malloc(size_t s)`:
double *A = shm_malloc(1024 * sizeof(double));
The following commands can be used to perform gathers (reads) or scatters (writes) with 0, 1, or 2 levels of indirection:
void scoria_write(struct client *client, void *buffer, const size_t N, const void *input, const size_t *ind1, const size_t *ind2, size_t num_threads, i_type intrinsics, struct request *req)
void scoria_read(struct client *client, const void *buffer, const size_t N, void *output, const size_t *ind1, const size_t *ind2, size_t num_threads, i_type intrinsics, struct request *req)
void scoria_quit(struct client *client, struct request *req)
The available intrinsics are: `NONE`, `AVX`, and `SVE`.
Read and write requests are handled asynchronously by Scoria. To block until a request has completed, use:
void wait_request(struct client *client, struct request *req)
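
To clarify what a read or write with 0, 1, or 2 levels of indirection computes, here is a plain C reference sketch that runs without the controller. It rests on two assumptions that are not confirmed by the text above: a `NULL` index array means that level is unused, and two levels compose as `ind2[ind1[i]]`. The `demo_` functions are hypothetical and are not part of the Scoria API.

```c
#include <stddef.h>

/* Resolve element i through up to two index arrays.
 * ASSUMPTION: a NULL index array means that level is unused, and two
 * levels compose as ind2[ind1[i]]. */
static size_t demo_resolve(size_t i, const size_t *ind1, const size_t *ind2) {
  if (ind1)
    i = ind1[i];
  if (ind2)
    i = ind2[i];
  return i;
}

/* Gather (read): output[i] = buffer[resolved index]. */
static void demo_gather(const double *buffer, size_t n, double *output,
                        const size_t *ind1, const size_t *ind2) {
  for (size_t i = 0; i < n; i++)
    output[i] = buffer[demo_resolve(i, ind1, ind2)];
}

/* Scatter (write): buffer[resolved index] = input[i]. */
static void demo_scatter(double *buffer, size_t n, const double *input,
                         const size_t *ind1, const size_t *ind2) {
  for (size_t i = 0; i < n; i++)
    buffer[demo_resolve(i, ind1, ind2)] = input[i];
}
```

A serial reference like this is also handy for checking results: run the same indices through it and compare against the buffer returned once `wait_request` has completed.
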
Client | Description | Directory | Status |
---|---|---|---|
Simple | Minimal client that demonstrates read/write/quit using shared memory | clients/simple | Complete |
Spatter | Microbenchmark for timing gather/scatter kernels (Spatter) | clients/spatter | Complete |
Minimal Spatter | Minimal Spatter client that removes argtable and other dependencies | clients/minimal_spatter | In Progress |
Ume | Flag Proxy which attempts to capture memory access patterns, kernels, and mesh structure (Ume) | clients/UME | Complete |
EAPPAT | Memory access and iteration patterns from the EAP code base with the physics removed (EAP Patterns) | clients/eappat | Coming Soon |
Terminal window 1
./scoria
Terminal window 2
./tests/test_client 1048576
On nodes with multiple CPU sockets, bandwidth can be drastically reduced if the client and controller processes are bound to different NUMA nodes. To explicitly bind the processes to the same socket, use the following:
Terminal window 1
hwloc-bind node:0 ./scoria
Terminal window 2
hwloc-bind node:0 ./tests/test_client 1048576
Note: To use the scripts, Scoria must have been built without internal timing, i.e. -DScoria_REQUIRE_TIMING=OFF
`scripts/simple_test_bw.py` contains a script to launch both Scoria and the test client. It is configurable with the following options:
Short Option | Long Option | Description | Default |
---|---|---|---|
-l | --logfile | Logfile name | client.log |
-p | --plotfile | Plot file names (see `plot_test_bw.py`) | bw.png |
-n | --size | Number of doubles to pass to `test_client` | 1048576 |
-s | --bindscoria | hwloc-bind options for Scoria | None |
-b | --bindclient | hwloc-bind options for `test_client` | None |
The output will be a log file with the bandwidth data and bar charts of the bandwidth for each test at each thread count. If AVX or SVE is enabled, those results will be saved to an individual figure with the appropriate name.
`scripts/scoria-vs-ume.sh` contains a script to build Ume + Scoria with AVX and Caliper enabled, and to build a standalone Ume executable with Caliper. It then runs both with an Ume input file of your choosing and with the specified number of ranks, and outputs profiling data in the form of a text file or a JSON file that can be read by Hatchet. It is configurable with the following options:
Short Option | Description | Default |
---|---|---|
-c | (Optional) CALI_CONFIG setting | runtime-report(output=report.log) |
-f | Absolute path to Input Deck for Ume | None |
-n | (Optional) Number of ranks to use to launch MPI run | 1 |
-s | (Optional) Scoria root directory | pwd |
-p | (Optional) List of PAPI Counters to collect | None |
cd scoria
python3 scripts/simple_test_bw.py -l output.log -p scoria.png -n 8388608 -s node:0 -b node:0
bash scripts/scoria-vs-ume.sh -c "hatchet-region-profile" -n <num-ranks> -f <absolute-path-to-input-deck>
bash scripts/scoria-vs-ume.sh -n <num-ranks> -f <absolute-path-to-input-deck> -p "PAPI_DP_OPS,PAPI_TOT_CYC,PAPI_TOT_INS,PAPI_LD_INS,PAPI_SR_INS,PAPI_BR_INS,PAPI_LST_INS"
Triad National Security, LLC (Triad) owns the copyright to Scoria. The license is BSD-ish with a "modifications must be indicated" clause. See LICENSE for the full text.
- Jered Dominguez-Trujillo, [email protected]
- Jonas Lippuner, [email protected]
- Neel Patel, [email protected]