jinyangli/GNNBenchmark

Benchmark AMD CPU/GPU performance with GNN workload

This repository benchmarks sparse and dense kernel performance on AMD CPUs and GPUs using Graph Neural Network (GNN) workloads.

CPU Sparse Performance Benchmark

Benchmark DGL Minigun sparse kernels and MKL sparse kernels on AMD CPU and Intel CPU.

Install system packages

Building the tests and generating their input requires a number of packages, including build-essential, make, cmake, python3, and Python packages such as numpy, scipy, torch, and dgl.

The list above is incomplete. We provide a Dockerfile that builds a container with all required dependencies, and we recommend using it. Check out how to build the container and run it in docker.

Install MKL

Download and install MKL for C/C++. This benchmark repository was tested with MKL_2020.1.217.

After installation, set up the environment variables for MKL:

export MKLROOT=/path/to/mkl
export CPATH=$CPATH:$MKLROOT/include
export LIBRARY_PATH=$LIBRARY_PATH:$MKLROOT/lib/intel64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKLROOT/lib/intel64

When testing on an AMD CPU, run the following command to improve MKL performance, as suggested in this post:

export MKL_DEBUG_CPU_TYPE=5

Build tests

First, git clone this repository and change directory into it. Then initialize submodules with

git submodule update --init --recursive

Then build the tests:

cd /path/to/this/repository
mkdir build
cd build
cmake ..
make -j$(nproc)

Generate input graph

cd /path/to/this/repository
cd scripts

The following commands use the DGL package to download the Reddit dataset and serialize its social graph for testing:

mkdir bench-graphs
python3 gen_dgl_graph.py -o bench-graphs/reddit.grh --dataset=reddit-self-loop

To try more graphs, follow the approach in gen_dgl_graph.py to serialize your own. DGL also has built-in support for a large set of datasets. Check out the list here.

Run tests

cd /path/to/this/repository

The executable we built earlier is at ./build/tests/cpu_spmm. It takes two arguments: the input graph file and the node feature size. The test code converts the input graph to a sparse matrix (A) in CSR format, creates a randomly initialized node feature tensor (H) of size (num_nodes, node_feature_size), performs Sparse Matrix Multiplication (SpMM) between A and H, and measures the execution time.
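To make the operation concrete, here is a minimal pure-Python sketch of building a CSR adjacency matrix from an edge list and multiplying it with a dense feature matrix. It is not the repository's C++ code: the function names `edges_to_csr` and `spmm`, the toy edge list, and the convention that row i holds the in-neighbors of node i are all assumptions made for illustration.

```python
import random

def edges_to_csr(num_nodes, edges):
    """Convert (src, dst) edge pairs into CSR arrays (indptr, indices).

    Assumed convention: row i lists the sources of edges pointing to i,
    so A @ H sums the features of each node's in-neighbors.
    """
    indptr = [0] * (num_nodes + 1)
    for _, dst in edges:
        indptr[dst + 1] += 1              # count entries per row
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]        # prefix sum gives row offsets
    indices = [0] * len(edges)
    cursor = indptr[:-1].copy()           # next free slot in each row
    for src, dst in edges:
        indices[cursor[dst]] = src
        cursor[dst] += 1
    return indptr, indices

def spmm(indptr, indices, H):
    """Compute A @ H for an unweighted CSR adjacency matrix A."""
    feat = len(H[0])
    out = [[0.0] * feat for _ in range(len(indptr) - 1)]
    for row in range(len(indptr) - 1):
        acc = out[row]
        for k in range(indptr[row], indptr[row + 1]):
            neighbor = H[indices[k]]
            for f in range(feat):
                acc[f] += neighbor[f]
    return out

# Toy graph with 4 nodes and a feature size of 16.
edges = [(0, 1), (2, 1), (1, 2), (3, 0)]
indptr, indices = edges_to_csr(4, edges)
random.seed(0)
H = [[random.random() for _ in range(16)] for _ in range(4)]
out = spmm(indptr, indices, H)
```

The inner loops touch one row of H per nonzero, which is why the cost of SpMM grows with both the edge count and the feature size.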

If you are testing on an AMD CPU, first run:

export MKL_DEBUG_CPU_TYPE=5

Now, run the test:

./build/tests/cpu_spmm scripts/bench-graphs/reddit.grh 16

The test code checks the result for correctness, warms up by executing the SpMM 10 times, then runs it another 10 times and reports the average execution time in milliseconds.
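The warm-up-then-measure procedure can be sketched as follows. This is an illustrative Python helper, not the repository's test code: the name `benchmark` and its parameters are hypothetical, and the correctness check is left out.

```python
import time

def benchmark(fn, warmup=10, repeats=10):
    """Return the average wall-clock time of fn() in milliseconds.

    The warm-up calls are not timed; they let caches, thread pools and
    CPU clocks settle before the measured runs.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats

# Stand-in workload; the real test times the SpMM kernel instead.
calls = []
avg_ms = benchmark(lambda: calls.append(sum(range(10_000))))
print(f"average: {avg_ms:.3f} ms over 10 runs")
```

Timing the whole batch of runs and dividing, rather than timing each run separately, keeps timer overhead out of the measurement.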

Results

The tables below show the results we obtained on the Reddit graph (232965 nodes, 114848857 edges) by following the steps above on AWS machines. For the Intel CPU we used a p3.8xlarge instance, and for the AMD CPU we used an m5a.8xlarge instance. Both have 32 virtual cores.

For the Minigun SpMM kernel, the execution time in milliseconds:

| Feature size | AMD (ms) | Intel (ms) |
|---|---|---|
| 16 | 1839.530 | 1324.340 |
| 32 | 2985.770 | 2380.760 |
| 64 | 4837.950 | 4560.380 |
| 128 | 9550.330 | 8952.170 |

For the MKL SpMM kernel, the execution time in milliseconds:

| Feature size | AMD (ms) | Intel (ms) |
|---|---|---|
| 16 | 277.550 | 114.241 |
| 32 | 552.329 | 101.318 |
| 64 | 1051.990 | 196.756 |
| 128 | 1958.280 | 670.561 |
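One way to read the two CPU tables together is to compute how much faster MKL is than Minigun on each CPU, and how the AMD time compares with the Intel time for the MKL kernel. The short script below does only that arithmetic on the numbers reported above. The variable names are illustrative.

```python
# Execution times in milliseconds, copied from the two CPU tables.
minigun_ms = {
    16: {"AMD": 1839.530, "Intel": 1324.340},
    32: {"AMD": 2985.770, "Intel": 2380.760},
    64: {"AMD": 4837.950, "Intel": 4560.380},
    128: {"AMD": 9550.330, "Intel": 8952.170},
}
mkl_ms = {
    16: {"AMD": 277.550, "Intel": 114.241},
    32: {"AMD": 552.329, "Intel": 101.318},
    64: {"AMD": 1051.990, "Intel": 196.756},
    128: {"AMD": 1958.280, "Intel": 670.561},
}

# Speedup of MKL over Minigun on the same CPU (larger means MKL is faster).
mkl_speedup = {
    feat: {cpu: minigun_ms[feat][cpu] / mkl_ms[feat][cpu] for cpu in ("AMD", "Intel")}
    for feat in mkl_ms
}

# Ratio of AMD time to Intel time for the MKL kernel (larger means AMD is slower).
amd_vs_intel_mkl = {feat: t["AMD"] / t["Intel"] for feat, t in mkl_ms.items()}

for feat in sorted(mkl_ms):
    print(
        f"feature size {feat:>3}: "
        f"MKL speedup AMD {mkl_speedup[feat]['AMD']:.2f}x, "
        f"Intel {mkl_speedup[feat]['Intel']:.2f}x, "
        f"AMD/Intel MKL time {amd_vs_intel_mkl[feat]:.2f}"
    )
```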

GPU Sparse Performance Benchmark

The scripts in tests-gpu benchmark the performance of Sparse Matrix Multiplication (SpMM) on AMD and NVIDIA GPUs. The machines we used for the benchmark are:

  • AMD:
    • CPU: AMD EPYC 7452 32-Core Processor (128 virtual cores), 1.5GHz (max 2.35GHz), 1TB memory
    • GPU: Vega 20 [Radeon VII]: single precision 13.44 TFLOPS, 16GB HBM2 memory, Bandwidth 1,024 GB/s, Memory Bus 4096 bit
  • Intel / NVIDIA:
    • CPU: Intel(R) Core(TM) i7-9700 CPU (8 physical cores, 8 virtual cores), 3.0GHz (max 4.6GHz), 32GB memory
    • GPU: NVIDIA GeForce RTX 2080: single precision 10.07 TFLOPS, 8GB GDDR6 memory, Bandwidth 448.0 GB/s, Memory Bus 256 bit

Sparse Matrix Multiplication (SpMM) kernel

tests-gpu/bench_spmm.py measures the average execution time (in milliseconds) over 100 runs, after warming up with another 100 runs. Below are the results using the Reddit dataset as the sparse graph:

| Feature size | AMD (ms) | Intel / NVIDIA (ms) |
|---|---|---|
| 16 | 17.599 | 35.434 |
| 32 | 24.150 | 42.041 |
| 64 | 43.302 | 79.118 |
| 128 | 93.180 | 156.993 |
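To put these times on a common scale, they can be converted into an approximate arithmetic throughput. The sketch below assumes that the number of nonzeros in A equals the Reddit edge count quoted earlier (114848857), that each nonzero costs one multiply and one add per feature, and that the second column was measured on the NVIDIA RTX 2080. None of these assumptions is stated by the benchmark script itself, so treat the output as a rough estimate.

```python
NUM_EDGES = 114_848_857  # assumed to equal the nonzero count of A

# Average SpMM time in milliseconds, copied from the table above.
spmm_ms = {
    16: {"AMD": 17.599, "NVIDIA": 35.434},
    32: {"AMD": 24.150, "NVIDIA": 42.041},
    64: {"AMD": 43.302, "NVIDIA": 79.118},
    128: {"AMD": 93.180, "NVIDIA": 156.993},
}

def spmm_gflops(nnz, feature_size, time_ms):
    """Approximate GFLOP/s, counting 2 flops per nonzero per feature."""
    flops = 2.0 * nnz * feature_size
    return flops / (time_ms * 1e-3) / 1e9

gflops = {
    feat: {gpu: spmm_gflops(NUM_EDGES, feat, ms) for gpu, ms in row.items()}
    for feat, row in spmm_ms.items()
}

for feat in sorted(gflops):
    print(f"feature size {feat:>3}: " +
          ", ".join(f"{gpu} {val:.1f} GFLOP/s" for gpu, val in gflops[feat].items()))
```

SpMM is typically limited by memory bandwidth rather than arithmetic, so these figures are expected to sit far below the peak TFLOPS listed for each GPU.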

End to end training time of Graph Convolution Network

tests-gpu/gcn.py measures the average epoch time (in seconds) of training a 2-layer GCN on the Reddit dataset, with input feature size 602, output feature size 41, and varying hidden layer sizes. The accuracy numbers show the mean and standard deviation over 10 runs.

| Hidden size | AMD epoch time (s) | AMD accuracy (%) | NVIDIA epoch time (s) | NVIDIA accuracy (%) |
|---|---|---|---|---|
| 16 | 0.0692 | 78.99 ± 3.43 | 0.1811 | 78.46 ± 5.31 |
| 32 | 0.0762 | 90.21 ± 1.52 | 0.1886 | 88.42 ± 3.47 |
| 64 | 0.0982 | 92.62 ± 0.27 | 0.2270 | 92.51 ± 0.59 |
| 128 | 0.1520 | 93.24 ± 0.09 | 0.3078 | 93.24 ± 0.12 |
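For readers unfamiliar with the model, the forward pass of a 2-layer GCN can be sketched in a few lines of pure Python. This follows the commonly used formulation with a symmetrically normalized adjacency matrix and a ReLU between layers. It is not taken from tests-gpu/gcn.py, which may differ in normalization, bias terms, or dropout. The toy sizes (6 input features, 4 hidden units, 3 classes) stand in for the 602, hidden size, and 41 used in the benchmark, and all function names are hypothetical.

```python
import random

def matmul(X, W):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def aggregate(adj, H):
    """Normalized neighbor sum, D^-1/2 A D^-1/2 H, for a symmetric adjacency list."""
    deg = [max(len(nbrs), 1) for nbrs in adj]
    out = []
    for i, nbrs in enumerate(adj):
        row = [0.0] * len(H[0])
        for j in nbrs:
            scale = (deg[i] * deg[j]) ** -0.5
            for f, value in enumerate(H[j]):
                row[f] += scale * value
        out.append(row)
    return out

def relu(X):
    return [[max(v, 0.0) for v in row] for row in X]

def gcn_forward(adj, X, W1, W2):
    """Two graph convolutions: aggregate, project, ReLU, aggregate, project."""
    hidden = relu(matmul(aggregate(adj, X), W1))
    return matmul(aggregate(adj, hidden), W2)

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
# Undirected toy graph with self-loops, stored as neighbor lists.
adj = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
in_feats, hidden_size, num_classes = 6, 4, 3
X = rand_matrix(len(adj), in_feats)
logits = gcn_forward(adj, X, rand_matrix(in_feats, hidden_size),
                     rand_matrix(hidden_size, num_classes))
```

Each `aggregate` call is an SpMM, so one forward pass performs two of them: one at the input feature size and one at the hidden size.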

Additional Tests

Dense matrix multiplication

scripts/bench_dense_mm.py benchmarks the multiplication of two dense matrices of size 1000 by 1000 using PyTorch. To run this test on an Intel CPU or NVIDIA GPU, you need to install PyTorch. On AMD machines, for the CPU, use this docker file from an AMD-maintained fork of PyTorch that uses BLIS as its BLAS library, and for the GPU, use this recommended docker image.

We tested on the same machines described in the sparse kernel experiments above. The average execution time (in milliseconds) over 10 runs is shown below.

| | AMD (ms) | Intel / NVIDIA (ms) |
|---|---|---|
| CPU | 4.7 | 2.1-3.8 |
| GPU | 0.239 | 0.292 |

We suspect that the large variance in the Intel CPU timings is due to automatic CPU clock rate adjustment.

Alternatively, one could call MKL, BLIS, cuBLAS, and rocBLAS directly through their C/C++ interfaces to compare their performance.
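As a rough sanity check on the GPU numbers, the measured times can be compared with the peak single-precision throughput listed for each card. The sketch below assumes the matrices are single precision (PyTorch's default dtype; the script's actual dtype is not stated here) and counts 2·N³ floating-point operations for an N by N product.

```python
N = 1000
FLOPS_PER_MATMUL = 2 * N ** 3  # one multiply and one add per inner-product term

# Measured GPU time (ms) from the table above and peak single-precision
# TFLOPS from the machine descriptions earlier in this document.
gpus = {
    "AMD Radeon VII": {"ms": 0.239, "peak_tflops": 13.44},
    "NVIDIA RTX 2080": {"ms": 0.292, "peak_tflops": 10.07},
}

def achieved_tflops(time_ms):
    """Arithmetic throughput in TFLOPS for one 1000x1000 matrix product."""
    return FLOPS_PER_MATMUL / (time_ms * 1e-3) / 1e12

utilization = {}
for name, spec in gpus.items():
    tflops = achieved_tflops(spec["ms"])
    utilization[name] = tflops / spec["peak_tflops"]
    print(f"{name}: {tflops:.2f} TFLOPS, {utilization[name]:.0%} of listed peak")
```

Under these assumptions both GPUs reach a substantial fraction of their listed peak, which is the expected behavior for a compute-bound dense kernel.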
