Implementation and benchmarking of optimization techniques and algorithms applied to the matrix multiplication problem in a CPU/GPU multithreaded environment.
This was an undergraduate research project carried out at the Federal University of Rio de Janeiro (UFRJ) from August 2013 to July 2014.
Optimization techniques:
- Loop unrolling
- Loop blocking
- Z-order memory layout
- SIMD (Single Instruction, Multiple Data) instructions

Parallelization:
- OpenMP (CPU)
- Pthreads (CPU)
- OpenCL (GPU)

Algorithms:
- Standard
- Strassen
- Recursive

Libraries:
- OpenBLAS (CPU)
- AMD APP SDK (GPU)
- clMath (GPU)
The program test-mult is responsible for setting up the test cases, executing the different implementations, retrieving timing results and generating the final report. Run './test-mult M K N' to start the program, where M, K and N are the dimensions of the input matrices [M,K] and [K,N] (e.g. './test-mult 1024 1024 1024'). The other parameters are:
- Element data type (int, float, double)
- Blocking size
- Compiler optimization level (-O0, -O1, -O2, -O3, -Ofast -march=native -mfpmath=sse -frename-registers)
- Execution repeat count
- Random number generator range
- Algorithm selection
All selected algorithms/implementations are compiled using the chosen compiler optimization level. Each one is then executed via an execv() call, and the test-mult execution resumes only after its child process has finished. Timings are collected for the reading, processing and writing phases. gettimeofday() is used for the multithreaded and library implementations, while getrusage() is used on some single-threaded programs for more precise timing. Both timing options are reliable. Not all algorithms run under every combination of parameters (e.g. the int data type with clMath), so those won't appear in the algorithm selection.
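As an illustration, here is a minimal sketch of how one phase might be timed with gettimeofday(); the function names are hypothetical and not taken from the actual code:

```c
#include <sys/time.h>

/* Returns the elapsed wall-clock time of one phase in seconds.
 * run_phase() stands in for the actual reading/processing/writing code. */
double time_phase(void (*run_phase)(void))
{
    struct timeval start, end;

    gettimeofday(&start, NULL);   /* timestamp before the phase */
    run_phase();
    gettimeofday(&end, NULL);     /* timestamp after the phase */

    return (end.tv_sec - start.tv_sec) +
           (end.tv_usec - start.tv_usec) / 1e6;
}
```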
After gathering and analyzing the results of multiple tests, I reached some conclusions.
- All algorithms benefit from compiler optimization. There is a considerable performance gain when comparing -O2 against -O0 processing timings, but only minor differences between -O2, -O3 and -Ofast.
- There is a small gain when increasing the cache blocking size from 32 to 64. Going from 64 to 128 gives slight improvements, no effect, or worse performance, depending on the algorithm.
- The best option for multiplying matrices up to dimensions [2048,2048] is the OpenBLAS implementation using multiple threads. Bigger matrices perform better in the GPU environment, where the clMath program achieves the best processing time.
This chart represents the total timing for each algorithm execution. OpenBLAS achieved the fastest total runtime, but it is actually slower than clMath in processing time. The last three GPU implementations show a slower writing time; the reason is still unknown to me, which is odd because the other tests show no such slowdown in the writing phase for the GPU. You can check all the benchmark files in the 'bench' folder.
There are 19 programs in total.
standardMult:
Basic 3 nested loops.
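For reference, the core of this version looks like the classic triple loop below (a sketch assuming row-major double matrices; the indexing conventions are mine, not necessarily the repository's):

```c
/* C[M][N] = A[M][K] * B[K][N], row-major, C assumed zero-initialized */
void standard_mult(int M, int K, int N,
                   const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```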
optimizedMult:
Cache blocking (parameter) and loop unrolling x5.
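A simplified sketch of combining cache blocking with x5 unrolling; the real program takes the blocking size as a parameter and is more general, while this sketch assumes all dimensions are multiples of BS:

```c
#define BS 64  /* blocking size; a command-line parameter in test-mult */

/* Blocked multiply with the innermost loop unrolled x5.
 * A remainder loop handles the case where BS is not a multiple of 5. */
void blocked_unrolled_mult(int M, int K, int N,
                           const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < M; ii += BS)
        for (int kk = 0; kk < K; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i * K + k];
                        int j;
                        for (j = jj; j + 4 < jj + BS; j += 5) {
                            C[i * N + j]     += a * B[k * N + j];
                            C[i * N + j + 1] += a * B[k * N + j + 1];
                            C[i * N + j + 2] += a * B[k * N + j + 2];
                            C[i * N + j + 3] += a * B[k * N + j + 3];
                            C[i * N + j + 4] += a * B[k * N + j + 4];
                        }
                        for (; j < jj + BS; j++)   /* remainder */
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```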
optimizedMult2:
Cache blocking (parameter) and SIMD instructions (x8 float, x4 double).
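The 8-float/4-double width suggests 256-bit AVX vectors. Here is a minimal sketch of the float inner kernel with AVX intrinsics, assuming N is a multiple of 8 and 32-byte-aligned arrays (not necessarily how the repository's code is organized):

```c
#include <immintrin.h>

/* One (i,k) step of the inner kernel vectorized with AVX:
 * C[i][*] += A[i][k] * B[k][*], 8 floats per instruction. */
static void simd_row_update(int N, float a, const float *Brow, float *Crow)
{
    __m256 va = _mm256_set1_ps(a);            /* broadcast A[i][k] */
    for (int j = 0; j < N; j += 8) {
        __m256 vb = _mm256_load_ps(&Brow[j]); /* 8 elements of B[k][*] */
        __m256 vc = _mm256_load_ps(&Crow[j]); /* 8 elements of C[i][*] */
        vc = _mm256_add_ps(vc, _mm256_mul_ps(va, vb));
        _mm256_store_ps(&Crow[j], vc);
    }
}
```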
optimizedMultParallel:
Cache blocking (parameter), loop unrolling x5 and OpenMP 'parallel for' on the outermost loop.
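Stripped of the blocking and unrolling, the parallelization boils down to a sketch like this:

```c
#include <omp.h>

/* Each thread gets a chunk of rows of C, so no two threads
 * ever write to the same output element. */
void parallel_mult(int M, int K, int N,
                   const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++) {
            double a = A[i * K + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}
```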
optimizedMultParallel2:
Cache blocking (parameter), SIMD instructions (x8 float, x4 double) and OpenMP 'parallel for' on the outermost loop.
strassenMult:
Cache blocking (parameter) and loop unrolling x5. Works only on square matrices whose dimension is a power of two. Memory-saving implementation that uses the final matrix as temporary storage.
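For reference, Strassen's method splits each matrix into four quadrants and replaces the eight quadrant multiplications of the standard method with seven:

```
M1 = (A11 + A22) (B11 + B22)        C11 = M1 + M4 - M5 + M7
M2 = (A21 + A22) B11                C12 = M3 + M5
M3 = A11 (B12 - B22)                C21 = M2 + M4
M4 = A22 (B21 - B11)                C22 = M1 - M2 + M3 + M6
M5 = (A11 + A12) B22
M6 = (A21 - A11) (B11 + B12)
M7 = (A12 - A22) (B21 + B22)
```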
strassenMult2:
Cache blocking (parameter) and SIMD instructions (x8 float, x4 double). Works only on square matrices whose dimension is a power of two. Memory-saving implementation that uses the final matrix as temporary storage.
strassenMultParallel:
Cache blocking (parameter), loop unrolling x5 and Pthreads-based parallel execution, one thread per matrix quadrant (see the sketch below). Works only on square matrices whose dimension is a power of two. Memory-saving implementation that uses the final matrix as temporary storage.
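A heavily simplified sketch of the quadrant-level parallelism with Pthreads; the worker function and its argument struct are hypothetical placeholders for the actual Strassen combination steps:

```c
#include <pthread.h>

/* Hypothetical per-quadrant worker argument. */
typedef struct { int quadrant; /* plus matrix pointers and sizes */ } quad_arg;

static void *compute_quadrant(void *p)
{
    quad_arg *arg = p;
    /* ... combine the precomputed Strassen products into this quadrant ... */
    (void)arg;
    return NULL;
}

void strassen_parallel_combine(void)
{
    pthread_t tid[4];
    quad_arg args[4];

    for (int q = 0; q < 4; q++) {        /* one thread per quadrant of C */
        args[q].quadrant = q;
        pthread_create(&tid[q], NULL, compute_quadrant, &args[q]);
    }
    for (int q = 0; q < 4; q++)
        pthread_join(tid[q], NULL);       /* wait for all quadrants */
}
```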
strassenMultParallel2:
Cache blocking (parameter), SIMD instructions (x8 float, x4 double) and Pthreads-based parallel execution, one thread per matrix quadrant. Works only on square matrices whose dimension is a power of two. Memory-saving implementation that uses the final matrix as temporary storage.
recursiveMult:
Cache blocking (parameter) and loop unrolling x5. Z-order memory layout for better cache usage.
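For context, a Z-order (Morton) layout interleaves the bits of the row and column indices, so recursively nested quadrants end up contiguous in memory. A minimal sketch of the index computation, assuming 16-bit coordinates:

```c
#include <stdint.h>

/* Interleaves the bits of (row, col) to produce a Z-order (Morton) index,
 * so that recursively nested quadrants occupy contiguous memory ranges. */
static uint32_t z_order_index(uint16_t row, uint16_t col)
{
    uint32_t idx = 0;
    for (int b = 0; b < 16; b++) {
        idx |= (uint32_t)((col >> b) & 1) << (2 * b);      /* even bits */
        idx |= (uint32_t)((row >> b) & 1) << (2 * b + 1);  /* odd bits  */
    }
    return idx;
}
```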
recursiveMult2:
Cache blocking (parameter) and SIMD instructions (x8 float, x4 double). Z-order memory layout for better cache usage.
recursiveMultParallel:
Cache blocking (parameter), loop unrolling x5 and Pthreads-based parallel execution that generates tasks and executes them through a task scheduler. Z-order memory layout for better cache usage.
recursiveMultParallel2:
Cache blocking (parameter), SIMD instructions (x8 float, x4 double) and Pthreads-based parallel execution that generates tasks and executes them through a task scheduler. Z-order memory layout for better cache usage.
openBlas:
OpenBLAS implementation using one thread.
More info: http://www.openblas.net/
openBlasParallel:
OpenBLAS implementation using multiple threads.
More info: http://www.openblas.net/
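Both variants presumably go through the standard CBLAS GEMM interface; a sketch for double precision, where openblas_set_num_threads() is OpenBLAS's thread-count control and the thread count shown is illustrative:

```c
#include <cblas.h>

/* C = 1.0 * A * B + 0.0 * C, row-major, A is [M,K], B is [K,N] */
void openblas_mult(int M, int K, int N,
                   const double *A, const double *B, double *C)
{
    openblas_set_num_threads(4);  /* multithreaded variant; 1 for single */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0, A, K,        /* lda = K */
                     B, N,        /* ldb = N */
                0.0, C, N);       /* ldc = N */
}
```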
standardOpenCL:
Standard OpenCL implementation. Each work-item (GPU thread) computes the result for a single matrix cell.
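A minimal sketch of such a kernel in OpenCL C:

```c
__kernel void matmul(const int M, const int K, const int N,
                     __global const float *A,
                     __global const float *B,
                     __global float *C)
{
    int i = get_global_id(0);   /* row of C */
    int j = get_global_id(1);   /* column of C */
    if (i >= M || j >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; k++)
        acc += A[i * K + k] * B[k * N + j];
    C[i * N + j] = acc;
}
```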
optimizedOpenCL:
OpenCL implementation with blocking and use of work-group local memory.
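A simplified sketch of the tiling idea, assuming square matrices whose dimension is a multiple of the tile size (the actual kernel takes the blocking size as a parameter):

```c
#define TILE 16

__kernel void matmul_tiled(const int dim,
                           __global const float *A,
                           __global const float *B,
                           __global float *C)
{
    __local float Asub[TILE][TILE];  /* tile of A in work-group local memory */
    __local float Bsub[TILE][TILE];  /* tile of B in work-group local memory */

    int row = get_global_id(0);
    int col = get_global_id(1);
    int lr  = get_local_id(0);
    int lc  = get_local_id(1);

    float acc = 0.0f;
    for (int t = 0; t < dim / TILE; t++) {
        /* cooperatively stage one tile of A and B into local memory */
        Asub[lr][lc] = A[row * dim + t * TILE + lc];
        Bsub[lr][lc] = B[(t * TILE + lr) * dim + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; k++)
            acc += Asub[lr][k] * Bsub[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * dim + col] = acc;
}
```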
AMDopenCL:
Implementation that ships with the AMD APP SDK v2.8.1.
More info: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
clMath:
AMD Accelerated Parallel Processing Math Library.
More info: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-math-libraries/