LightNeuron

LightNeuron is a highly efficient, educational neural network library written in C for x86-64 architectures. It aims to provide insight into neural network mechanics, profiling, and optimization, with a special focus on the efficiency of General Matrix Multiply (GEMM) operations.

Overview

Targeted primarily at students, researchers, and developers, LightNeuron offers a CNN inference framework that loads HDF5 model files, which makes it straightforward to run models trained in frameworks such as PyTorch and TensorFlow. Key features include (a usage sketch follows the list):

  • Convolutional Layer Computation (conv())
  • Matrix Multiplication (matmul())
  • Activation Functions (relu())
  • Pooling (pooling())
  • Forward Pass Operations (forwardPass())
  • Feature Extraction and Interpretation
  • Prediction (softmax(), predict())
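
To illustrate how these pieces compose, here is a minimal, self-contained sketch of a forward pass built from simplified stand-ins for the primitives listed above. The implementations and signatures are illustrative assumptions only, not LightNeuron's actual code; conv() and pooling() are omitted for brevity, and in the real framework the weights are loaded from the HDF5 model file.

```c
/* Illustrative sketch only -- simplified stand-ins, not LightNeuron's code. */
#include <math.h>
#include <stdio.h>

/* C = A (m x k) * B (k x n), row-major. */
static void matmul(const float *A, const float *B, float *C, int m, int k, int n) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

static void relu(float *x, int n) {
    for (int i = 0; i < n; i++)
        if (x[i] < 0.0f) x[i] = 0.0f;
}

/* In-place softmax over n logits (numerically stabilized). */
static void softmax(float *x, int n) {
    float max = x[0], sum = 0.0f;
    for (int i = 1; i < n; i++) if (x[i] > max) max = x[i];
    for (int i = 0; i < n; i++) { x[i] = expf(x[i] - max); sum += x[i]; }
    for (int i = 0; i < n; i++) x[i] /= sum;
}

/* Index of the most probable class. */
static int predict(const float *probs, int n) {
    int best = 0;
    for (int i = 1; i < n; i++) if (probs[i] > probs[best]) best = i;
    return best;
}

int main(void) {
    /* A flattened 28x28 MNIST image pushed through one fully connected layer.
     * In the real framework, conv() and pooling() run first and the weights
     * come from the HDF5 model file. */
    float image[28 * 28] = {0};
    float weights[28 * 28 * 10] = {0};
    float logits[10];

    matmul(image, weights, logits, 1, 28 * 28, 10);
    relu(logits, 10);
    softmax(logits, 10);
    printf("predicted digit: %d\n", predict(logits, 10));
    return 0;
}
```

Compile with, e.g., gcc -O2 sketch.c -lm.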

[Figure: LightNeuron framework architecture]

Development Environment Specifications

LightNeuron is optimized for x86-64 architectures, ensuring compatibility and efficiency on a wide range of systems. Below are the specifications of the primary development environment, which can serve as a benchmark for expected performance:

Prerequisites

Ensure your system is ready for LightNeuron by installing the perf tool:

sudo apt-get install linux-tools-$(uname -r) linux-cloud-tools-$(uname -r)

Then configure /etc/sysctl.conf so that perf can read hardware counters without root privileges and the NMI watchdog does not reserve one of them:

kernel.perf_event_paranoid = -1
kernel.nmi_watchdog = 0

Activate the changes:

sudo sysctl -p

Getting Started

  1. Clone the Repository:

    git clone [repository-url]
  2. Download MNIST Dataset:

    python get_data.py
  3. Compile and Run Labs:

    make lab && ./lab

Performance Profiling

Profile GEMM operations with specific targets and cache levels:

make perf TARGET=[your-target] CACHE_LEVEL=[your-cache-level] USE_PMU=1

  • Replace TARGET with the GEMM implementation to profile (e.g., matmul_naive).
  • Set CACHE_LEVEL to the desired cache level (e.g., L1, L2, or L3).

Example:

make perf TARGET=matmul_naive CACHE_LEVEL=L1 USE_PMU=1

USE_PMU=1 activates the Performance Monitoring Unit for detailed hardware-level performance insights.

GEMM Optimization

LightNeuron places a strong emphasis on optimizing General Matrix Multiply (GEMM) operations, and these optimizations translate into substantial gains in GFLOPS (giga floating-point operations per second) across a range of matrix dimensions. Key strategies include the following (illustrative sketches follow the list):

  • Loop Interchange: Reorders nested loops to improve memory access patterns and cache behavior, e.g., ijk -> kji.
  • Compiler Optimization Flags: Employs -O2/-O3 levels for code efficiency.
  • Parallel Loops: Uses OpenMP directives to distribute loop execution across multiple CPU threads.
  • Loop Tiling (Blocking): Optimizes spatial and temporal locality for caches.
  • Divide-and-Conquer: Splits large matrices into smaller sub-matrices for better cache performance.
  • SIMD Intrinsics with Data Alignment: Uses AVX2 instructions and aligns data to boost vectorized operations and memory throughput.
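
As a rough illustration of loop interchange, parallel loops, and tiling (not LightNeuron's actual kernels, whose loop order and block sizes may differ depending on data layout), the sketch below assumes row-major N x N matrices with C zero-initialized:

```c
/* Generic illustration for row-major N x N matrices; compile with -fopenmp. */

/* Naive ijk order: the innermost loop walks down a column of B (stride N),
 * which thrashes the cache for large N. */
void matmul_ijk(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Interchanged ikj order: with j innermost, B and C are accessed with
 * stride 1, so cache lines and the hardware prefetcher are used fully.
 * The outer loop is also a safe target for OpenMP parallelization, since
 * each i writes a distinct row of C. */
void matmul_ikj(const float *A, const float *B, float *C, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float a = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}

/* Tiled (blocked) version: operate on T x T sub-blocks so the active pieces
 * of A, B, and C stay resident in cache while they are reused. */
#define T 64  /* block size; tune so roughly 3 * T * T * sizeof(float) fits in cache */
void matmul_tiled(const float *A, const float *B, float *C, int N) {
    #pragma omp parallel for
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T && i < N; i++)
                    for (int k = kk; k < kk + T && k < N; k++) {
                        float a = A[i * N + k];
                        for (int j = jj; j < jj + T && j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```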

Together, these techniques considerably raise the CPU's sustained throughput on matrix multiplication, as the cache statistics and benchmarks below show.
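
The fragment below similarly sketches the SIMD strategy: 32-byte-aligned buffers plus AVX2 fused multiply-add intrinsics that process eight floats per instruction. It is an illustrative assumption rather than LightNeuron's actual vectorized kernel, which additionally combines blocking or divide-and-conquer with loop coarsening.

```c
/* Illustrative AVX2 inner kernel; compile with -mavx2 -mfma. Assumes row-major
 * N x N matrices with N a multiple of 8, 32-byte-aligned data, and C zeroed. */
#include <immintrin.h>
#include <stdlib.h>

void matmul_avx2(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            __m256 a = _mm256_set1_ps(A[i * N + k]);       /* broadcast A[i][k]      */
            for (int j = 0; j < N; j += 8) {
                __m256 b = _mm256_load_ps(&B[k * N + j]);  /* aligned load, 8 floats */
                __m256 c = _mm256_load_ps(&C[i * N + j]);
                c = _mm256_fmadd_ps(a, b, c);              /* c += a * b in one FMA  */
                _mm256_store_ps(&C[i * N + j], c);
            }
        }
}

/* 32-byte alignment is what allows the aligned load/store intrinsics above. */
float *alloc_matrix(int N) {
    return (float *)aligned_alloc(32, (size_t)N * N * sizeof(float));
}
```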

| Implementation | Cache References (millions) | L1-d Cache Misses (millions) | LL Cache Misses (millions) |
| --- | --- | --- | --- |
| +parallel loops | 4934.44 | 406.47 | 404.9 |
| +tiling | 5010.46 | 620.66 | 13.29 |
| +parallel divide-and-conquer | 1881.06 | 152.97 | 5.21 |

Relative to the parallel-loops version, tiling cuts last-level cache misses by roughly 96%, and parallel divide-and-conquer further reduces both total cache references and misses at every cache level.

Performance Benchmark

The following table compares the GFLOPS achieved by each kernel version with Intel MKL at a matrix size of 1200x1200.

| Version | Implementation | Running Time (ms) | Relative Speedup | Absolute Speedup | GFLOPS | Percent of Peak | Percent of Intel MKL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| v1 | naive | 11190.93 | 1.00 | 1.00 | 0.19 | 0.19% | 0.25% |
| v2 | naive + interchange loops | 4267.47 | 2.62 | 2.62 | 0.50 | 0.49% | 0.65% |
| v3 | naive + interchange loops + optimization flags | 675.76 | 6.32 | 16.56 | 3.18 | 3.10% | 4.08% |
| v4 | naive + interchange loops + optimization flags + parallel loops | 147.87 | 4.57 | 75.68 | 14.52 | 14.18% | 18.62% |
| v5 | naive + interchange loops + optimization flags + parallel tiling | 101.3 | 1.46 | 110.47 | 21.20 | 20.70% | 27.19% |
| v6 | naive + interchange loops + optimization flags + parallel divide-and-conquer | 89.52 | 1.13 | 125.01 | 23.99 | 23.43% | 30.76% |
| v7 | naive + interchange loops + optimization flags + parallel divide-and-conquer + avx2 intrinsics + data alignment | 71.11 | 1.26 | 157.37 | 30.20 | 29.49% | 38.73% |
| v8 | naive + interchange loops + optimization flags + parallel tiling + avx2 intrinsics + data alignment | 62.41 | 1.14 | 179.31 | 34.41 | 33.60% | 44.13% |
| v9 | naive + interchange loops + optimization flags + parallel divide-and-conquer + avx2 intrinsics + data alignment + coarsening | 43.62 | 1.43 | 256.56 | 49.23 | 48.08% | 63.14% |
| v10 | Intel MKL | 27.54 | 1.58 | 406.35 | 77.98 | 76.15% | 100.00% |
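
In this table, relative speedup compares each row with the row above it, while absolute speedup compares with the naive baseline; for example, the optimization-flags kernel at 675.76 ms yields 4267.47 / 675.76 ≈ 6.32 relative and 11190.93 / 675.76 ≈ 16.56 absolute.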

[Benchmark chart: GFLOPS of versions v1–v10]

v1 denotes the naive implementation, while v2 through v10 sequentially correspond to the remaining rows of the table above.
