Winograd based convolution kernels for GPUs written in CUDA C++.

Sha-x2-nk/WinogradConvolution-CUDA


Non-Fused Winograd Convolution in CUDA C++

This project implements non-fused Winograd convolution for two configurations, F(4x4, 3x3) and F(2x2, 3x3), where the first term is the output tile size and the second the filter size. The implementation is written in CUDA C++ and supports padding; stride is not supported in this version. The project also includes autotuning to optimize launch hyperparameters for performance.

Features

  • Winograd Convolution: Implements Winograd convolution for 4x4 and 2x2 output tile sizes with 3x3 filters.
  • Padding Support: Includes support for padding the input.
  • 4 Phases of Computation:
    1. Filter Transform
    2. Input Transform
    3. Batched GEMM for Hadamard Product
    4. Inverse Transform

Comparison Against cuDNN

See the comparison in BENCHMARKS.md.

Usage

While the 4x4 variant is faster, it loses precision, and the error grows as the problem size increases.

NVIDIA cuDNN's Winograd implementation somehow achieves higher precision (we believe it also uses a 4x4_3x3 variant).

Convolution rarely demands such high precision, but if you do need it, use the 2x2 implementation.

To use this implementation, follow the syntax and steps outlined below:

Syntax

#include "winograd_4x4_3x3.cuh"
#include "winograd_2x2_3x3.cuh"

const int N = 32,
          C = 128,
          H = 112,
          W = 112,
          K = 128,
          padding = 1;

float *img = new float[N * C * H * W];
float *filter = new float[K * C * 3 * 3];
// ... fill img and filter ...

float *out2 = convWinograd_2x2_3x3(img, N, C, H, W, filter, K, padding);
float *out4 = convWinograd_4x4_3x3(img, N, C, H, W, filter, K, padding);
// out2, out4: [N, K, H, W]

delete[] img; delete[] filter;
free(out2); free(out4);
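
To quantify the precision gap mentioned above, the two outputs can be compared element-wise. This helper is illustrative and not part of the repo:

```cpp
#include <cmath>
#include <cstddef>

// Maximum absolute element-wise difference between two output buffers,
// e.g. the 2x2 (more precise) vs 4x4 (faster) results over N*K*H*W floats.
float maxAbsDiff(const float* a, const float* b, std::size_t n) {
    float m = 0.f;
    for (std::size_t i = 0; i < n; ++i)
        m = std::fmax(m, std::fabs(a[i] - b[i]));
    return m;
}
```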

Phases

The implementation of each phase is described in PHASES.md.

Autotuning

We autotuned the launch configurations for our hardware (an RTX 3070 Ti). Other devices may need another round of autotuning to reach comparable performance.
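
The tuning loop amounts to timing each candidate launch configuration and keeping the fastest. Below is a hedged host-side sketch of that search (the `Config` fields and the callback are illustrative assumptions, not the repo's actual tuner); a real tuner would time CUDA kernels with `cudaEvent_t` and average several runs.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// A candidate launch configuration (illustrative fields).
struct Config { int blockDimX, blockDimY; };

// Run each candidate once, time it, and return the fastest.
Config autotune(const std::vector<Config>& candidates,
                const std::function<void(const Config&)>& run) {
    Config best = candidates.front();
    double bestMs = 1e30;
    for (const Config& c : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        run(c);  // in real code: launch the kernel, then cudaDeviceSynchronize()
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < bestMs) { bestMs = ms; best = c; }
    }
    return best;
}
```

Usage would look like `autotune({{16,16}, {32,8}, {8,32}}, [&](const Config& c) { /* launch with c */ });`.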

Building and Running

To build the project, ensure you have CUDA installed and properly configured. Use CMake to build the project.

mkdir build
cd build
cmake ..
cmake --build .

Run the binary (main) from the bin folder.

Acknowledgements
