This project implements non-fused Winograd convolution for two configurations: 4x4 and 2x2 output tiles, both with 3x3 filters. The implementation is written in CUDA C++ and supports padding; strided convolution is not supported in this version. The project also includes autotuning to optimize launch hyperparameters for performance.
- Winograd Convolution: Implements Winograd convolution for 4x4 and 2x2 output tile sizes with 3x3 filters.
- Padding Support: Includes support for padding the input.
- 4 Phases of Computation:
  - Filter Transform
  - Input Transform
  - Batched GEMM for Hadamard Product
  - Inverse Transform
See the comparison in BENCHMARKS.md.
While the 4x4 implementation is faster, it is less precise, and the error grows as the layer dimensions increase.
NVIDIA cuDNN's Winograd kernels achieve higher precision, even though we believe they also use a 4x4_3x3 implementation.
Convolution workloads rarely demand high precision, but if you do need it, use the 2x2 implementation.
To use this implementation, follow the steps outlined below:

```cpp
#include "winograd_4x4_3x3.cuh"
#include "winograd_2x2_3x3.cuh"

const int N = 32,
          C = 128,
          H = 112,
          W = 112,
          K = 128,
          padding = 1;

float *img = new float[N * C * H * W];
float *filter = new float[K * C * 3 * 3];

// Pick one of the two implementations:
float *out = convWinograd_2x2_3x3(img, N, C, H, W, filter, K, padding);
// float *out = convWinograd_4x4_3x3(img, N, C, H, W, filter, K, padding);

// out has shape [N, K, H, W]
delete[] img;
delete[] filter;
free(out);
```
Implementation details for each phase are described in PHASES.md.
We autotuned the launch configurations for our hardware (RTX 3070 Ti). Other devices may need another round of autotuning to reach comparable performance.
To build the project, ensure you have CUDA installed and properly configured, then build with CMake:

```shell
mkdir build
cd build
cmake ..
cmake --build .
```

Run the resulting binary (main) from the bin folder.
- NVIDIA for their CUDA framework.
- Fast Algorithms for Convolutional Neural Networks (Lavin & Gray, 2016), the paper this implementation is based on.