GitHub Repo for CUDA Course on FreeCodeCamp
Note: This course is designed for Ubuntu Linux. Windows users can use Windows Subsystem for Linux or Docker containers to simulate the Ubuntu Linux environment.
- The Deep Learning Ecosystem
- Setup/Installation
- C/C++ Review
- Gentle Intro to GPUs
- Writing Your First Kernels
- CUDA APIs (cuBLAS, cuDNN, etc.)
- Optimizing Matrix Multiplication
- Triton
- PyTorch Extensions (CUDA)
- Final Project
- Extras
This course aims to:
- Lower the barrier to entry for HPC jobs
- Provide a foundation for understanding projects like Karpathy's llm.c
- Consolidate scattered CUDA programming resources into a comprehensive, organized course
- Focus on GPU kernel optimization for performance improvement
- Cover CUDA, PyTorch, and Triton
- Emphasize the technical details of writing faster kernels
- Target NVIDIA GPUs
- Culminate in a simple MLP MNIST project in CUDA
Prerequisites:
- Python programming (required)
- Basic differentiation and vector calculus for backprop (recommended)
- Linear algebra fundamentals (recommended)
What you will learn:
- Optimizing existing implementations
- Building CUDA kernels for cutting-edge research
- Understanding GPU performance bottlenecks, especially memory bandwidth
Hardware requirements:
- Any NVIDIA GTX, RTX, or datacenter level GPU
- Cloud GPU options available for those without local hardware
Use cases for GPU programming:
- Deep Learning (primary focus of this course)
- Graphics and Ray-tracing
- Fluid Simulation
- Video Editing
- Crypto Mining
- 3D modeling
- Anything that requires parallel processing with large arrays
Resources:
- GitHub repo (this repository)
- Stack Overflow
- NVIDIA Developer Forums
- NVIDIA and PyTorch documentation
- LLMs for navigating the space
- Cheatsheet here
- https://github.com/CoffeeBeforeArch/cuda_programming
- https://www.youtube.com/@GPUMODE
- https://discord.com/invite/gpumode
- How do GPUs work? Exploring GPU Architecture
- But how do GPUs actually work?
- Getting Started With CUDA for Python Programmers
- Transformers Explained From The Atom Up
- How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA
- Parallel Computing with Nvidia CUDA - NeuralNine
- CPU vs GPU vs TPU vs DPU vs QPU
- Nvidia CUDA in 100 Seconds
- How AI Discovered a Faster Matrix Multiplication Algorithm
- The fastest matrix multiplication algorithm
- From Scratch: Cache Tiled Matrix Multiplication in CUDA
- From Scratch: Matrix Multiplication in CUDA
- Intro to GPU Programming
- CUDA Programming
- Intro to CUDA (part 1): High Level Concepts
- Intro to GPU Hardware