Useful resources for learning CUDA and its Python variants.
Numba provides a framework for quickly adding GPU acceleration to Python code. It does not support the full range of CUDA functionality, but it is considerably simpler to use than PyCUDA. Kernels are written directly in Python (technically the restricted CUDA-Python dialect) and compiled by Numba to PTX code. http://numba.pydata.org/numba-doc/0.13/CUDAJit.html
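As a minimal sketch of the Numba workflow (the kernel name, array sizes, and launch configuration below are arbitrary choices for illustration):

```python
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # Global thread index (block index * block size + thread index).
    if i < out.size:          # Guard: the last block may have more threads than elements.
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.arange(n, dtype=np.float32)
b = 2.0 * a
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # Numba copies the host arrays to the device for us.
```

Numba transfers the NumPy arrays to the device automatically here; for repeated launches, cuda.to_device can be used to manage the copies explicitly.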
The PyCUDA framework exposes essentially the full CUDA API; however, this generally requires writing the kernels themselves in CUDA C, embedded as strings in the Python host code. https://documen.tician.de/pycuda/
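For comparison, a minimal PyCUDA sketch of the same style of element-wise kernel; the kernel body is ordinary CUDA C compiled at run time by SourceModule (the kernel name and sizes are again arbitrary example values):

```python
import numpy as np
import pycuda.autoinit                     # Initializes a CUDA context on import.
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# The kernel body is plain CUDA C, compiled at run time.
mod = SourceModule("""
__global__ void scale(float *x, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= factor;
}
""")
scale = mod.get_function("scale")

x = np.arange(1024, dtype=np.float32)
scale(drv.InOut(x), np.float32(2.0), np.int32(x.size),
      block=(256, 1, 1), grid=(4, 1))
# drv.InOut copies x to the device, runs the kernel, and copies the result back.
```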
NVIDIA blog posts with excellent discussions of various parallel programming concepts, bottlenecks, and optimizations (efficient matrix transposes and shared-memory usage, respectively). https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/ https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
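A rough Numba sketch of the tiled, shared-memory transpose discussed in the first post (tile width and matrix size are example values; the bank-conflict padding described in the post is omitted for brevity):

```python
import numpy as np
from numba import cuda, float32

TILE = 32  # Tile width; one block handles one TILE x TILE tile.

@cuda.jit
def transpose_tiled(a, out):
    # Stage a tile of the input in shared memory so that both the global
    # read and the global write are coalesced.
    tile = cuda.shared.array(shape=(TILE, TILE), dtype=float32)

    x, y = cuda.grid(2)
    if x < a.shape[1] and y < a.shape[0]:
        tile[cuda.threadIdx.y, cuda.threadIdx.x] = a[y, x]

    cuda.syncthreads()

    # Recompute indices so the write to the transposed output is contiguous.
    x = cuda.blockIdx.y * TILE + cuda.threadIdx.x
    y = cuda.blockIdx.x * TILE + cuda.threadIdx.y
    if x < out.shape[1] and y < out.shape[0]:
        out[y, x] = tile[cuda.threadIdx.x, cuda.threadIdx.y]

a = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
out = np.zeros((64, 64), dtype=np.float32)
transpose_tiled[(2, 2), (TILE, TILE)](a, out)
```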
An excellent resource produced by NVIDIA for understanding reduction algorithms and the different sources of both acceleration and bottlenecks in a parallel processing framework. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
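A minimal Numba sketch of the shared-memory tree reduction those slides describe (the sequential-addressing variant); the block size and the host-side final combine are simplifying choices for the example:

```python
import numpy as np
from numba import cuda, float32

THREADS = 256  # Threads per block; must be a power of two for this scheme.

@cuda.jit
def block_sum(x, partial):
    # Each block reduces THREADS elements to one partial sum in shared memory.
    sdata = cuda.shared.array(shape=THREADS, dtype=float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)

    if i < x.size:
        sdata[tid] = x[i]
    else:
        sdata[tid] = 0.0
    cuda.syncthreads()

    # Halve the number of active threads each step (sequential addressing,
    # which avoids the divergence and bank conflicts of naive interleaving).
    s = THREADS // 2
    while s > 0:
        if tid < s:
            sdata[tid] += sdata[tid + s]
        cuda.syncthreads()
        s //= 2

    if tid == 0:
        partial[cuda.blockIdx.x] = sdata[0]

x = np.random.rand(1 << 20).astype(np.float32)
blocks = (x.size + THREADS - 1) // THREADS
partial = np.zeros(blocks, dtype=np.float32)
block_sum[blocks, THREADS](x, partial)
total = partial.sum()  # Final combine on the host (or relaunch the kernel on the partial sums).
```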
Lecture notes with a collection of worked CUDA C example kernels: http://people.cs.pitt.edu/~melhem/courses/xx45p/cuda_examples.pdf
Stanford CME 343 lecture slides on CUDA programming: https://mc.stanford.edu/cgi-bin/images/5/55/Darve_cme343_cuda_4.pdf