This is a work log and self-study journal on optimizing single-precision general matrix multiplication (SGEMM) on an RTX 4060 GPU. Along the way I explain some basic CUDA concepts and profiler tricks used during kernel optimization. I am not yet satisfied with the results, despite the effort I have put in.
mkdir build
cd build
cmake ..
cmake --build . --config Release
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
Beating cuBLAS in Single-Precision General Matrix Multiplication
NVIDIA CUDA C++ Programming Guide
- Kernel 8 performs worse than kernel 7, and I do not yet know why.
- Double buffering in kernel 9 is now correct, but I am not satisfied with its performance.
- For now, my kernels only handle square matrices whose dimensions are exact multiples of the tile size (no tile quantization).
- I may integrate PTX code into the CUDA code with asm().
- Warp-level matmul and Tensor Cores.
- Hopper features: TMA and asynchrony; check Hopper for details.
- Run chcp.com 65001 to avoid garbled text in Windows cmd.
- ncu (Nsight Compute) does not work well on cloud GPUs.
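On the tile quantization note above: a minimal host-side sketch of how one might check whether a problem size is tile-aligned, and how much of the launched tile area is wasted when it is not. The helper names (`tiles_needed`, `is_tile_aligned`, `wasted_fraction`) and the 128x128 tile size used in the usage note are assumptions for illustration only, not the tile sizes or code of the kernels in this repo.

```cpp
#include <cassert>

// Hypothetical helpers, not part of this repo's kernels.

// Number of tiles needed to cover `n` elements with tiles of width `tile`
// (ceiling division). This is the grid dimension a launch would use.
int tiles_needed(int n, int tile) {
    assert(n > 0 && tile > 0);
    return (n + tile - 1) / tile;
}

// True when both output dimensions are exact multiples of the tile size,
// so a kernel without boundary checks stays in bounds.
bool is_tile_aligned(int m, int n, int tile_m, int tile_n) {
    return m % tile_m == 0 && n % tile_n == 0;
}

// Fraction of the launched tile area that falls outside the m x n output.
// Zero means no tile quantization.
double wasted_fraction(int m, int n, int tile_m, int tile_n) {
    long long launched = 1LL * tiles_needed(m, tile_m) * tile_m
                             * tiles_needed(n, tile_n) * tile_n;
    long long useful = 1LL * m * n;
    return static_cast<double>(launched - useful) / static_cast<double>(launched);
}
```

With an assumed 128x128 output tile, a 4096x4096 problem is aligned and wastes nothing, while 4097x4096 needs 33 tile rows instead of 32, so roughly 3% of the launched tile area is padding. A kernel that supports such sizes would also need bounds checks or padded buffers, which the current kernels do not have.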

