This repository was archived by the owner on Sep 5, 2025. It is now read-only.

InhabitancyCocoon/feather_gemm

Feather GEMM: Toward cuBLAS performance

Introduction

This is a work log and self-study journal on optimizing single-precision general matrix multiplication (SGEMM) on an RTX 4060 GPU. Along the way I explain some basic CUDA concepts and the profiler tricks I used during kernel optimization. I am not yet satisfied with the results, despite the effort I put in.

Figure: roofline model
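To make the roofline picture concrete, here is a minimal host-side C++ sketch of the arithmetic behind it. It is not taken from this repository: the function names are hypothetical, and the peak compute and bandwidth values are parameters you would fill in from your own GPU's specifications. It assumes square n x n float matrices, with A and B each read once and C written once, which gives the best-case memory traffic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Best-case arithmetic intensity (FLOP per byte) of C = A * B for square
// n x n float matrices: 2*n^3 FLOPs over three n*n matrices of 4-byte
// floats, assuming each matrix crosses the memory bus exactly once.
// (Hypothetical helper, not part of the repository.)
double sgemm_arithmetic_intensity(std::int64_t n) {
    assert(n > 0);
    const double flops = 2.0 * n * n * n;
    const double bytes = 3.0 * n * n * sizeof(float);
    return flops / bytes;  // simplifies to n / 6
}

// Roofline bound: attainable throughput is capped either by peak compute
// or by memory bandwidth times arithmetic intensity, whichever is lower.
// peak_gflops and bandwidth_gbs are caller-supplied assumptions.
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double intensity) {
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}
```

Because the intensity grows linearly with n, large SGEMM problems sit in the compute-bound region of the roofline, provided the kernel actually reuses data from shared memory and registers instead of re-reading global memory.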

Build

mkdir build
cd build
cmake ..
cmake --build . --config Release

Thanks

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

Beating cuBLAS in Single-Precision General Matrix Multiplication

NVIDIA CUDA C++ Programming Guide

NVIDIA Nsight Compute (ncu) documentation

Professional CUDA C Programming

Triton documentation

Efficient GEMM

GPU MODE

TODO

  • Kernel 8 performs worse than kernel 7, and I have not yet found out why.

  • Double buffering in kernel 9 is now correct, but I am not satisfied with its performance.

  • For now, my kernels only handle square matrices with no tile quantization, i.e. dimensions that are exact multiples of the tile size.

  • Maybe I will integrate PTX code into the CUDA code with asm().

  • Warp-level matmul and Tensor Cores.

  • Hopper features: TMA and asynchrony; check Hopper for details.
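The tile quantization item above can be illustrated with a small host-side C++ sketch. It is an assumption-laden illustration, not code from this repository: `ceil_div` and `tile_utilization` are hypothetical helpers, and the tile sizes are whatever a kernel happens to use. The idea is that a grid must launch whole tiles, so any dimension that is not a multiple of the tile size pays for padded work.

```cpp
#include <cassert>
#include <cstdint>

// Number of tiles needed to cover `extent` elements (ceiling division).
// (Hypothetical helper, not part of the repository.)
std::int64_t ceil_div(std::int64_t extent, std::int64_t tile) {
    assert(extent > 0 && tile > 0);
    return (extent + tile - 1) / tile;
}

// Fraction of the launched tile area that maps to real output elements
// of an m x n result. A value of 1.0 means no tile quantization.
double tile_utilization(std::int64_t m, std::int64_t n,
                        std::int64_t tile_m, std::int64_t tile_n) {
    const std::int64_t covered_rows = ceil_div(m, tile_m) * tile_m;
    const std::int64_t covered_cols = ceil_div(n, tile_n) * tile_n;
    return static_cast<double>(m * n) /
           static_cast<double>(covered_rows * covered_cols);
}
```

For example, a 4096 x 4096 output with 128 x 128 tiles is covered exactly, while a 129 x 128 output needs two tile rows and uses only about half of the launched work. A kernel that supports such shapes also needs bounds checks on its loads and stores, which the current kernels skip.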

TIPS

  • Run chcp.com 65001 to switch the Windows cmd console to UTF-8 and avoid garbled output.

  • ncu (Nsight Compute) does not work well on cloud GPUs.

About

Optimizing matmul to get as close as possible to cuBLAS performance.
