References

Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU { Matsumoto et al. (2012) }
A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems { Garg and Hendren (2014) }
CLTune: A Generic Auto-Tuner for OpenCL Kernels { Nugteren and Codreanu (2015) }
CLBlast: A Tuned OpenCL BLAS Library { git and arXiv }
Accelerating GPU kernels for dense linear algebra { Nath et al. (2011) }
Accelerating cuBLAS/cuDNN using Input-Aware Auto-Tuning: The ISAAC library { code and slides }
A three-dimensional approach to parallel matrix multiplication { Agarwal et al. (1995) }
Matrix multiplication beyond auto-tuning: Rewrite-based GPU code generation { Steuwer et al. (2016) }

More coming soon.
